Abstract

Scientific literature analysis needs fine-grained named entity recognition (NER) to provide a wide range of information for scientific discovery. For example, chemistry research needs to study dozens to hundreds of distinct, fine-grained entity types, making consistent and accurate annotation difficult even for crowds of domain experts. On the other hand, domain-specific ontologies and knowledge bases (KBs) can be easily accessed, constructed, or integrated, which makes distant supervision realistic for fine-grained chemistry NER. In distant supervision, training labels are generated by matching mentions in a document with the concepts in the KBs. However, this kind of KB-matching suffers from two major challenges: incomplete annotation and noisy annotation. To tackle these challenges, we propose ChemNER, an ontology-guided, distantly-supervised method for fine-grained chemistry NER. It leverages the structure of the chemistry type ontology to generate distant labels with novel methods of flexible KB-matching and ontology-guided multi-type disambiguation, which significantly improves distant label generation for the subsequent training of a sequence labeling model. We also provide an expert-labeled chemistry NER dataset with 62 fine-grained chemistry types (e.g., chemical compounds and chemical reactions). Experimental results show that ChemNER is highly effective, substantially outperforming state-of-the-art NER methods (with a 0.25 absolute F1 score improvement).

Highlights

  • Domain-specific ontologies and knowledge bases (KBs) can be accessed, constructed, or integrated, which makes distant supervision realistic for fine-grained chemistry NER

  • We provide an expert-labeled chemistry NER dataset with 62 fine-grained chemistry types

  • Given a chemistry type ontology and associated entity dictionaries collected from the KBs, we develop a novel flexible KB-matching method with TF-IDF-based ...


Summary

Introduction

Named entity recognition (NER) is a fundamental step in scientific literature analysis to build AI-driven systems for molecular discovery, synthetic strategy design, and manufacturing (Xie [...]

[...] (1) incomplete annotation, where a mention in a document can be matched only partially or missed completely due to incomplete coverage of the KBs (Figure 1a), and (2) noisy annotation, where a mention can be erroneously matched due to the potential matching of multiple entity types in the KBs. [...] (e.g., nested naming structures and long chemical formulas) of chemical entities, these challenges [...]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5227–5240, November 7–11, 2021. © 2021 Association for Computational Linguistics

[Figure 1: (a) Incomplete Annotation; (b) Noisy Annotation. Example sentence in panel (a): "With these chiral nucleophiles, Suzuki-Miyaura cross-coupling reactions were carried out with various aryl- and hetaryl chlorides in good to excellent yields." Type labels shown in the figure include INORGANIC PHOSPHORUS COMPOUNDS, COUPLING REACTIONS, CHLORIDES, OXOACIDS, CATALYSTS, TRANSITION METALS, REACTIVE_INTERMEDIATES, and FUNCTIONAL GROUPS.]

[...] annotation problem. We develop a novel ontology-guided multi-type disambiguation method to resolve the noisy annotation problem. Taking the output from the above two steps as distant supervision, we further train a sequence labeling model to cover additional entities. ChemNER significantly improves the distant label generation for the subsequent NER model training.

We provide an expert-labeled chemistry NER dataset with 62 fine-grained chemistry types (e.g., chemical compounds and chemical reactions). Experimental results show that ChemNER is highly effective, achieving substantially better performance (with a 0.25 absolute F1 score improvement) compared with the state-of-the-art NER methods. We have released our data and code to benefit future studies.
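To make the two KB-matching challenges concrete, the following is a minimal sketch of plain dictionary-based distant labeling. It is not ChemNER's flexible KB-matching or its disambiguation method; it is an assumed greedy longest-match baseline. The toy dictionary `KB`, the function name `distant_labels`, and the example sentence (adapted from the Figure 1 sentence, with a hypothetical "using palladium" ending) are all illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionary: lowercased token sequence -> candidate entity types.
# The entries are illustrative only and are not taken from the paper's KBs.
KB: Dict[Tuple[str, ...], Set[str]] = {
    ("suzuki-miyaura", "cross-coupling", "reactions"): {"COUPLING REACTIONS"},
    ("chlorides",): {"CHLORIDES"},
    ("palladium",): {"CATALYSTS", "TRANSITION METALS"},
}
MAX_LEN = max(len(key) for key in KB)


def distant_labels(tokens: List[str]) -> List[Tuple[int, int, Set[str]]]:
    """Greedy longest-match lookup; returns (start, end, types), end exclusive."""
    lowered = [t.lower() for t in tokens]
    spans: List[Tuple[int, int, Set[str]]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate window first, then shrink it.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in KB:
                spans.append((i, i + n, KB[key]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return spans


tokens = ("Suzuki-Miyaura cross-coupling reactions were carried out "
          "with various aryl chlorides using palladium").split()
spans = distant_labels(tokens)

# Incomplete annotation: only "chlorides" is matched, not "aryl chlorides".
# Noisy annotation: "palladium" comes back with more than one candidate type.
ambiguous = [(tokens[s], sorted(types)) for s, e, types in spans if len(types) > 1]
```

In this sketch, a partially matched mention ("aryl chlorides" labeled only at "chlorides") corresponds to the incomplete annotation case, and a span with several candidate types corresponds to the noisy annotation case that a disambiguation step would have to resolve.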

