Abstract
In this work, we explore how to perform named entity recognition (NER) using only unlabeled data and named entity dictionaries. To this end, we formulate the task as a positive-unlabeled (PU) learning problem and propose a novel PU learning algorithm to perform it. We prove that the proposed algorithm can unbiasedly and consistently estimate the task loss as if fully labeled data were available. A key feature of the proposed method is that it does not require the dictionaries to label every entity within a sentence, and it does not even require the dictionaries to label all of the words constituting an entity. This greatly reduces the requirement on the quality of the dictionaries and lets our method generalize well with quite simple dictionaries. Empirical studies on four public NER datasets demonstrate the effectiveness of our proposed method. We have published the source code at https://github.com/v-mipeng/LexiconNER.
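To make the dictionary-labeling setting concrete, the sketch below marks tokens with a greedy longest-match lookup. This is a minimal illustration under stated assumptions, not necessarily the authors' matching procedure: the function name `label_with_dictionary`, the `max_len` parameter, and the toy dictionary entries are all hypothetical. A label of 1 means "matched by the dictionary" (positive), while 0 means "unlabeled", not "non-entity".

```python
from typing import List, Set, Tuple


def label_with_dictionary(tokens: List[str],
                          dictionary: Set[Tuple[str, ...]],
                          max_len: int = 4) -> List[int]:
    """Return 1 for tokens covered by a dictionary match, 0 for unlabeled tokens."""
    labels = [0] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first (greedy longest match).
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                labels[j] = 1
            i += matched
        else:
            i += 1
    return labels


# Toy person-name dictionary (hypothetical entries, lowercased token tuples).
person_dict = {("barack", "obama"), ("merkel",)}
tokens = "Barack Obama met Angela Merkel in Berlin".split()
labels = label_with_dictionary(tokens, person_dict)
```

In this toy sentence, "Angela" stays unlabeled even though it belongs to an entity, which is the kind of incomplete dictionary coverage the method is designed to tolerate.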
Highlights
Named Entity Recognition (NER) is concerned with identifying named entities, such as person, location, product, and organization names, in unstructured text.
We explore how to perform named entity recognition (NER) using only unlabeled data and named entity dictionaries, which are easier to obtain than labeled data.
We evaluate the effectiveness of our proposed method on four NER datasets.
Summary
Named Entity Recognition (NER) is concerned with identifying named entities, such as person, location, product, and organization names, in unstructured text. When a dictionary is used to label the data, we obtain only some entity words and a large amount of unlabeled data comprising both entity and non-entity words. In this case, conventional supervised or semi-supervised learning algorithms are not suitable, since they usually require labeled data of all classes. Because the words labeled by the dictionary cover only part of the entities, they cannot fully reveal the data distribution of entity words. To deal with this problem, we propose an adapted method, motivated by the AdaSampling algorithm (Yang et al., 2017), to enrich the dictionary.
Contributions of this work can be summarized as follows:
1) We propose a novel PU learning algorithm to perform the NER task using only unlabeled data and named entity dictionaries.
2) We prove that the proposed algorithm can unbiasedly and consistently estimate the task loss as if fully labeled data were available, under the assumption that the entities found by the dictionary reveal the distribution of entities.
3) To make the above assumption hold as far as possible, we propose an adapted method, motivated by the AdaSampling algorithm, to enrich the dictionary.
4) We empirically demonstrate the effectiveness of our proposed method with extensive experimental studies on four NER datasets.
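The idea of estimating the task loss from positive and unlabeled data alone can be illustrated with the standard unbiased PU risk estimator from the PU learning literature. The sketch below is an assumption-laden illustration rather than the authors' exact formulation: the class prior `prior` (the fraction of entity words) is assumed to be known, the bounded absolute loss is one possible choice, and all function names are hypothetical.

```python
from typing import Callable, Sequence


def abs_loss(score: float, target: int) -> float:
    """Bounded loss |target - score| for classifier scores in [0, 1]."""
    return abs(target - score)


def mean_loss(scores: Sequence[float], target: int,
              loss: Callable[[float, int], float]) -> float:
    """Average loss of the scores against a single fixed target label."""
    return sum(loss(s, target) for s in scores) / len(scores)


def pu_risk(pos_scores: Sequence[float], unl_scores: Sequence[float],
            prior: float,
            loss: Callable[[float, int], float] = abs_loss) -> float:
    """Standard PU estimate: pi*E_p[l(f,1)] + E_u[l(f,0)] - pi*E_p[l(f,0)]."""
    positive_part = prior * mean_loss(pos_scores, 1, loss)
    # The negative-class risk is recovered by subtracting the positives'
    # contribution from the loss of treating all unlabeled data as negative.
    negative_part = (mean_loss(unl_scores, 0, loss)
                     - prior * mean_loss(pos_scores, 0, loss))
    return positive_part + negative_part


# Toy scores (hypothetical): one dictionary-labeled positive, and an
# unlabeled pool that secretly holds 1 positive and 3 negatives.
pos_scores = [0.9]
unl_scores = [0.9, 0.1, 0.1, 0.1]
hidden_labels = [1, 0, 0, 0]  # used only to compute the reference risk

estimated = pu_risk(pos_scores, unl_scores, prior=0.25)
fully_labeled = sum(abs_loss(s, y)
                    for s, y in zip(unl_scores, hidden_labels)) / len(unl_scores)
```

In this toy case the labeled positive has the same score as the hidden positive in the unlabeled pool, so `estimated` matches `fully_labeled` up to floating-point error. That mirrors the assumption stated in contribution 2: the estimate is only as good as the dictionary-labeled entities are representative of all entities.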