Abstract

Information extraction is a crucial technology for constructing knowledge bases. In this paper, we propose a novel model to extract entities and relations from plain text. The model consists of two components: a joint network and a reinforcement learning (RL) agent. The joint network is designed end to end and extracts entities and relations simultaneously. It adopts a new tagging scheme, so that entity and relation extraction can be modeled as a joint sequence tagging problem. To make the model more robust, we also introduce an RL agent that removes noisy data from the training dataset. The agent decides whether each candidate instance should be removed from the training dataset. When the agent completes a selection pass, the training dataset is divided into two parts: clean data and noisy data. The joint network is then retrained on the clean data to produce a better model. To assess the validity of the proposed model, we conducted extensive experiments on the New York Times datasets (NYT10 and NYT11). The experimental results show that the proposed model outperforms the baselines, achieving F1 scores of 0.612 on NYT10 and 0.549 on NYT11.
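The selection step described in the abstract, where each candidate instance is either kept or removed and the training set ends up split into clean and noisy parts, can be sketched as follows. This is only an illustration of the partitioning logic: the function name `partition` and the `keep` callable are hypothetical, and `keep` stands in for the RL agent's decision, whose actual policy is not specified here.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def partition(
    instances: Iterable[T],
    keep: Callable[[T], bool],
) -> Tuple[List[T], List[T]]:
    """Split training instances into (clean, noisy) lists.

    `keep` is a hypothetical stand-in for the agent's decision:
    it returns True to retain an instance and False to remove it.
    """
    clean: List[T] = []
    noisy: List[T] = []
    for inst in instances:
        if keep(inst):
            clean.append(inst)
        else:
            noisy.append(inst)
    return clean, noisy
```

Under this sketch, the joint network would be retrained on the first returned list only; the second list is kept so that removed instances can still be inspected.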

Highlights

  • Information extraction is a fundamental task in natural language processing (NLP), which can facilitate many other tasks, including knowledge base construction, question answering, and automatic text summarization

  • We evaluated the model on the New York Times (NYT) corpus, which was built by distant supervision and contains noisy data

  • The novel model is composed of two parts: the joint network and the reinforcement learning agent

Summary

Introduction

Information extraction is a fundamental task in natural language processing (NLP) that facilitates many other tasks, including knowledge base construction, question answering, and automatic text summarization. The goal of this task is to extract triplets (e1, R, e2) from unstructured text, where e1 is the source entity, e2 is the object entity, and R is the semantic relation between e1 and e2. Supervised methods train statistical and neural models for entity and relation extraction, but they require large human-annotated datasets. Datasets built by distant supervision, such as the NYT corpus, avoid this annotation cost but introduce noisy data, which can be divided into two types: (1) the entity pair mentioned in a sentence does not express the relation type recorded for those entities in the knowledge base (KB); (2) the target entity pair does not express any relation type in the sentence.
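As a minimal sketch, the triplet (e1, R, e2) and the two noise types can be written down as a small data structure and a labeling function. The names `Triplet` and `noise_type` are hypothetical, and `sentence_relation` is assumed to be the relation a human reader would assign to the entity pair in a given sentence (`None` if the sentence expresses no relation). The example values in the usage note are illustrative and are not taken from the NYT data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triplet:
    """A relation triplet (e1, R, e2)."""
    e1: str        # source entity
    relation: str  # semantic relation R between e1 and e2
    e2: str        # object entity


def noise_type(kb_triplet: Triplet, sentence_relation: Optional[str]) -> str:
    """Classify a distantly supervised label against the sentence itself.

    `sentence_relation` is an assumed gold annotation, not a model output.
    """
    if sentence_relation is None:
        # Type (2): the entity pair expresses no relation in the sentence.
        return "no_relation"
    if sentence_relation != kb_triplet.relation:
        # Type (1): the sentence expresses a different relation than the KB.
        return "wrong_relation"
    return "clean"
```

For instance, with a KB triplet `Triplet("Barack Obama", "born_in", "Hawaii")`, a sentence that only says he visited Hawaii would be labeled `"wrong_relation"`, and a sentence that merely lists both names would be labeled `"no_relation"`.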

