Abstract

Most previous work on relation extraction between named entities is limited to extracting pre-defined relation types, which is inefficient for massive unlabeled text data. Recently, with the appearance of various distributional word representations, unsupervised methods for many natural language processing (NLP) tasks have been widely researched. In this paper, we focus on a new task in unsupervised relation extraction, which we call distributional relation representation. Without requiring pre-defined types, distributional relation representation aims to automatically learn entity vectors and then estimate the semantic similarity between entities. We choose global vectors (GloVe) as the base model for training entity vectors because of its excellent balance between local context and global statistics over the whole corpus. To train the model more efficiently, we improve the traditional GloVe model by approximating entity co-occurrences with the cosine similarity between entity vectors instead of the dot product. Because cosine similarity normalizes vectors to unit length, it is intuitively more reasonable and converges more easily to a local optimum. We call the improved model RGloVe. Experimental results on a massive corpus of Sina News show that our proposed model outperforms traditional global vectors. Finally, the graph database Neo4j is introduced to store the extracted relationships between named entities. The most competitive advantage of Neo4j is that it provides a highly accessible way to query both direct and indirect relationships between entities.
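The kind of direct and indirect relationship queries mentioned above can be illustrated with hedged Cypher queries. The paper does not specify its graph schema, so the `Entity` label, the `RELATED_TO` relationship type, and the property names here are assumptions for illustration only.

```cypher
// Hypothetical schema: (:Entity {name})-[:RELATED_TO {label}]->(:Entity)

// Direct relationships of one entity
MATCH (a:Entity {name: $src})-[r:RELATED_TO]->(b:Entity)
RETURN b.name AS entity, r.label AS relation;

// Indirect relationships: variable-length paths of 2 to 3 hops
MATCH path = (a:Entity {name: $src})-[:RELATED_TO*2..3]->(b:Entity)
RETURN b.name AS entity, length(path) AS hops;
```

The variable-length pattern `*2..3` is what makes indirect relationships easy to reach in Neo4j; the equivalent multi-way join in a relational store would be considerably more awkward.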

Highlights

  • With the explosive growth and easy accessibility of web documents, extracting useful nuggets from irrelevant and redundant messages becomes a cognitively demanding and time-consuming task

  • For RQ1, this paper presents an improved global vectors model called RGloVe, based on the idea of distributed representation

  • For the task of distributional relation representation, we propose an improved global vectors model called RGloVe, which trains word vectors more effectively
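The cosine-similarity modification highlighted above can be sketched as a small NumPy objective. This is an illustrative reading of the RGloVe idea, not the authors' implementation: the weighting function follows standard GloVe, and all array names and hyperparameters here are assumptions.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Standard GloVe weighting function f(x) for co-occurrence counts."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def cosine(u, v, eps=1e-8):
    """Cosine similarity; normalizing to unit vectors is the RGloVe change."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def rglove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective in which cosine similarity
    replaces the dot product of standard GloVe.

    W, W_tilde : entity and context vectors, shape (n, d)
    b, b_tilde : entity and context biases, shape (n,)
    X          : co-occurrence matrix, shape (n, n)
    """
    total = 0.0
    for i, j in zip(*np.nonzero(X)):
        pred = cosine(W[i], W_tilde[j]) + b[i] + b_tilde[j]
        total += glove_weight(X[i, j]) * (pred - np.log(X[i, j])) ** 2
    return total
```

Because the cosine term is bounded in [-1, 1], the model cannot fit `log X_ij` by inflating vector norms, which is one plausible reason the modified objective converges more easily.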



Introduction

With the explosive growth and easy accessibility of web documents, extracting useful nuggets from irrelevant and redundant messages becomes a cognitively demanding and time-consuming task. Under this circumstance, information extraction was proposed to extract structured data from text documents. The automatic content extraction (ACE) program [1] provides an annotated corpus and evaluation criteria for a series of information extraction tasks. Traditional relation extraction is often limited to extracting pre-defined types. ACE 2003 defines five relation types: AT (location relationships), NEAR (relative locations), PART (part-whole relationships), ROLE (the role a person plays in an organization) and SOCIAL (social relationships).

Algorithms 2017, 10, 42; doi:10.3390/a10020042 (www.mdpi.com/journal/algorithms)

