Abstract

Fact extraction, which aims to extract (entity, attribute, value)-tuples from massive text corpora, is crucial in the area of text data mining. Recent approaches have focused on extracting facts by mining textual patterns with semantic types, where the quality of a pattern is evaluated based on content-based criteria, such as frequency. However, these approaches overlook the dimension of pattern reliability, which reflects how likely the extracted facts are correct. As a result, a pattern of good content-quality (e.g., high frequency) may still extract incorrect facts. In this study, we consider both pattern reliability and fact trustworthiness in addressing the pattern-based fact extraction problem. To learn the complex relationship between pattern reliability and fact trustworthiness, we propose a novel deep learning model using a hybrid of the CNN and LSTM architecture. For fact embedding, we adopt CNN to extract a fix-sized representation of each component, i.e., entity, attribute, and value, of the fact. For pattern embedding, we represent the pattern as a semantic composition of its extracted fact representations. To de-emphasis the noisy facts, we consider the fact trustworthiness and frequency during the process of pattern embedding, where the features of the tuple trustworthiness information are extracted by a long short-term memory (LSTM) model. To learn the pattern-fact relational dependency, we train the model with both pattern and tuple labels. Extensive experiments involving three real-world datasets demonstrated that the proposed model significantly improves the quality of the patterns and the extracted facts in the pattern-based information extraction.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.