Deep truth discovery for pattern-based fact extraction

Chen Ye, Hongzhi Wang, Wenbo Lu, Jing Gao, Guojun Dai

doi:10.1016/j.ins.2021.08.084

Abstract

Fact extraction, which aims to extract (entity, attribute, value) tuples from massive text corpora, is crucial in text data mining. Recent approaches extract facts by mining textual patterns with semantic types, where the quality of a pattern is evaluated using content-based criteria such as frequency. However, these approaches overlook pattern reliability, which reflects how likely the extracted facts are to be correct. As a result, a pattern of good content quality (e.g., high frequency) may still extract incorrect facts. In this study, we consider both pattern reliability and fact trustworthiness in addressing the pattern-based fact extraction problem. To learn the complex relationship between pattern reliability and fact trustworthiness, we propose a novel deep learning model that combines convolutional neural network (CNN) and long short-term memory (LSTM) architectures. For fact embedding, we adopt a CNN to extract a fixed-size representation of each component of the fact, i.e., entity, attribute, and value. For pattern embedding, we represent the pattern as a semantic composition of the representations of its extracted facts. To de-emphasize noisy facts, we consider fact trustworthiness and frequency during pattern embedding, where the features of the tuple trustworthiness information are extracted by an LSTM model. To learn the pattern-fact relational dependency, we train the model with both pattern and tuple labels. Extensive experiments on three real-world datasets demonstrate that the proposed model significantly improves the quality of the patterns and the extracted facts in pattern-based information extraction.

Full Text