Entity Extraction from Unstructured Data on the Web

Tan Dat Huynh

doi:10.14264/uql.2015.9

Abstract

A large number of web pages present information about entities in lists that are represented in textual form. Such textual lists contain implicit records of entities, but the field values of these records cannot easily be separated or extracted by automatic processes, so extracting them remains a challenging research problem. Previous studies relied mainly on probabilistic graph-based models to capture the attributes and the likely structures of implicit records in a list. An important limitation of these methods is that the structures of the records in input lists were implicitly encoded via manually created training data.

This thesis investigates novel techniques for automatically acquiring information about entities from implicit records embedded in textual lists on the web. It introduces a self-supervised learning framework that exploits both existing data in a knowledge base and the structural similarity between sequences in lists to build an extraction model automatically. In the proposed framework, initial labels for candidate field values are created and assigned to generate label sequences. The structure of implicit records is then captured via a graphical model in order to assign unmatched labels and rectify mismatched labels. As a result, the process of entity extraction from lists can be completely unsupervised and automated, without user intervention.

To attain that outcome, this thesis addresses three substantive research problems. Firstly, the text segments in input lists need to be assigned labels precisely, so that their statistical information can then be used to build an extraction model. However, previous studies have not fully considered both the format and the content of field values when segmenting text and assigning labels. By viewing the problem of assigning labels to text segments as a problem of membership checking in set theory, we propose a dyadic representation of the semantic relation between a text segment and an attribute, using its extensional and intensional representations. We incorporate those representations to define a novel format-enhanced labelling technique for assigning labels to text segments.

Secondly, labels of the same concept in different sequences of an input list are often located in similar positions, although the positions may vary somewhat from sequence to sequence. However, no information extraction system to date has been designed to capture the distribution of labels across positions in order to enhance extraction results. To capture the positional information of labels, we propose a proximity-based positional model, which is combined with a sequential model to improve the quality of the label-refinement phase in our framework.

Thirdly, in order to reduce dependence on the overlap between knowledge bases and input lists, we exploit structural similarity between text segments and sequences in the input lists, and devise a structure-based similarity and data shifting-alignment technique to align text segments into groups before their labels are revised by a graphical model. This technique reduces that dependence whilst maintaining high performance of the extraction model.

Initial experimental results demonstrate that our proposed techniques perform well when compared to the state-of-the-art method. We hope that the results presented in this thesis contribute to efforts on the extraction of information about entities in textual lists. They may also contribute to future research on the synthesis of information from different lists, and on the provision of reasoning capacity by which to detect new relationships between entities drawn from raw lists on the web.
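As a rough illustration of the initial labelling step described above, the following minimal sketch treats labelling as a membership check with two views of an attribute: an extensional view (a set of known values, standing in for a knowledge base) and an intensional view (a format pattern). It is not the thesis's implementation. The names `KNOWN_VALUES`, `FORMAT_PATTERNS`, `initial_label` and `label_sequence`, the toy data, and the comma-based segmentation are all assumptions made for illustration only.

```python
import re

# Hypothetical toy knowledge base: attribute -> known values (extensional view).
KNOWN_VALUES = {
    "author": {"alan turing", "grace hopper"},
    "venue": {"vldb", "sigmod"},
}

# Hypothetical format patterns: attribute -> regex (intensional view).
FORMAT_PATTERNS = {
    "year": re.compile(r"(19|20)\d{2}"),
    "pages": re.compile(r"\d+\s*-\s*\d+"),
}


def initial_label(segment):
    """Return an attribute label for a text segment, or None if unmatched."""
    text = segment.strip().lower()
    # Extensional check: is the segment a known value of some attribute?
    for attribute, values in KNOWN_VALUES.items():
        if text in values:
            return attribute
    # Intensional check: does the whole segment fit an attribute's format?
    for attribute, pattern in FORMAT_PATTERNS.items():
        if pattern.fullmatch(text):
            return attribute
    return None  # left for a later refinement step to resolve


def label_sequence(line, delimiter=","):
    """Split one list line into segments and attach initial labels."""
    return [(seg.strip(), initial_label(seg)) for seg in line.split(delimiter)]


if __name__ == "__main__":
    for pair in label_sequence("Alan Turing, Some Title, VLDB, 1999, 12-34"):
        print(pair)
```

Segments that match neither view keep the label `None`. In the framework summarised in the abstract, such unmatched labels are what the later graphical-model refinement phase is meant to assign; that phase is not sketched here.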
