Abstract

Information Extraction (IE) is a challenging task in natural language processing. The goal of relation extraction is to discover the relevant segments of information in large numbers of textual documents so that they can be used for structuring data. IE aims at discovering various semantic relations in natural language text and has a wide range of applications such as question answering, information retrieval, and knowledge representation, among others. This thesis proposes approaches to relation extraction with clause-based Open Information Extraction that use linguistic knowledge to capture a variety of information, including semantic concepts, words, POS tags, shallow and full syntax, and dependency parses, in rich syntactic and semantic structures.

Within the plethora of Open Information Extraction systems that focus on syntactic and dependency parsing for detecting relations, incoherent and uninformative extractions can still be found: the extracted relations can be erroneous at times and fail to have a meaningful interpretation. As such, we first propose refinements to the grammatical structure of syntactic and dependency parsing with clause structures and clause types, in an effort to generate propositions that can be deemed meaningful extractable relations. Second, considering that choosing the most efficient seeds is pivotal to the success of the bootstrapping process when extracting relations, we propose an extended clause-based pattern extraction method with self-training for unsupervised relation extraction. The proposed self-training algorithm relies on the clause-based approach to extract a small set of seed instances in order to identify and derive new patterns. Third, we employ matrix factorization and collaborative filtering for relation extraction.

To avoid the need for manually predefined schemas, we employ the notion of universal schemas, formed as a collection of patterns derived from Open Information Extraction tools as well as from relation schemas of pre-existing datasets. While previous systems have trained relations only for entities, we exploit advanced features from relation characteristics, such as clause types and semantic topics, for predicting new relation instances. Finally, we present an event network representation for temporal and causal event relation extraction that benefits from existing Open IE systems to generate a set of triple relations that are then used to build an event network. The event network is bootstrapped by labeling the temporal and causal disposition of events that are directly linked to each other, and it can be systematically traversed to identify temporal and causal relations between indirectly connected events.
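The universal-schema idea above can be illustrated as low-rank factorization of an (entity pair × relation pattern) matrix, where an unobserved cell is scored from the learned embeddings. The following is a minimal toy sketch in plain NumPy; the matrix, dimensions, and training settings are illustrative assumptions, not the thesis's actual model.

```python
import numpy as np

# Toy universal-schema matrix: rows are entity pairs, columns are relation
# patterns (Open IE surface patterns plus schema relations). A 1 means the
# pair was observed with that pattern. Cell (2, 1) is the query cell we
# want to predict, so it is masked out during training.
M = np.array([
    [1.0, 1.0, 0.0],   # pair 0: patterns 0 and 1 observed together
    [1.0, 1.0, 0.0],   # pair 1: same co-occurrence profile
    [1.0, 0.0, 0.0],   # pair 2: pattern 0 observed; pattern 1 is the query
    [0.0, 0.0, 1.0],   # pair 3: only pattern 2 observed
])
mask = np.ones_like(M)
mask[2, 1] = 0.0  # hide the query cell from training

rng = np.random.default_rng(0)
rank, lr, reg = 2, 0.1, 0.01
U = rng.normal(scale=0.1, size=(M.shape[0], rank))  # entity-pair embeddings
V = rng.normal(scale=0.1, size=(M.shape[1], rank))  # relation embeddings

# Full-batch gradient descent on squared error over observed cells.
for _ in range(500):
    err = mask * (U @ V.T - M)
    U, V = U - lr * (err @ V + reg * U), V - lr * (err.T @ U + reg * V)

scores = U @ V.T
# Pair 2 shares pattern 0 with pairs 0 and 1, which also hold pattern 1,
# so the factorization scores (pair 2, pattern 1) far above (pair 3, pattern 1).
```

The design choice this illustrates is that no manually predefined schema is needed: surface patterns and database relations sit side by side as columns, and collaborative filtering fills in missing cells from co-occurrence structure.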

Highlights

  • A running theme among these techniques is the capacity to generate linguistic features based on syntactic, dependency, or shallow semantic structures of the text

  • Considering that choosing the most efficient seeds is pivotal to the success of the bootstrapping process when extracting relations, we propose an extended clause-based pattern extraction method with self-training for unsupervised relation extraction

  • We use the English clause structure and clause types in an effort to generate propositions that can be deemed as extractable relations
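The clause-typed proposition generation mentioned in the highlights can be sketched with a deliberately simplified rule: classify a POS-tagged clause as SV or SVO and emit a (subject, verb, object) proposition accordingly. The rules and tag set below are hypothetical illustrations, not the thesis's actual clause grammar.

```python
# Toy clause-typed proposition extraction (hypothetical rules): assumes one
# clause containing a verb, with simplified Universal-POS-style tags.
def extract_proposition(tagged):
    """tagged: list of (word, simplified POS tag) pairs for one clause."""
    verb_idx = next(i for i, (_, t) in enumerate(tagged) if t == "VERB")
    subject = " ".join(w for w, t in tagged[:verb_idx] if t == "NOUN")
    obj = " ".join(w for w, t in tagged[verb_idx + 1:] if t == "NOUN")
    verb = tagged[verb_idx][0]
    clause_type = "SVO" if obj else "SV"  # no object -> intransitive clause
    return clause_type, (subject, verb, obj or None)

print(extract_proposition([("Einstein", "NOUN"), ("died", "VERB")]))
# ('SV', ('Einstein', 'died', None))
```

The point of typing the clause first is that an SV clause like the one above yields a complete proposition without an object, instead of being discarded or padded into an uninformative triple.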


Summary

Introduction

A running theme among these techniques is the capacity to generate linguistic features based on syntactic, dependency, or shallow semantic structures of the text. Espousing these features, the models are subsequently trained to identify instances of entities that are related through relations. The main strategy used in supervised methods is to generate such linguistic features and, based on them, train models to identify pairs of entities that might be related through some relation and to classify them according to a predefined set of relation types. The same strategy applies to events: supervised learning methods are used to identify pairs of events that are related to each other and to classify them into pre-defined relation types. However, training on linguistic structures such as the output of syntactic and dependency parsers does not necessarily identify all possible types of event relations when they are presented in different sentences or different documents.
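The kind of shallow linguistic features described above can be made concrete with a small sketch: given a POS-tagged sentence and two candidate entity spans, emit lexical and POS features of the sort fed to a supervised relation classifier. The feature names and tag set are illustrative assumptions, not the thesis's feature template.

```python
# Illustrative shallow feature extraction for a candidate entity pair
# (hypothetical feature set). tokens/tags are parallel lists; e1 and e2
# are (start, end) index spans with e1 preceding e2.
def pair_features(tokens, tags, e1, e2):
    lo, hi = e1[1], e2[0]  # region strictly between the two entities
    return {
        "e1_head": tokens[e1[1] - 1],      # last token of first entity
        "e2_head": tokens[e2[1] - 1],      # last token of second entity
        "words_between": tokens[lo:hi],    # lexical context feature
        "pos_between": tags[lo:hi],        # shallow syntactic feature
        "distance": hi - lo,               # token distance between entities
    }

tokens = ["Marie", "Curie", "was", "born", "in", "Warsaw"]
tags = ["NOUN", "NOUN", "AUX", "VERB", "ADP", "NOUN"]
feats = pair_features(tokens, tags, (0, 2), (5, 6))
# feats["words_between"] is ["was", "born", "in"], a strong cue for a
# born-in style relation between the two entities.
```

Features like these work well within a sentence, which is precisely why they fall short for the cross-sentence and cross-document event relations noted above: there is no "between" context to extract.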

