LILLIE: Information extraction and database integration using linguistics and learning-based algorithms

Ellery Smith, Dimitris Papadopoulos, Martin Braschler, Kurt Stockinger

doi:10.1016/j.is.2021.101938

Abstract

Querying both structured and unstructured data via a single common query interface such as SQL or natural language has been a long-standing research goal. Moreover, as methods for extracting information from unstructured data become ever more powerful, the desire to integrate the output of such extraction processes with “clean”, structured data grows. We are convinced that for successful integration into databases, such extracted information in the form of “triples” needs both (1) to be of high quality and (2) to have the necessary generality to link up with varying forms of structured data. It is the combination of both these aspects, which heretofore have usually been treated in isolation, where our approach breaks new ground.

The cornerstone of our work is a novel, generic method for extracting open information triples from unstructured text, using a combination of linguistics and learning-based extraction methods, thus uniquely balancing both precision and recall. Our system, called LILLIE (LInked Linguistics and Learning-Based Information Extractor), uses dependency tree modification rules to refine triples from a high-recall learning-based engine, and combines them with syntactic triples from a high-precision engine to increase effectiveness. In addition, our system features several augmentations, which modify the generality and the degree of granularity of the output triples. Even though our focus is on addressing both quality and generality simultaneously, our new method substantially outperforms current state-of-the-art systems on the two widely-used CaRB and Re-OIE16 benchmark sets for information extraction.

We have made our code publicly available (https://github.com/OIELILLIE/LILLIE) to facilitate further research.

Highlights

It is commonly known that some 80% of enterprise data is unstructured while only some 20% is structured [1,2].
The paper is organized as follows: in Section 2 we review the related work on information extraction and entity linking for knowledge base construction; in Section 3, we give an overview of the LILLIE architecture; in Sections 4 and 5, we describe the algorithms and functions of the rule-based extractor and the learning-based extractor, respectively; in Sections 6 and 7, we show how to combine both engines and customize their output; in Section 8, we describe how to apply our triple extractor to the task of entity linking and database insertion; in Section 9, we give a detailed analysis and evaluation of all the components of our system, and compare these to the current state of the art.
Entity Linking (EL) systems are capable of resolving the lexical ambiguity of entity mentions and can be extremely useful in a plethora of natural language understanding (NLU) applications, by enriching the information extracted via Open Information Extraction (OIE) systems.

Summary

Introduction

It is commonly known that some 80% of enterprise data is unstructured while only some 20% is structured [1,2]. Several research efforts over the last years have aimed to query both structured and unstructured data via a single common query interface such as SQL or natural language [3,4]. One such approach, which we follow in our work, is to first use information extraction techniques to retrieve relevant entities (subjects and objects) and relationships. The subject “THY1” and the object “human gallbladder carcinoma” are linked to the relational database. Building such an end-to-end pipeline to enable the vision of querying structured and unstructured data via a common interface has been a long-standing research effort [6,7].
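The idea of linking an extracted triple to a relational database and then querying it with SQL can be sketched as follows. This is a minimal illustration and not LILLIE's implementation: the `Triple` type, the table names `entity` and `extracted_relation`, the exact-match `link_entity` helper, and the relation string "is associated with" are all assumptions made for the example. Only the subject "THY1" and the object "human gallbladder carcinoma" come from the text above.

```python
import sqlite3
from typing import NamedTuple


class Triple(NamedTuple):
    """A (subject, relation, object) triple; a hypothetical container type."""
    subject: str
    relation: str
    obj: str


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT, kind TEXT);
        CREATE TABLE extracted_relation (
            subject_id INTEGER REFERENCES entity(id),
            relation   TEXT,
            object_id  INTEGER REFERENCES entity(id)
        );
        INSERT INTO entity (id, name, kind) VALUES
            (1, 'THY1', 'gene'),
            (2, 'human gallbladder carcinoma', 'disease');
        """
    )
    return conn


def link_entity(conn, mention):
    """Naive entity linking: case-insensitive exact match on the entity name.

    A real entity linker would also resolve ambiguous mentions; this stand-in
    only shows where that step sits in the pipeline.
    """
    row = conn.execute(
        "SELECT id FROM entity WHERE lower(name) = lower(?)", (mention,)
    ).fetchone()
    return row[0] if row else None


def insert_triple(conn, triple):
    """Insert the triple only if both subject and object link to known entities."""
    subject_id = link_entity(conn, triple.subject)
    object_id = link_entity(conn, triple.obj)
    if subject_id is None or object_id is None:
        return False
    conn.execute(
        "INSERT INTO extracted_relation (subject_id, relation, object_id) "
        "VALUES (?, ?, ?)",
        (subject_id, triple.relation, object_id),
    )
    return True


def relations_for(conn, subject_name):
    """Query extracted relations together with the structured entity rows."""
    return conn.execute(
        """
        SELECT s.name, r.relation, o.name
        FROM extracted_relation AS r
        JOIN entity AS s ON s.id = r.subject_id
        JOIN entity AS o ON o.id = r.object_id
        WHERE s.name = ?
        """,
        (subject_name,),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    # The relation text is invented for illustration only.
    insert_triple(
        conn, Triple("THY1", "is associated with", "human gallbladder carcinoma")
    )
    print(relations_for(conn, "THY1"))
```

Once the triple is stored next to the structured rows, a single SQL join answers questions that span both sources, which is the common-interface goal described above.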

Information extraction

Entity Linking for knowledge base construction

The rule-based extractor

Architecture of LILLIE

Pre-processing

Triple extraction

The learning-based extractor

In-place coreference resolution

Parallel triple extraction

Triple refinement

Output modification

Entity linking and database integration

Entity linking

Database integration and enrichment

Experiments

Datasets

Performance of LILLIE’s triple extraction pipeline

Ablation study

Error analysis

Positive effect of triple enhancements

Database enrichment and querying

Findings

Conclusions


Journal: Information Systems	Publication Date: Nov 18, 2021
Citations: 6	License type: cc-by


