Abstract

The usefulness of automated information extraction tools in generating structured knowledge from unstructured and semi-structured machine-readable documents is limited by challenges related to the variety and intricacy of the targeted entities, the complex linguistic features of heterogeneous corpora, and the computational availability for readily scaling to large amounts of text. In this paper, we argue that the redundancy and ambiguity of subject–predicate–object (SPO) triples in open information extraction systems has to be treated as an equally important step in order to ensure the quality and preciseness of generated triples. To this end, we propose a pipeline approach for information extraction from large corpora, encompassing a series of natural language processing tasks. Our methodology consists of four steps: i. in-place coreference resolution, ii. extractive text summarization, iii. parallel triple extraction, and iv. entity enrichment and graph representation. We manifest our methodology on a large medical dataset (CORD-19), relying on state-of-the-art tools to fulfil the aforementioned steps and extract triples that are subsequently mapped to a comprehensive ontology of biomedical concepts. We evaluate the effectiveness of our information extraction method by comparing it in terms of precision, recall, and F1-score with state-of-the-art OIE engines and demonstrate its capabilities on a set of data exploration tasks.

Highlights

  • Open information extraction (OIE) systems aim at distilling structured representations of information from natural language text, usually in the form of triples or n-ary propositions

  • We evaluate the validity of our information extraction process and present a number of indicative data exploration tasks on the CORD-19 dataset, which are enabled by the structured representation of the extracted information

  • We analyzed the rationale and functioning of each preprocessing component comprising our methodology—namely, in-place coreference resolution that was applied on the raw CORD-19 corpus to replace pronouns with their original references and extractive summarization which was subsequently used to reflect the diverse topics of each article while keeping redundancy to a minimum

Read more

Summary

Introduction

Open information extraction (OIE) systems aim at distilling structured representations of information from natural language text, usually in the form of triples or n-ary propositions. Contrary to ontology-based information extraction (OBIE) systems which rely on pre-defined ontology schemas, OIE systems follow a relation-independent extraction paradigm tailored to massive and heterogeneous corpora They can play a key role in many NLP (natural language processing) applications like natural understanding and knowledge base construction by extracting phrases that indicate novel semantic relationships between entities. There are many approaches for extracting triples in the form of {subject, predicate, object} from unstructured text, there is no standardized way of efficiently generating, mapping, and representing these triples in a manner that facilitates end-user applications These limitations are primarily prevalent in larger corpora, where the high number of duplicate and/or low-quality triples as a result of topic irrelevant sentences or complex syntactic.

Objectives
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.