A Textual Information Extraction Application based on XML Data Models and a Multidimensional Natural Language Processing Pipeline Approach

Tobias Dorrn, Natalie Dambier, Achim Kuwertz, Almuth Muller

doi:10.1109/inista55318.2022.9894171

Abstract

Modern data and information systems usually hold considerable amounts of data and documents and thus provide a large amount of information. Automatically extracting domain-specific information is therefore all the more important for working effectively with such systems. When information is available as free text, machine processing can prove to be a difficult technical hurdle. State-of-the-art approaches use modern Natural Language Processing (NLP) methods to solve such tasks. In this paper, we introduce a data-driven approach that applies an XML data model to an application-specific scenario and combines different NLP methods into a multidimensional pipeline. It is important to understand how particular NLP methods can be used and what their limitations are. Individual modern NLP methods are often neither sufficient nor robust enough to solve complex information extraction tasks. We therefore examine how such problems can be alleviated or circumvented by combining different NLP methods. In contrast to categorical grammar models, all cases considered here are assumed to be available as free text. The approach presented in this paper is still a work in progress, but first evaluation results are given.

Full Text