Trainable Framework for Information Extraction, Structuring and Summarization of Unstructured Data, Using Modified NER

Partha Sarathy Banerjee, Baisakhi Chakraborty, Harsh Upadhyay, Utkarsh Anand

doi:10.1007/s11277-020-07896-w

Abstract

The World Wide Web is an ever-expanding source of data in today’s world. Millions of terabytes of data and information are added every second. In this information age, as data is generated at an exponential rate, it should be noted that most of the available information is in the form of natural language text. This raises three concerns. First, extracting information from such mammoth data forces us to consider the quality and the form of the available data. Second, the ever-increasing volume of data makes extracting useful information from it a challenging task. Third, the information must be extracted as efficiently as possible. Retrieving this information requires an ingenious way to answer any kind of query a user poses against the available unstructured data. This paper proposes a novel, trainable and integrated Natural Language Information Interpretation and Representation System (NLIIRS) that accepts any available un-annotated corpus of natural language data and performs the following tasks: it identifies the useful data, extracts relevant information in a usable, structured form (tables), summarizes the data, and structures the data in relational form. Finally, the Question and Answering (Q&A) module demonstrates the cognitive abilities of NLIIRS by answering natural language questions relevant to the text. This multispecialty system goes beyond just Q&A: it is a trainable system capable of transforming any unstructured data into structured, well-organized information, and it allows the user to ask questions in natural language. It exploits the advantages of a modified named entity recognition (NER) to bypass the time-consuming process of part-of-speech tagging while pre-processing the available corpus (data) for information extraction.
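
The pipeline described above (entity recognition without part-of-speech tagging, structuring into relational rows, and question answering over those rows) can be illustrated with a minimal Python sketch. This is not the NLIIRS implementation: the gazetteer, the year pattern, the table columns, and the functions `tag_entities`, `to_rows` and `answer` are all hypothetical, chosen only to show the general idea on a toy corpus. A plain dictionary lookup stands in for the paper's modified NER, which is trainable and not reproduced here.

```python
import re

# Hypothetical gazetteer (assumption): entity label -> known surface forms.
# In a trainable system these entries would be learned rather than hard-coded.
GAZETTEER = {
    "PERSON": ("Marie Curie", "Alan Turing"),
    "PLACE": ("Warsaw", "London"),
}
# Hypothetical pattern for four-digit years between 1000 and 2099.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def tag_entities(sentence):
    """Return (label, text) pairs found by lookup and regex, with no POS tagging."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            if name in sentence:
                found.append((label, name))
    found.extend(("YEAR", year) for year in YEAR.findall(sentence))
    return found


def to_rows(text):
    """Turn free text into relational rows, one row per sentence with an entity."""
    rows = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        row = {"PERSON": None, "PLACE": None, "YEAR": None, "sentence": sentence}
        for label, value in tag_entities(sentence):
            if row[label] is None:  # keep the first entity of each type
                row[label] = value
        if any(row[key] for key in ("PERSON", "PLACE", "YEAR")):
            rows.append(row)
    return rows


# Hypothetical mapping from question word to the column that holds the answer.
QUESTION_TARGETS = {"who": "PERSON", "where": "PLACE", "when": "YEAR"}


def answer(question, rows):
    """Answer a who/where/when question by matching its entities against the rows."""
    words = question.lower().split()
    target = QUESTION_TARGETS.get(words[0]) if words else None
    if target is None:
        return None
    known = tag_entities(question)
    for row in rows:
        if row[target] is not None and all(row[label] == value for label, value in known):
            return row[target]
    return None


corpus = ("Marie Curie was born in Warsaw in 1867. "
          "Alan Turing was born in London in 1912.")
table = to_rows(corpus)
print(answer("Where was Alan Turing born?", table))
print(answer("When was Marie Curie born?", table))
```

In this sketch the same tagger is applied to both the corpus and the question, so a query is resolved by a simple row match on the entities it mentions; a real system would need a far richer notion of relation and question type.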

Full Text