A Information Retrieval Based on Question and Answering and NER for Unstructured Information Without Using SQL

Partha Sarathy Banerjee, Deepak Tripathi, Sourabh S Kumar, Baisakhi Chakraborty, Hardik Gupta

doi:10.1007/s11277-019-06501-z

Abstract

Unstructured data is now available in abundance, most often in the form of natural language text. For a defense establishment, intelligence data or other sensitive information is best utilized when it can be extracted efficiently and easily. The proposed model is applicable wherever the influx of text-heavy unstructured data is high, such as information from the World Wide Web, documents related to a particular domain, or any other source where the information is in natural language. The proposed Natural Language Information Interpretation and Representation System (NLIIRS) accepts information as natural language text, processes it, and allows the user to retrieve information by posing questions in natural language. NLIIRS responds to these questions with factoid or phrase-based answers. In contrast to conventional question answering systems, NLIIRS combines the advantages of named entity recognition (NER) with a sequential pattern matching based answer search technique. This avoids the use of Structured Query Language (SQL) at the back end for information processing, storage, and extraction: user queries need not be converted to SQL statements, and the unstructured text need not be stored in relational tables. With this approach, after every execution step of our novel text processing algorithm, pattern matching and answer extraction become more concise and faster. The whole system is built on the Natural Language Toolkit (NLTK), which was used to generate part-of-speech tags, tokenize the data, and form tree structures.
The novel text processing algorithm uses the lemmatizer, stemmer, and ne_chunker to prepare the text for information retrieval via Q&A. An advantage of this system is that it does not need training. The system enables users to retrieve any information of their choice from the available unstructured information.
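
To make the idea of answering questions from tagged text without a relational back end more concrete, the following is a minimal, self-contained sketch. It is not the NLIIRS algorithm. It assumes a toy gazetteer in place of a real NER component such as ne_chunker, a simple keyword overlap score in place of the sequential pattern matching described above, and a hypothetical mapping from question words to entity types. All names in it (GAZETTEER, QUESTION_TYPE, tag_entities, answer) are illustrative assumptions.

```python
import re

# Hypothetical toy gazetteer standing in for a real NER component.
GAZETTEER = {
    "PERSON": ["Alice Rao", "Colonel Mehta"],
    "LOCATION": ["Kolkata", "Delhi"],
    "DATE": ["Monday"],
}

# Assumed mapping from the leading question word to the expected entity type.
QUESTION_TYPE = {"who": "PERSON", "where": "LOCATION", "when": "DATE"}

STOPWORDS = {"the", "a", "an", "did", "does", "do", "is", "was",
             "to", "in", "on", "of", "at", "who", "where", "when"}


def tag_entities(sentence):
    """Return (entity, label) pairs found by gazetteer lookup, in text order."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            match = re.search(r"\b" + re.escape(name) + r"\b", sentence)
            if match:
                found.append((match.start(), name, label))
    return [(name, label) for _, name, label in sorted(found)]


def content_words(text):
    """Lowercased word set with stopwords removed (no stemming in this sketch)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def answer(question, sentences):
    """Pick the entity of the expected type from the best-overlapping sentence."""
    words = re.findall(r"[a-z]+", question.lower())
    if not words or words[0] not in QUESTION_TYPE:
        return None  # question type not covered by this sketch
    wanted = QUESTION_TYPE[words[0]]
    q_words = content_words(question)
    best, best_score = None, 0
    for sentence in sentences:
        # Candidate answers: entities of the wanted type not already in the question.
        candidates = [name for name, label in tag_entities(sentence)
                      if label == wanted and name.lower() not in question.lower()]
        if not candidates:
            continue
        score = len(q_words & content_words(sentence))
        if score > best_score:
            best, best_score = candidates[0], score
    return best


corpus = [
    "Colonel Mehta travelled to Delhi on Monday.",
    "Alice Rao met the informant in Kolkata.",
]

print(answer("Who met the informant?", corpus))           # Alice Rao
print(answer("Where did Colonel Mehta travel?", corpus))  # Delhi
print(answer("When did Colonel Mehta travel?", corpus))   # Monday
```

The text stays in plain sentences throughout: no query is translated to SQL and nothing is stored in tables. In a fuller pipeline, the stemmer and lemmatizer mentioned above would let "travel" in the question match "travelled" in the text, which this sketch does not attempt.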

Full Text