Implementation of a Finite State Automaton to Recognize and Remove Stop Words in English Text on its Retrieval

Sudersan Behera

doi:10.1109/icoei.2018.8553828

Abstract

Nowadays, electronic media and the World Wide Web are the main means of storing information. The data stored on the web is structured, semi-structured, and unstructured in nature. Unstructured data here means text data. Text processing plays a crucial role in processing structured and unstructured data from the web, and preprocessing is the main step in any text processing system. For text processing to be computationally efficient, the common words, or stop words, must be removed from a document while it is scanned. A large number of stop word removal algorithms have been proposed, generally based on a dictionary containing a stop word list, to which pattern matching techniques are applied to identify and remove the stop words. These earlier approaches are time consuming and, overall, inefficient and very expensive. Here I propose a solution that identifies the stop words present in English text using a finite state automaton. By comparison, my algorithm was tested on 220 documents, achieved 99% accuracy, and is also time efficient.
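
To make the idea concrete, the following is a minimal sketch of how a finite state automaton can recognize and remove stop words. It is not the implementation evaluated in this paper: the stop word list `STOP_WORDS`, the helper names `build_dfa`, `accepts`, and `remove_stop_words`, and the simple regular-expression tokenizer are all assumptions chosen for illustration. The automaton is built as a trie, so each token is checked in a single left-to-right pass over its characters.

```python
import re

# Hypothetical, deliberately tiny stop word list (an assumption for this sketch).
STOP_WORDS = ["a", "an", "and", "the", "is", "in", "of", "on", "to"]


def build_dfa(words):
    """Build a deterministic automaton (a trie) that accepts exactly `words`.

    State 0 is the start state; transitions[s] maps a character to the next state.
    """
    transitions = [{}]
    accepting = set()
    for word in words:
        state = 0
        for ch in word:
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting.add(state)
    return transitions, accepting


def accepts(dfa, token):
    """Return True if the lowercased token ends in an accepting state."""
    transitions, accepting = dfa
    state = 0
    for ch in token.lower():
        state = transitions[state].get(ch)
        if state is None:  # no transition defined: reject immediately
            return False
    return state in accepting


def remove_stop_words(text, dfa):
    """Tokenize on runs of letters/apostrophes and drop tokens the DFA accepts."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return [t for t in tokens if not accepts(dfa, t)]
```

In this sketch, `remove_stop_words("The cat is on the mat", build_dfa(STOP_WORDS))` keeps only the tokens `cat` and `mat`. A prefix of a stop word (such as `th`) or an extension of one (such as `ant`) is rejected because the run does not end in an accepting state.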

Full Text