An Apriori Method for Topic Extraction from Text Files

Anil Kumar K M, Shashank R, Amogha Subramanya D A, Ajay B

doi:10.35940/ijrte.a3068.078219

Abstract

In this data age, petabytes of data are generated every day. One of the biggest challenges today is converting this data into useful information, a process known as data mining. Important kinds of data include text, audio, image, and video data. An important challenge in mining useful information from text-based data sources (text mining) is topic modeling, which aims to identify the topic a text is about. Solutions to this problem find application in clustering files by topic, as a pre-processing step in information retrieval, in ontologies of medical records, and so on. A lot of research has gone into topic modeling, and many approaches have been formulated. Some of these approaches take into account the occurrence and frequency of words or terms; these models fall under the Bag of Words (BOW) approach. Others take into account the underlying structure of the text corpus used; the Wikipedia category graph is an example of this approach. This paper provides an unsupervised solution to the above problem by extracting keywords that represent the topic of a text document. In our approach, topic modeling is carried out with a hybrid model that makes use of WordNet and the Wikipedia corpus. Our model has obtained promising experimental results on a well-known news dataset (BBCNews). We present the experimental results for our proposed approach alongside those of other approaches in the same domain and show that our approach provides better results.

Full Text