Abstract
Most research defines topic detection as the task of discovering the distinct themes in a collection of documents. Our approach instead finds a topic for every document in the corpus, where a topic is any word or group of words that tells what the document is about. In this paper, we propose a novel, unsupervised topic detection approach that is simple yet effective for both detecting topics and finding keywords in a corpus. Keywords are extracted automatically by identifying relationships between words in unstructured data, without any training data. The extraction rests on a word decomposition hypothesis: the words that appear in the bigram or trigram word vectors are potential distributions of words from the unigram word vector. We use the standard term frequency (TF) measure to find the keywords. After keyword extraction, our proposed topic detection algorithm determines the most suitable topic for each document. The topics detected across the entire corpus, together with the keywords related to each topic, are stored and analyzed. We judge the effectiveness and accuracy of the keywords by using them as features for classification and comparing the results against the standard bag-of-words approach. The topics detected by our algorithm are found to be relevant to their documents. The experimental results show that using keywords drastically reduces the dimensionality of the corpus while maintaining, and in most cases improving, the F-measure of categorization. Our approach to feature selection for text categorization therefore not only improves classification accuracy but also considerably reduces the time required for classification.
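The abstract does not spell out the algorithm, so the following is only a minimal sketch of one possible reading of it, not the authors' implementation. It assumes that keywords are the `top_k` unigrams by raw term frequency, that a bigram or trigram counts as a keyword phrase only when every word in it is one of those unigrams, and that a document's topic is its most frequent keyword phrase (falling back to its most frequent keyword). The function names, the `top_k` parameter, and the whitespace tokenization without stopword filtering are all hypothetical choices made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_keywords(docs, top_k=5):
    """Select frequent unigrams, then keep n-grams built only from them.

    Assumption: this is one interpretation of the word decomposition
    hypothesis, not the procedure defined in the paper.
    """
    tokenized = [doc.lower().split() for doc in docs]

    # Corpus-wide term frequency of single words.
    unigram_tf = Counter(tok for toks in tokenized for tok in toks)
    top_unigrams = {word for word, _ in unigram_tf.most_common(top_k)}

    # Count bigrams and trigrams whose words are all top unigrams.
    phrase_tf = Counter()
    for toks in tokenized:
        for n in (2, 3):
            for gram in ngrams(toks, n):
                if all(word in top_unigrams for word in gram.split()):
                    phrase_tf[gram] += 1
    return unigram_tf, top_unigrams, phrase_tf


def detect_topic(doc, top_unigrams, phrase_tf):
    """Return the document's most frequent keyword phrase, else keyword.

    Returns None when the document contains no keyword at all.
    """
    toks = doc.lower().split()
    candidates = Counter()
    for n in (2, 3):
        for gram in ngrams(toks, n):
            if gram in phrase_tf:
                candidates[gram] += 1
    if not candidates:
        # Hypothetical fallback: use single keywords when no phrase matches.
        candidates = Counter(t for t in toks if t in top_unigrams)
    return candidates.most_common(1)[0][0] if candidates else None


# Toy corpus, invented for illustration only.
docs = [
    "machine learning improves text classification",
    "text classification with machine learning",
    "stock market prices fall",
]
_, top_unigrams, phrase_tf = extract_keywords(docs, top_k=4)
for doc in docs:
    print(doc, "->", detect_topic(doc, top_unigrams, phrase_tf))
```

Restricting phrases to words that are already frequent unigrams keeps the feature set small, which is in the spirit of the dimensionality reduction the abstract reports; a real pipeline would also need stopword removal and a principled way to choose `top_k`.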