Enhanced Multi-Class Newsgroup Document Classification Using N-Gram Approach

Ali Abbas, Tanveer J Siddiqui, Shreya Agarwal, Prajna Jha, Manish Jaiswal

doi:10.59670/ml.v21is6.7909

Abstract

As technology advances, the consumption of data and information over the web is increasing day by day, and our personal and professional lives depend more and more on internet data. As a populous nation with high internet penetration, India is a huge market for digital media. The shift to online news has produced a large volume of news text that must be classified into categories for further applications. To address efficient news document classification in the field of computational linguistics, we propose a novel approach to newsgroup classification that combines tokens with context window sizes of one, two and three, known as unigrams, bigrams and trigrams respectively. We explored the relevance of trigrams by experimenting with different combinations of n-grams using three supervised classifiers: Perceptron, Nearest Centroid, and Random Forest. We found that including trigrams along with unigrams and bigrams achieves the best F1-macro score and accuracy for all three classifiers. Perceptron gives the best results in comparison with the Nearest Centroid and Random Forest classifiers, with an accuracy of 0.846945 and an F1-macro score of 0.83055. We also devoted significant effort to validating our approach against the existing literature.
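
To make the combined feature set concrete, the following is a minimal sketch of how unigram, bigram and trigram counts could be collected from a single document. It is an illustration only, not the authors' implementation: the whitespace tokenization, the lowercasing, the example sentence and the function names `ngrams` and `combined_ngram_counts` are all assumptions, since the abstract does not describe the preprocessing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-token sequences in `tokens`, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_ngram_counts(text, max_n=3):
    """Count all 1-grams up to `max_n`-grams of a text.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; the paper's actual tokenizer is not specified.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical example document (not taken from the paper's dataset).
features = combined_ngram_counts("the match was won by the home team")
print(features["the"])            # unigram feature
print(features["the home"])       # bigram feature
print(features["the home team"])  # trigram feature
```

Feature dictionaries of this kind would then be turned into vectors and passed to a classifier such as Perceptron, Nearest Centroid, or Random Forest. Setting `max_n` to 1 or 2 gives the unigram-only and unigram-plus-bigram variants that the trigram setting is compared against.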

Full Text