Abstract

We live in a world where knowledge is extremely valuable, and the amount of information available in text documents has grown to the point where it is difficult to find the documents that matter to us. As a result, language-based classification of text documents is important. Telugu is a morphologically rich Dravidian language. Because many Telugu documents are available on the Internet, it is important to organize them by automatically assigning each document to one of a set of predefined labels based on its content. We propose Telugu text document classification using a variety of machine learning algorithms and feature extraction techniques, evaluated on a Telugu corpus. We gathered 1990 documents from an online newspaper, divided into three categories: cinema (467), sports (839), and politics (684). We apply a naive Bayes (NB) classifier to N-gram features, and multinomial naive Bayes (MNB), support vector machine (SVM), and logistic regression (LR) classifiers to one-hot encoded vectors. For the n-gram approach, we extracted uni-gram and bi-gram features from the 1990 documents and tested the naive Bayes classifier on 120 unseen documents, obtaining 99% accuracy with uni-grams and 97% accuracy with bi-grams. For the one-hot encoding approach, we built vectors whose length is the size of the vocabulary, using 1375 documents (70%) for training and 597 documents (30%) for testing. MNB achieves 98% accuracy, SVM achieves 99% accuracy, and logistic regression achieves 98% accuracy.

Keywords: Text classification; N-grams; Supervised machine learning; Naïve Bayes classifier; Multinomial naïve Bayes; Support vector machine; Logistic regression; Feature selection; Vectorization
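The n-gram naive Bayes approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `NGramNaiveBayes`, the whitespace tokenization, the add-one (Laplace) smoothing, and the toy English placeholder documents are all assumptions made for illustration, since the abstract does not specify these details and the Telugu corpus is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NGramNaiveBayes:
    """Naive Bayes over n-gram counts with add-one smoothing (illustrative)."""

    def __init__(self, n=1):
        self.n = n
        self.class_docs = Counter()                  # label -> number of documents
        self.feature_counts = defaultdict(Counter)   # label -> n-gram counts
        self.vocab = set()                           # all n-grams seen in training

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            # Assumption: whitespace tokenization; real Telugu text may need more.
            feats = ngrams(text.split(), self.n)
            self.class_docs[label] += 1
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = ngrams(text.split(), self.n)
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(n_docs / total_docs)
            for feat in feats:
                score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus standing in for the three newspaper categories.
docs = [
    "actor film release song",
    "film director actor award",
    "match team score wicket",
    "team win match captain",
    "minister party election vote",
    "party leader election speech",
]
labels = ["cinema", "cinema", "sports", "sports", "politics", "politics"]

unigram_model = NGramNaiveBayes(n=1).fit(docs, labels)
bigram_model = NGramNaiveBayes(n=2).fit(docs, labels)

print(unigram_model.predict("team score match"))
print(bigram_model.predict("party leader election"))
```

Setting `n=1` or `n=2` switches between uni-gram and bi-gram features. Accuracy on a real corpus would be measured by comparing `predict` outputs with held-out labels, as the abstract describes for its 120 unseen documents.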
