Abstract
Text categorization has a variety of applications, such as sentiment analysis of users' tweets and categorizing blog posts into different categories. The real-time data available for categorization is usually unstructured, so an efficient preprocessing algorithm can help achieve better accuracy. Term frequency-inverse document frequency (tf-idf) and word2vec word embedding techniques are widely used before applying a text classification model. To show the effect of these techniques on text categorization, we compare the accuracies of different multi-class text categorization algorithms, namely Support Vector Machine (SVM), Logistic Regression, and K-Nearest Neighbor (KNN), on top of each technique. The TagMyNews dataset is used to train the models. The results indicate that word2vec is the more effective word embedding technique, as it yields higher accuracies for all the classification methods (KNN: 79.38%, SVM: 93.59%, Logistic Regression: 87.46%) compared to tf-idf (KNN: 73.37%, SVM: 84%, Logistic Regression: 73.98%).
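The pipeline described (vectorize text, then train a multi-class classifier) can be sketched as follows. This is a minimal illustration using scikit-learn's tf-idf vectorizer and a linear SVM; the toy headlines and category labels below are hypothetical stand-ins for the TagMyNews data, and the model parameters are assumptions, not taken from the paper.

```python
# Sketch of the tf-idf + SVM text categorization pipeline.
# The corpus and labels are illustrative, not the TagMyNews dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny labeled corpus (hypothetical news headlines and categories).
docs = [
    "stocks rally as markets close higher",
    "team wins championship after overtime thriller",
    "new smartphone features faster processor",
    "central bank raises interest rates again",
    "star striker scores twice in derby win",
    "chipmaker unveils next generation gpu",
]
labels = ["business", "sport", "tech", "business", "sport", "tech"]

# tf-idf maps each document to a sparse vector of term weights;
# a linear SVM then learns one-vs-rest boundaries between categories.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

prediction = model.predict(["goalkeeper saves penalty in final"])[0]
print(prediction)
```

A word2vec-based variant would instead average per-word embedding vectors (e.g. from gensim's `Word2Vec`) into one dense document vector before classification, which is what the abstract reports as the stronger representation.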