Roman-Urdu News Headline Classification with IR Models using Machine Learning Algorithms

Syed Muhammad Hassan, Shaukat Wasi, Fayyaz Ali, Imtiaz Hussain, Syeda Nazia Ashraf, Samreen Javeed

doi:10.17485/ijst/2019/v12i35/146571

Abstract

Objectives: Roman-Urdu is considered a non-standard language that is used frequently on the Internet. Classifying Roman-Urdu text for article tagging is a difficult task because of the many irregularities in spelling; for example, the word khubsurat (beautiful) can also be written as khoobsurat, khubsoorat, and khobsorat. Methods/Statistical Analysis: In this study, we scrape Roman-Urdu news headlines from various online newspapers. Our corpus contains 12319 news headlines in seven categories: Accident, Sports, Weather, Arrest, Conference, Operation and Violence. We apply preprocessing steps such as Roman-Urdu stop-word removal, and use the IR models TF-IDF and Count Vector for feature extraction before applying the classification algorithms. Findings: We compare the results of different machine learning algorithms, including RF, LSVC, MNB, LR, RC, PAC, Perceptron, NC and SGDC. Our model gives its best result in identifying the desired class with the SGD classifier, which achieves 93.50% accuracy. Application/Improvements: We recommend the SGD classifier for Roman-Urdu news headline text classification. Keywords: Linear SVC, Multinomial Naïve Bayes (MNB), Ridge Classifier (RC), Random Forest, Roman-Urdu, Supervised Machine Learning, Stochastic Gradient Descent (SGD), Text Classification, TF-IDF

Highlights

A large amount of data, in all its variations, is available on the Internet nowadays; most interestingly, language is no longer a barrier to identifying information
Most researchers have previously worked on Roman-Urdu in the context of sentiment analysis and opinion mining, with a limited number of supervised machine learning algorithms such as Naïve Bayes (NB), Logistic Regression with Stochastic Gradient Descent (LRSGD) and Support Vector Machine (SVM)
We observed that the SGD classifier gives the best result, 93.50%, compared with the top 5 algorithms, with different variations across all categories according to the techniques applied

Summary

Introduction

A large amount of data, in all its variations, is available on the Internet nowadays; most interestingly, language is no longer a barrier to identifying information. Roman-Urdu, a blend of English and Urdu, is one of the most popular languages today, and demand for it is increasing [1]. Categorizing text is one of the most common and useful techniques, and it covers all the major areas of Natural Language Processing, for example sentiment analysis, opinion mining, reviews, tweets, blogs, spam detection and anything else whose sentiment is to be evaluated. Two processes, stop-word removal and feature-vector construction, are used to prepare sentences so that machine learning algorithms can predict their class. Most researchers have previously worked on Roman-Urdu in the context of sentiment analysis and opinion mining, with a limited number of supervised machine learning algorithms such as Naïve Bayes (NB), Logistic Regression with Stochastic Gradient Descent (LRSGD) and Support Vector Machine (SVM).
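
The two preprocessing steps named above, stop-word removal and feature-vector construction, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the three toy headlines, and the function names are all hypothetical, and the TF-IDF weighting shown is the plain textbook form (raw term count multiplied by log(N/df)), which may differ from the variant used in the study.

```python
import math
from collections import Counter

# Hypothetical stop-word list and toy headlines, for illustration only;
# they are not taken from the paper's corpus.
STOP_WORDS = {"ka", "ki", "ke", "mein", "se", "ne", "ko", "hai"}

HEADLINES = [
    "karachi mein barish ka imkan",
    "lahore mein barish se hadsa",
    "pakistan ne match jeet liya",
]


def tokenize(headline):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in headline.lower().split() if t not in STOP_WORDS]


def count_vectors(docs):
    """Return the sorted vocabulary and one raw-count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            vec[index[tok]] = n
        vectors.append(vec)
    return vocab, vectors


def tfidf_vectors(vocab, counts):
    """Weight raw counts by idf = log(N / df) (textbook form, assumed here)."""
    n_docs = len(counts)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return [[c * w for c, w in zip(vec, idf)] for vec in counts]


vocab, counts = count_vectors(HEADLINES)
tfidf = tfidf_vectors(vocab, counts)
```

In this sketch a term that appears in many headlines (here `barish`, in two of three) receives a lower weight than a term unique to one headline. Either the count vectors or the TF-IDF vectors could then be passed to a classifier such as the SGD classifier discussed in this paper.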

Objectives

Methods

Results

Conclusion