Multi-text classification of Urdu/Roman using machine learning and natural language preprocessing techniques

M Ameen Chhajro

doi:10.17485/ijst/v13i19.230

Abstract

Objectives: This research presents multi-text classification of a news text dataset. The main purpose of the work is to classify Urdu and Roman Urdu text using Natural Language Processing (NLP) and Machine Learning classification models.

Methods/Statistical analysis: Online news data was collected from various online newspaper platforms with the Beautiful Soup web scraping tool. The news corpus consists of 10500 news items in Urdu and Roman Urdu. To analyze model accuracy, the data is divided into six categories: Accidental, Education, Entertainment, International, Sports and Weather. The corpus is preprocessed with NLP techniques such as data cleaning, data balancing, and stop word removal. For feature extraction, count vectors and TF-IDF are employed, with Chi2 used for word filtering. For multi-text classification, five Machine Learning classifiers are implemented: Naive Bayes Classifier, Logistic Regression, Random Forest Classifier, Linear SVC, and KNeighbors Classifier. The comparative analysis showed that the Linear Support Vector Classifier reached 96% accuracy among the tested methods.

Findings: Multi-text classification of Urdu and Roman Urdu is a challenging task because of the different writing styles, word structures, irregularities, and grammar involved, and because the corpus combines both scripts. For this purpose, we implemented different Machine Learning algorithms with NLP preprocessing techniques, which provided optimal results in the classification of multi-text news data.

Keywords: MultiText Classification; Machine Learning; NLP Preprocessing Techniques
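To make the classification step in the abstract more concrete, the following is a minimal sketch of one of the named approaches: count vectors fed into a multinomial Naive Bayes classifier with add-one smoothing. It is written from scratch in plain Python for illustration and is not the authors' implementation. The training sentences, the `tokenize` helper, and the function names are all hypothetical, and the toy data uses only three of the six categories.

```python
import math
from collections import Counter, defaultdict

# Invented toy examples for illustration only; not taken from the paper's corpus.
TRAIN = [
    ("cricket match team jeet gayi", "Sports"),
    ("hockey team ne final match khela", "Sports"),
    ("aaj barish aur thandi hawa", "Weather"),
    ("kal tez barish ka imkan", "Weather"),
    ("school exam result ka elaan", "Education"),
    ("university admission aur exam schedule", "Education"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def train_naive_bayes(samples):
    """Collect per-class word counts (count vectors) and class frequencies."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict(text, word_counts, class_counts, vocab):
    """Return the class with the highest log-posterior under add-one smoothing."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
print(predict("kal match mein team jeet gayi", *model))
```

On a real corpus the same idea would be applied after the cleaning, balancing, and stop word removal steps, and the other classifiers listed in the abstract would be compared on held-out data.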

Highlights

Multi-text classification of Urdu through a combination of Urdu and Roman Urdu text is considered one of the most challenging tasks in text classification
Articles, stories, blogs and reviews are typically organized by topic, products are tagged by category, and users can be classified on the basis of how they talk about a particular brand or product on online platforms; most text classification blogs and tutorials on the internet, however, cover binary text classification, with email spam filtering and sentiment analysis as common examples
We use Term Frequency-Inverse Document Frequency (TF-IDF) as the feature extraction technique because it is easy to compute, provides a basic metric for extracting the most descriptive terms in a document, and allows the similarity between two documents to be calculated
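The two TF-IDF properties mentioned in the last highlight, picking out descriptive terms and comparing two documents, can be sketched in a few lines of plain Python. This is an illustrative sketch only: it assumes the basic weighting tf * log(N / df) and cosine similarity, which may differ from the exact variant used in the study, and the toy documents and function names are invented for the example.

```python
import math
from collections import Counter

# Invented toy documents for illustration only; not from the paper's corpus.
DOCS = [
    "cricket team ne match jeet liya",
    "hockey team ne match haar diya",
    "aaj tez barish aur thandi hawa",
]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = tfidf_vectors(DOCS)
top_term = max(vectors[0], key=vectors[0].get)  # a most descriptive term of doc 0
sim_sports = cosine_similarity(vectors[0], vectors[1])
sim_weather = cosine_similarity(vectors[0], vectors[2])
print(top_term, sim_sports, sim_weather)
```

Terms shared by several documents (here `team`, `ne`, `match`) receive a lower weight than terms unique to one document, and the first document has zero similarity to the third because they share no terms.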

Summary

Introduction

Multi-text classification of Urdu through a combination of Urdu and Roman Urdu text is considered one of the most challenging tasks in text classification. Articles, stories, blogs and reviews are typically organized by topic, products are tagged by category, and users can be classified on the basis of how they talk about a particular brand or product on online platforms. Most text classification blogs and tutorials on the internet, however, cover binary text classification, with common examples being email spam filtering (spam vs ham) and sentiment analysis (positive vs negative). Text categorization is the most common and useful approach for analyzing textual data by category, and it plays an important role in NLP applications such as opinion mining, sentiment analysis, tweet and review analysis, spam detection, and email filtering [3].

Objectives

Methods

Results

Conclusion