Abstract

The rapid growth of electronic documents produces large volumes of unstructured data, which makes searching for a relevant document slow and laborious. Text Document Classification (TDC), in which unstructured documents are organized into pre-defined classes, is of great significance in information processing and retrieval. Among South Asian languages, Urdu attracts particular research interest because of its complex morphology, unique features, and lack of linguistic resources such as standard datasets. Compared to short-text tasks such as sentiment analysis, long text classification needs more time and effort because of its large vocabulary, greater noise, and redundant information. Machine Learning (ML) and Deep Learning (DL) models have been widely used in text processing. Despite the major limitations of ML models, such as learning directed features, they remain the preferred methods for Urdu TDC. To the best of our knowledge, this is the first study of Urdu TDC using a DL model. In this paper, we design a large multi-purpose and multi-format dataset that contains more than ten thousand documents organized into six classes. We use a Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) for classification and compare its performance with sixteen ML baseline models on three imbalanced datasets of various sizes. Further, we analyze the effects of preprocessing methods on SMFCNN performance. SMFCNN outperformed the baseline classifiers and achieved accuracy scores of 95.4%, 91.8%, and 93.3% on the medium, large, and small datasets, respectively. The designed dataset will be publicly and freely available in different formats for future research in Urdu text processing.

Highlights

  • The rapid growth of electronic text documents on the internet, the World Wide Web (WWW), news blogs, and digital libraries, produced by organizations, researchers, news media, and institutions, is causing problems such as a large volume of unstructured data

  • After tuning the hyperparameters of our model, we evaluated the performance of the Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) in four settings: 1) without removing stopwords or rare words, 2) after removing stopwords, 3) after removing both stopwords and rare words, and 4) with different split ratios of the dataset into training and testing subsets

  • Because Deep Learning (DL) models outperformed Machine Learning (ML) models in our comparisons, this study opens the way for text document classification using deep learning models
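
The evaluation settings listed in the highlights can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`remove_stopwords`, `remove_rare_words`, `split_dataset`), the toy tokens, the stopword list, the rare-word threshold `min_count=2`, and the 0.67 split ratio are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def remove_stopwords(docs, stopwords):
    """Drop every token that appears in the stopword set."""
    return [[tok for tok in doc if tok not in stopwords] for doc in docs]


def remove_rare_words(docs, min_count=2):
    """Drop tokens whose corpus-wide frequency is below min_count (assumed threshold)."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in docs]


def split_dataset(items, train_ratio):
    """Split a list into training and testing parts by ratio (no shuffling in this sketch)."""
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]


# Toy tokenized documents; placeholders, not samples from the paper's Urdu dataset.
docs = [
    ["a", "cricket", "match"],
    ["a", "cricket", "final"],
    ["the", "budget", "speech"],
]
stopwords = {"a", "the"}  # hypothetical stopword list

variant_1 = docs                                        # 1) nothing removed
variant_2 = remove_stopwords(docs, stopwords)           # 2) stopwords removed
variant_3 = remove_rare_words(variant_2, min_count=2)   # 3) stopwords and rare words removed
train, test = split_dataset(variant_3, train_ratio=0.67)  # 4) one possible split ratio
```

Each variant would then be fed to the same classifier so that differences in accuracy can be attributed to the preprocessing step or the split ratio alone.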


Summary

INTRODUCTION

The rapid growth of electronic text documents on the internet, the World Wide Web (WWW), news blogs, and digital libraries, produced by organizations, researchers, news media, and institutions, is causing problems such as a large volume of unstructured data. As in other languages, Urdu text documents on the WWW, blogs, online libraries, and news sites are increasing rapidly. Together, these trends are increasing researchers' interest in TDC for the Urdu language. We use a Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) model to classify Urdu text documents. Researchers have used different, randomly selected split ratios to split their datasets, which decreased classifier performance because of an insufficient number of documents in the training or testing subset [18]. We investigate how removing stopwords and rare words from Urdu text affects the performance of SMFCNN.
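
The idea behind a single convolution layer with filters of several sizes can be sketched without any deep learning library. The sketch below assumes that SMFCNN follows the common pattern of parallel filters of different widths over the embedded token sequence, each followed by ReLU and max-over-time pooling, with the pooled values concatenated into one feature vector. The filter widths (3, 4, 5), the number of filters per width, the embedding size, and the random weights are all assumptions for illustration, not the paper's settings.

```python
import random


def conv_max_pool(embedded, filt):
    """Slide one filter over the token sequence, apply ReLU, keep the maximum response."""
    width = len(filt)
    best = 0.0  # ReLU floor: negative responses are clipped to zero
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        response = sum(
            w * x
            for f_row, e_row in zip(filt, window)
            for w, x in zip(f_row, e_row)
        )
        best = max(best, response)
    return best


def smf_features(embedded, filter_bank):
    """Concatenate the pooled outputs of all filters of all widths (one conv layer)."""
    return [
        conv_max_pool(embedded, filt)
        for filters in filter_bank.values()
        for filt in filters
    ]


rng = random.Random(0)
embed_dim = 4              # assumed embedding size
doc_len = 10               # assumed document length in tokens
filter_widths = (3, 4, 5)  # assumed filter sizes
filters_per_width = 2      # assumed number of filters per size

# Random stand-ins for learned word embeddings and learned filter weights.
embedded = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(doc_len)]
filter_bank = {
    w: [
        [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(w)]
        for _ in range(filters_per_width)
    ]
    for w in filter_widths
}

features = smf_features(embedded, filter_bank)  # one pooled value per filter
```

In a trained model, the resulting feature vector would be passed to a fully connected softmax layer over the document classes. Here the weights are random, so only the shape of the computation is meaningful.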

URDU LANGUAGE AND ITS FEATURES
OUR CONTRIBUTION
RELATED WORK
PROPOSED MODEL AND DATASET
HYPERPARAMETER SETTINGS
RESULTS AND DISCUSSION
COMPARISON OF SMFCNN WITH ML CLASSIFIERS
CONCLUSION