Document-Level Text Classification Using Single-Layer Multisize Filters Convolutional Neural Network

Muhammad Pervez Akhter, Zheng Jiangbin, Muhammad Tariq Sadiq, Irfan Raza Naqvi, Atif Mehmood, Mohammed Abdelmajeed

doi:10.1109/access.2020.2976744

Abstract

The rapid growth of electronic documents is producing large volumes of unstructured data, which makes searching for a relevant document slow and laborious. Text Document Classification (TDC), which organizes unstructured documents into pre-defined classes, is therefore of great significance in information processing and retrieval. Among South Asian languages, Urdu is the most popular subject of research because of its complex morphology, unique features, and lack of linguistic resources such as standard datasets. Compared with short-text tasks such as sentiment analysis, long-text classification needs more time and effort because of its larger vocabulary, greater noise, and redundant information. Machine Learning (ML) and Deep Learning (DL) models have been widely used in text processing. Despite major limitations, such as learning directed features, ML models remain the favored methods for Urdu TDC. To the best of our knowledge, this is the first study of Urdu TDC using a DL model. In this paper, we design a large multi-purpose, multi-format dataset that contains more than ten thousand documents organized into six classes. We use a Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) for classification and compare its performance with sixteen ML baseline models on three imbalanced datasets of various sizes. Further, we analyze the effects of preprocessing methods on SMFCNN performance. SMFCNN outperformed the baseline classifiers and achieved accuracies of 95.4%, 91.8%, and 93.3% on the medium, large, and small datasets, respectively. The designed dataset will be publicly and freely available in different formats for future research in Urdu text processing.

Highlights

The rapid growth of electronic text documents on the internet, the World Wide Web (WWW), news blogs, and digital libraries, produced by organizations, researchers, news media, and institutions, is creating problems such as a large volume of unstructured data.
After tuning the hyperparameters of our model, we evaluated the performance of the Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) in four ways: 1) without removing stopwords or rare words, 2) after removing stopwords, 3) after removing both stopwords and rare words, and 4) with different ratios for splitting the dataset into training and testing subsets.
Because Deep Learning (DL) models outperformed Machine Learning (ML) models in our comparisons, this study opens the way for text document classification using deep learning models.
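The four evaluation settings named in the highlights can be illustrated with a small sketch. It is not the authors' pipeline: the helper names, the `min_count` threshold, the default split ratio, and the choice of a per-class (stratified) split are all assumptions made for illustration, and documents are assumed to be already tokenized.

```python
from collections import Counter, defaultdict
import random


def remove_stopwords(docs, stopwords):
    """Drop every token that appears in the stopword set."""
    return [[tok for tok in doc if tok not in stopwords] for doc in docs]


def remove_rare_words(docs, min_count=2):
    """Drop tokens whose corpus-wide frequency is below min_count (assumed threshold)."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in docs]


def stratified_split(labels, train_ratio=0.8, seed=0):
    """Return (train_indices, test_indices), splitting each class separately.

    Splitting per class is an assumption here; it keeps small classes of an
    imbalanced dataset represented in both subsets whenever a class has at
    least two documents.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_ratio)
        # Clamp so that both subsets get a document when the class allows it.
        cut = max(1, min(len(indices) - 1, cut))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)
```

Setting 1 uses the tokenized documents as they are, setting 2 applies `remove_stopwords`, setting 3 applies `remove_stopwords` followed by `remove_rare_words`, and setting 4 repeats the experiment with several values of `train_ratio`.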

Summary

INTRODUCTION

The rapid growth of electronic text documents on the internet, the World Wide Web (WWW), news blogs, and digital libraries, produced by organizations, researchers, news media, and institutions, is creating problems such as a large volume of unstructured data. As in other languages, Urdu text documents on the WWW, blogs, online libraries, and news sites are increasing rapidly, and this growth is raising researchers' interest in TDC for Urdu. We use a Single-layer Multisize Filters Convolutional Neural Network (SMFCNN) model to classify Urdu text documents. Researchers have used different but randomly selected split ratios to divide their datasets, which decreased classifier performance because the training or testing subset contained too few documents [18]. We investigate how removing stopwords and rare words from Urdu text affects the performance of SMFCNN.
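To make the model's name concrete, the following is a minimal forward-pass sketch of a single convolutional layer with filters of several sizes. It assumes the common design of embedding lookup, parallel filters, ReLU, max-over-time pooling, and a dense softmax layer; this summary does not state the authors' exact layers or hyperparameters, so the filter sizes, dimensions, random weights, and function names below are hypothetical. Only the six output classes come from the abstract.

```python
import math
import random


def conv_max_pool(embedded, filt):
    """Slide one filter over the token sequence, apply ReLU, keep the maximum.

    embedded: list of token vectors (seq_len x dim)
    filt:     list of weight vectors (filter_size x dim); bias omitted for brevity
    """
    size = len(filt)
    best = 0.0  # ReLU output is never negative; also the value for too-short inputs
    for start in range(len(embedded) - size + 1):
        window = embedded[start:start + size]
        activation = sum(
            w * x
            for w_row, x_row in zip(filt, window)
            for w, x in zip(w_row, x_row)
        )
        best = max(best, activation)
    return best


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def smfcnn_forward(token_ids, params):
    """Embedding -> multi-size filters -> max-over-time pooling -> dense softmax."""
    embedded = [params["embedding"][t] for t in token_ids]
    # One pooled feature per filter, concatenated across all filter sizes.
    features = [
        conv_max_pool(embedded, filt)
        for filters in params["filters"].values()
        for filt in filters
    ]
    logits = [
        sum(w * x for w, x in zip(row, features)) + bias
        for row, bias in zip(params["dense_w"], params["dense_b"])
    ]
    return softmax(logits)


def init_params(vocab_size, dim, filter_sizes, filters_per_size, n_classes, seed=0):
    """Random, untrained weights; every size here is an illustrative assumption."""
    rng = random.Random(seed)

    def rand():
        return rng.uniform(-0.5, 0.5)

    n_features = len(filter_sizes) * filters_per_size
    return {
        "embedding": [[rand() for _ in range(dim)] for _ in range(vocab_size)],
        "filters": {
            size: [
                [[rand() for _ in range(dim)] for _ in range(size)]
                for _ in range(filters_per_size)
            ]
            for size in filter_sizes
        },
        "dense_w": [[rand() for _ in range(n_features)] for _ in range(n_classes)],
        "dense_b": [0.0] * n_classes,
    }
```

Calling `smfcnn_forward(token_ids, init_params(20, 4, (3, 4, 5), 2, 6))` returns six class probabilities for a document given as a list of token ids below 20. Because each filter size contributes its own pooled features, filters of different widths can respond to word patterns of different lengths within a single layer.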

URDU LANGUAGE AND ITS FEATURES

OUR CONTRIBUTION

RELATED WORK

PROPOSED MODEL AND DATASET

HYPERPARAMETER SETTINGS

RESULTS AND DISCUSSION

COMPARISON OF SMFCNN WITH ML CLASSIFIERS

CONCLUSION


Journal: IEEE Access: Practical Innovations, Open Solutions
Publication Date: Jan 1, 2020
Citations: 63
License type: CC BY 4.0


