Document classification, the task of assigning documents to one or more classes, is an open problem in library, information, and computer science. Interest among linguistics researchers in this domain has grown steadily, driven by applications such as language identification, readability assessment, sentiment analysis, and spam filtering. However, researchers working on natural language processing for resource-scarce languages face many hurdles because benchmark datasets are absent. Bengali is among the most widely spoken resource-scarce (low-resource) languages. Although Bengali NLP researchers have created their own datasets, these are useful only for evaluating the performance of their own proposed document classification techniques. Therefore, there is a gap in the literature regarding the availability of benchmark datasets. To overcome this barrier, this paper presents a benchmark dataset for Bengali document classification that is publicly accessible and freely available. The dataset has a two-tier architecture: the first tier for hard classification techniques and the second tier for soft classification techniques. Hard classification techniques use supervised learning models to classify documents, whereas soft classification techniques use unsupervised learning models to cluster documents. The proposed dataset has thirteen unique characteristics. This paper also introduces four new feature sets to evaluate the performance of the proposed dataset: location revealing factor, part-of-speech tagging factor, relative frequency, and prominence factor.