Abstract

Document classification, the task of assigning documents to one or more classes, is an open problem in library, information, and computer science. Interest among linguistic researchers in this domain has grown steadily, driven by applications such as language identification, readability assessment, sentiment analysis, and spam filtering. However, researchers focussing on natural language processing for resource-scarce languages have faced many hurdles due to the absence of benchmark datasets. Bengali is among the most widely spoken resource-scarce, or low-resource, languages. Although Bengali NLP researchers have endeavoured to create their own datasets, these are useful only for evaluating the performance of the document classification techniques proposed alongside them. There is therefore a gap in the literature regarding the availability of benchmark datasets. To overcome this barrier, this paper presents a benchmark dataset for Bengali document classification that is publicly accessible and freely available. The dataset has a two-tier architecture: the first tier supports hard classification techniques and the second tier supports soft classification techniques. Hard classification techniques use supervised learning models to classify documents, whereas soft classification techniques use unsupervised learning models to cluster documents. The proposed dataset has thirteen unique characteristics. This paper also introduces four new feature sets for evaluating performance on the proposed dataset: location revealing factor, part of speech tagging factor, relative frequency, and prominence factor.
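
The abstract names relative frequency as one of the four feature sets but does not define it. As a minimal sketch, assuming it means a term's count divided by the total number of tokens in the document (the paper's own definition may differ), it could be computed as follows. The function name, the sample tokens, and the whitespace tokenization are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def relative_frequency(tokens):
    """Map each term to its count divided by the document length.

    Assumption: this is the plain term-frequency ratio; the paper may
    define its relative frequency feature differently.
    """
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical Bengali document, tokenized on whitespace for illustration.
doc = "খেলা দল খেলা জয়".split()
features = relative_frequency(doc)
```

Feature vectors of this kind could feed either tier of the dataset: a supervised classifier for hard classification or a clustering model for soft classification.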
