Abstract

The number of Amharic digital documents has grown rapidly, which makes automatic text classification increasingly important. Feature selection plays a crucial role in both classification accuracy and computational time, and choosing the right features matters most when the initial feature set is very large. In this paper, we present a hybrid feature selection method, called IGCHIDF, which combines the information gain (IG), chi-square (CHI), and document frequency (DF) feature selection methods. We evaluate the proposed method on two datasets: dataset 1, containing 9 news categories, and dataset 2, containing 13 news categories. Our experimental results show that the proposed method performs better than the other methods on both datasets. On dataset 2, the classification accuracy of IGCHIDF is up to 3.96% higher than that of IG, up to 11.16% higher than that of CHI, and 7.3% higher than that of DF.

Highlights

  • Amharic is one of the Ethiopian languages, grouped under the Semitic branch of the Afro-Asiatic language family

  • We present a hybrid feature selection method, called IGCHIDF, which combines the information gain (IG), chi-square (CHI), and document frequency (DF) feature selection methods

  • Therefore, the aim of this paper is to present a hybrid feature selection strategy for Amharic news document classification that improves the classifier’s accuracy. The proposed method uses IG, CHI, and DF as feature selection methods, a union to combine the highly ranked features, and an intersection to join the least-ranked features selected by the IG, CHI, and DF methods
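
The union and intersection step described in the last highlight can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names `rank` and `hybrid_select`, the cut-off parameters `top_k` and `low_k`, and the toy scores are all assumptions made for illustration. The sketch takes the union of each method's top-ranked terms and adds only those lower-ranked terms that all three methods agree on.

```python
from typing import Dict, List, Set


def rank(scores: Dict[str, float]) -> List[str]:
    """Sort terms by descending score; ties are broken alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))


def hybrid_select(scores_by_method: Dict[str, Dict[str, float]],
                  top_k: int, low_k: int) -> Set[str]:
    """Hypothetical combination rule (an assumption, not the paper's code):
    union of each method's top_k terms, plus the intersection of each
    method's next low_k terms."""
    top_sets, low_sets = [], []
    for scores in scores_by_method.values():
        ranked = rank(scores)
        top_sets.append(set(ranked[:top_k]))
        low_sets.append(set(ranked[top_k:top_k + low_k]))
    high = set().union(*top_sets)                       # highly ranked: union
    low = set.intersection(*low_sets) if low_sets else set()  # lower ranked: intersection
    return high | low


# Toy scores for five terms under each method (made-up numbers).
toy_scores = {
    "IG":  {"a": 0.9, "b": 0.5, "c": 0.4, "d": 0.1, "e": 0.05},
    "CHI": {"a": 3.0, "b": 8.0, "c": 2.0, "d": 0.5, "e": 0.1},
    "DF":  {"a": 5, "b": 1, "c": 7, "d": 10, "e": 2},
}
selected = hybrid_select(toy_scores, top_k=1, low_k=2)
print(sorted(selected))
```

In this toy run, each method contributes its single best term to the union, and term `c` is added because it falls in the lower-ranked slice of all three methods, while `e` is dropped. How the actual cut-offs are chosen is not stated in this summary.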


Summary

Introduction

Amharic is one of the Ethiopian languages, grouped under the Semitic branch of the Afro-Asiatic language family. Existing work on Amharic text classification has focused mainly on the performance of the classification algorithm [14,15,16], not on the feature selection method. Feature selection methods such as information gain (IG), chi-square (CHI), and document frequency (DF) can be used to overcome the curse of dimensionality by eliminating irrelevant features and selecting the most valuable features from the corpus. Therefore, the aim of this paper is to present a hybrid feature selection strategy for Amharic news document classification that improves the classifier’s accuracy.
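
To make the three scoring methods concrete, the sketch below computes DF, IG, and CHI for a single term using their common textbook definitions. It is an illustration under stated assumptions, not the paper's code: the `Doc` type, the function names, the choice to take the maximum chi-square value over categories, and the four-document English toy corpus are all hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Set, Tuple

Doc = Tuple[Set[str], str]  # (set of tokens, category label); assumed layout


def entropy(counts: Iterable[int]) -> float:
    """Shannon entropy (base 2) of a list of class counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts)


def document_frequency(term: str, docs: List[Doc]) -> int:
    """Number of documents that contain the term."""
    return sum(1 for tokens, _ in docs if term in tokens)


def information_gain(term: str, docs: List[Doc]) -> float:
    """Class entropy minus the entropy left after splitting on the term."""
    n = len(docs)
    all_labels = Counter(label for _, label in docs)
    with_t = Counter(label for tokens, label in docs if term in tokens)
    without_t = all_labels - with_t
    n_t = sum(with_t.values())
    conditional = ((n_t / n) * entropy(with_t.values())
                   + ((n - n_t) / n) * entropy(without_t.values()))
    return entropy(all_labels.values()) - conditional


def chi_square(term: str, docs: List[Doc]) -> float:
    """Chi-square from the 2x2 term/category table, maximized over categories."""
    n = len(docs)
    best = 0.0
    for cat in {label for _, label in docs}:
        a = sum(1 for tok, lab in docs if term in tok and lab == cat)
        b = sum(1 for tok, lab in docs if term in tok and lab != cat)
        c = sum(1 for tok, lab in docs if term not in tok and lab == cat)
        d = sum(1 for tok, lab in docs if term not in tok and lab != cat)
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        if denom:
            best = max(best, n * (a * d - c * b) ** 2 / denom)
    return best


# Toy corpus (English placeholders; the paper's corpus is Amharic news).
corpus: List[Doc] = [
    ({"goal", "match"}, "sport"),
    ({"goal", "team"}, "sport"),
    ({"vote", "party"}, "politics"),
    ({"vote", "team"}, "politics"),
]
for term in ("goal", "team"):
    print(term, document_frequency(term, corpus),
          information_gain(term, corpus), chi_square(term, corpus))
```

In this toy corpus, "goal" and "team" have the same document frequency, but only "goal" separates the two categories, so it receives high IG and CHI scores while "team" scores zero on both. This kind of disagreement between methods is what a hybrid strategy can exploit.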

