A Dynamic Two-Layers MI and Clustering-based Ensemble Feature Selection for Multi-Labels Text Classification

Adil Yaseen Taha, Ali Sabah, Masri Ayob, Abdul Hadi, Sabrina Tiun

doi:10.14569/ijacsa.2020.0110764

Abstract

Multi-label text classification addresses problems in which each sample is associated with multiple labels. Text data suffers from high dimensionality. To resolve this issue, a feature selection (FS) method can be applied to efficiently remove noisy, irrelevant, and redundant features. Multi-label FS is a powerful tool for solving the high-dimensionality problem. To handle the correlation and high-dimensionality problems in multi-label text classification, this paper investigates various heterogeneous FS ensemble schemes. In addition, this paper proposes an enhanced FS method called the dynamic multi-label two-layers MI and clustering-based ensemble feature selection algorithm (DMMC-EFS). The proposed method considers: 1) the dynamic global weight of each feature, 2) a heterogeneous ensemble, and 3) maximum dependency and relevancy and minimum redundancy of features. This method aims to overcome the high dimensionality of multi-label datasets and to improve multi-label text classification. We conducted experiments on three benchmark datasets: Reuters-21578, Bibtex, and Enron. The experimental results show that DMMC-EFS significantly outperformed other state-of-the-art conventional and ensemble multi-label FS methods.

Highlights

In multi-label text classification, each sample is related to one or more classes at the same time
This paper presents a scalable multi-label classification method that can handle the high-dimensionality problem of multi-label datasets
This paper proposes a new dynamic multi-label two-layers mutual information (MI) and clustering-based ensemble feature selection (FS) method (DMMC-EFS) that takes into account 1) the dynamic global weight of the feature, 2) a heterogeneous ensemble, and 3) maximum dependency and relevancy and minimum redundancy of the features
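
The "maximum relevancy and minimum redundancy" criterion named in the highlights can be illustrated with a minimal sketch. This is not the DMMC-EFS algorithm itself: it is a generic greedy MI-based selection, and it assumes discrete features, binary labels, and relevance defined as the mean MI between a feature and each label. The function names `mutual_information` and `select_features` are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_features(X, Y, k):
    """Greedy max-relevance, min-redundancy selection (illustrative only).

    X: list of feature columns, each a list of discrete values per sample.
    Y: list of label columns, each a list of 0/1 values per sample.
    Returns the indices of up to k features in the order they were chosen.
    """
    # Assumption: relevance is the mean MI between a feature and every label.
    relevance = [sum(mutual_information(f, y) for y in Y) / len(Y) for f in X]
    selected = []
    remaining = list(range(len(X)))

    def score(j):
        if not selected:
            return relevance[j]
        # Redundancy: mean MI between the candidate and already chosen features.
        redundancy = sum(
            mutual_information(X[j], X[s]) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this sketch, a feature that duplicates an already selected one is penalised by its redundancy term, so a feature that is informative about a different label is preferred over the duplicate.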

Summary

Introduction

In multi-label text classification, each sample is related to one or more classes at the same time. The key difference between multi-label learning and single-label learning is that the labels in multi-label learning are related and inclusive. The problems related to multi-label learning are more challenging to solve. In the field of machine learning and data mining, multi-label learning is a demanding task that suffers greatly from high dimensionality [1] [2]. A limitation of the multi-label text learning process is the significant amount of irrelevant, redundant, and disruptive information. The number of features involved is usually large. The high dimensionality of multi-label text data results in challenges such as poor performance, over-fitting, and increased computational and classification complexity. Some existing multi-label feature selection (FS) methods can

Results

Discussion

Conclusion