Abstract

Multilabel learning is a challenging task demanding scalable methods for large-scale data. Feature selection has been shown to improve multilabel accuracy while defying the curse of dimensionality of high-dimensional scattered data. However, the increasing complexity of multilabel feature selection, especially on continuous features, requires new approaches to manage data effectively and efficiently in distributed computing environments. This article proposes a distributed model, implemented on Apache Spark, that adapts mutual information (MI) to continuous features and multiple labels. Two approaches are presented, based on MI maximization (MIM) and on minimum redundancy and maximum relevance (mRMR). The former selects the subset of features that maximizes the MI between the features and the labels, whereas the latter additionally minimizes the redundancy between the selected features. Experiments compare the distributed multilabel feature selection methods on 10 data sets and 12 metrics. Results validated through statistical analysis indicate that our methods outperform reference methods for distributed feature selection on multilabel data, while MIM also reduces the runtime by orders of magnitude.
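
The difference between the two selection criteria can be illustrated with a minimal single-machine sketch. It is not the distributed Spark implementation described in this article, and it rests on several assumptions made only for illustration: features are already discretized, the relevance of a feature is taken as the sum of its MI with each label, and mRMR uses the common "relevance minus mean redundancy" greedy score. The function names (`mutual_information`, `relevance`, `select_mim`, `select_mrmr`) are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Empirical MI (in nats) between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum over observed (a, b) pairs of p(a, b) * log(p(a, b) / (p(a) * p(b))).
    return sum((c / n) * log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def relevance(feature, labels):
    """Assumed multilabel relevance: sum of MI between the feature and each label."""
    return sum(mutual_information(feature, label) for label in labels)


def select_mim(features, labels, k):
    """MIM: rank features by relevance alone and keep the top k indices."""
    scores = [relevance(f, labels) for f in features]
    ranked = sorted(range(len(features)), key=scores.__getitem__, reverse=True)
    return ranked[:k]


def select_mrmr(features, labels, k):
    """Greedy mRMR: relevance minus mean MI with the already selected features."""
    rel = [relevance(f, labels) for f in features]
    selected = [max(range(len(features)), key=rel.__getitem__)]
    while len(selected) < min(k, len(features)):
        def score(j):
            redundancy = sum(mutual_information(features[j], features[s])
                             for s in selected) / len(selected)
            return rel[j] - redundancy
        candidates = [j for j in range(len(features)) if j not in selected]
        selected.append(max(candidates, key=score))
    return selected
```

Both functions take `features` and `labels` as lists of columns (one sequence per feature or label) and return the indices of the selected features. On data where one feature is an exact copy of another, the MIM ranking tends to keep both copies because it scores each feature independently, whereas the mRMR score penalizes the copy for its redundancy with the feature already selected. MIM needs only one relevance score per feature, while mRMR also evaluates feature-to-feature MI at every greedy step.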
