Abstract

Companies increasingly have access to very large datasets within their domain. Analysing these datasets often requires feature selection to reduce the dimensionality of the data and prioritize features for downstream knowledge generation tasks. Effective feature selection is a key part of clustering, regression and classification, and it presents a myriad of opportunities to improve the machine learning pipeline: eliminating redundant and irrelevant features, reducing model over-fitting, speeding up model training and making models more explainable. Yet, despite the widespread availability and use of categorical data in practice, feature selection for categorical and/or mixed data has received relatively little attention compared to numerical data. Furthermore, existing feature selection methods for mixed data are sensitive to the number of objects, with time complexities that are nonlinear in that number. In this work, we propose a generic multiple association measure for mixed datasets and a novel feature selection algorithm that uses multiple association across features. Our algorithm is built on the premise that the most representative set of chosen features should be as diverse and as minimally dependent on one another as possible. It formulates feature selection as an optimization problem, searching for the set of features with minimum association amongst them. We present a generic multiple association measure and two associated feature selection algorithms: the Naive Feature Selection Algorithm (NFSA) and the Greedy Feature Selection Algorithm (GFSA). GFSA is evaluated on 15 benchmark datasets and compared to four existing state-of-the-art feature selection techniques. We demonstrate that our approach provides comparable downstream classification performance, outperforming other leading techniques on several datasets.
Both time complexity analysis and experimental results show that our proposed algorithm significantly reduces the processing time required for unsupervised feature selection, especially for long datasets with a very large number of objects, whilst also yielding comparable clustering and classification performance. Conversely, we do not recommend our approach for wide datasets, where the number of features is very large relative to the number of objects, e.g., image, text and genome datasets.
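To make the idea concrete, the greedy strategy of choosing features that are minimally associated with those already selected can be sketched as follows. This is a minimal illustration only, not the paper's algorithm: it substitutes pairwise Cramér's V for the generic multiple association measure, and all function and variable names (`cramers_v`, `greedy_select`) are hypothetical.

```python
import math
from collections import Counter

def cramers_v(x, y):
    """Cramér's V between two categorical sequences.

    Ranges from 0 (no association) to 1 (perfect association).
    Stand-in for the paper's generic multiple association measure.
    """
    n = len(x)
    px, py = Counter(x), Counter(y)
    joint = Counter(zip(x, y))
    # Pearson chi-squared statistic over the full contingency table,
    # including cells with zero observed counts.
    chi2 = 0.0
    for a, na in px.items():
        for b, nb in py.items():
            expected = na * nb / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    denom = n * (min(len(px), len(py)) - 1)
    return math.sqrt(chi2 / denom) if denom > 0 else 0.0

def greedy_select(columns, k):
    """Greedily pick k features with minimum association to those already chosen.

    columns: dict mapping feature name -> list of categorical values.
    """
    names = list(columns)
    assoc = {(i, j): cramers_v(columns[i], columns[j])
             for i in names for j in names if i != j}
    # Seed with the feature least associated with all the others.
    first = min(names, key=lambda f: sum(assoc[(f, g)] for g in names if g != f))
    selected = [first]
    while len(selected) < k:
        rest = [f for f in names if f not in selected]
        # Add the feature with the lowest total association to the selected set.
        best = min(rest, key=lambda f: sum(assoc[(f, g)] for g in selected))
        selected.append(best)
    return selected

# Usage: 'B' duplicates 'A', while 'C' carries different information,
# so the selection keeps 'C' and only one of the duplicated pair.
columns = {
    "A": ["x", "x", "y", "y", "x", "y"],
    "B": ["x", "x", "y", "y", "x", "y"],  # exact copy of A
    "C": ["p", "q", "p", "q", "p", "q"],
}
selected = greedy_select(columns, k=2)
```

Note that this pairwise sketch scales quadratically in the number of features, which mirrors the caveat above about wide datasets; its cost per association is linear in the number of objects.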

