Abstract

Data normalization is a pre-processing approach in which the data are scaled or transformed so that each feature contributes equally. The success of machine learning algorithms depends on the quality of the data used to obtain a generalized predictive model of the classification problem. The importance of data normalization for improving data quality, and subsequently the performance of machine learning algorithms, has been demonstrated in many studies. However, such work is lacking for feature selection and feature weighting approaches, a current research trend in machine learning for improving performance. Therefore, this study investigates the impact of fourteen data normalization methods on classification performance under three settings: the full feature set, feature selection, and feature weighting. We also present a modified Ant Lion Optimization that searches for feature subsets and the best feature weights, along with the parameter of the Nearest Neighbor classifier. Experiments are performed on 21 publicly available real and synthetic datasets, and results are analyzed in terms of accuracy, the percentage of features reduced, and runtime. The results show that no single method outperforms all others. We therefore suggest a set of best and worst methods by combining the normalization procedure with an empirical analysis of the results. The better performers are z-Score and Pareto Scaling for the full feature set and feature selection, and tanh and its variant for feature weighting. The worst performers are the Mean Centered, Variable Stability Scaling, and Median and Median Absolute Deviation methods, along with un-normalized data.
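
For reference, a few of the normalization methods named above can be sketched with their commonly cited textbook formulas. This is a minimal NumPy sketch, not the paper's implementation; details such as the 0.01 damping factor in the tanh estimator are standard choices in the literature and may differ from those used in the study.

import numpy as np

def z_score(X):
    # Center each feature by its mean, scale by its standard deviation.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pareto_scaling(X):
    # Like z-Score, but divides by the square root of the standard
    # deviation, shrinking large-variance features less aggressively.
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0))

def tanh_normalization(X):
    # tanh estimator: squashes standardized values into the interval (0, 1).
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    return 0.5 * (np.tanh(0.01 * z) + 1.0)

def median_mad(X):
    # Median and Median Absolute Deviation: a robust analogue of z-Score.
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    return (X - med) / mad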
