Abstract

High-dimensional data are difficult to analyze, which makes dimensionality reduction essential; variable selection is a widely used form of dimensionality reduction that also improves model interpretability. This paper proposes a variable selection method based on information gain and Fisher's criterion reselection iteration (IFRI). The method uses two classical feature selection functions to select variables strongly associated with the property of interest, which improves the interpretability of the selected variables. First, information gain is used to pre-select “global” feature variables with a strong ability to distinguish between classes. Fisher's criterion is then applied to re-select, from these “global” variables, the key variables that give larger inter-class variance and smaller intra-class variance for the samples. Pearson correlation coefficients are next used to eliminate variables with strong linear correlation, and a series of variable subsets is obtained iteratively. Finally, each subset is evaluated by cross-validation (CV), and the subset with the lowest root mean square error of cross-validation (RMSECV) is taken as optimal. Combined with partial least squares (PLS), IFRI was applied to the analysis of nicotine and total sugar content in tobacco samples. Comparison with two representative variable selection algorithms, competitive adaptive reweighted sampling (CARS) and Monte Carlo uninformative variable elimination (MC-UVE), shows that IFRI is an effective and applicable method for building high-performance models from a few key variables. Unlike most methods, IFRI does not depend on PLS model parameters during variable selection, which makes it more interpretable. Moreover, the variables selected by IFRI are reproducible for the same dataset.
Because of its underlying principle, the method is not limited to spectral data: combined with other modeling methods, it can be applied to variable selection for different types of high-dimensional data. For example, it could be combined with qualitative discriminant models to classify tobacco, or used to select the critical chemical components that affect tobacco quality.
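
The selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The equal-width binning used for information gain, the parameters `n_global`, `n_key`, and `r_max`, the single-pass greedy correlation filter, and all function names are assumptions made for illustration. The sketch also assumes that each sample carries a class label, and it omits the iterative generation of subsets and the PLS cross-validation (RMSECV) step.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, bins=4):
    """Entropy reduction in the labels after equal-width binning of one variable.

    Equal-width binning is an assumption; the abstract does not say how
    continuous variables are discretized.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant column: every value falls in bin 0
    groups = {}
    for v, y in zip(values, labels):
        b = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        groups.setdefault(b, []).append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def fisher_score(values, labels):
    """Between-class scatter divided by within-class scatter for one variable."""
    mean_all = sum(values) / len(values)
    between = within = 0.0
    for cls in set(labels):
        members = [v for v, y in zip(values, labels) if y == cls]
        mean_c = sum(members) / len(members)
        between += len(members) * (mean_c - mean_all) ** 2
        within += sum((v - mean_c) ** 2 for v in members)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_variables(columns, labels, n_global=3, n_key=2, r_max=0.95):
    """Return indices of selected variables; each column holds one variable.

    Hypothetical single pass: rank by information gain, re-rank the top
    n_global by Fisher score, then keep up to n_key variables whose absolute
    correlation with every already-kept variable does not exceed r_max.
    """
    by_ig = sorted(range(len(columns)),
                   key=lambda j: information_gain(columns[j], labels),
                   reverse=True)
    by_fisher = sorted(by_ig[:n_global],
                       key=lambda j: fisher_score(columns[j], labels),
                       reverse=True)
    kept = []
    for j in by_fisher:
        if all(abs(pearson(columns[j], columns[k])) <= r_max for k in kept):
            kept.append(j)
        if len(kept) == n_key:
            break
    return kept


# Toy data (made up): four variables measured on six samples in two classes.
labels = ["a", "a", "a", "b", "b", "b"]
columns = [
    [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],  # separates the classes well
    [2.0, 2.2, 1.8, 6.0, 6.2, 5.8],  # exactly twice the first variable
    [0.5, 2.5, 1.5, 0.6, 2.4, 1.4],  # unrelated to the class
    [1.0, 1.4, 0.8, 2.0, 2.6, 2.2],  # separates the classes, but less sharply
]
print(select_variables(columns, labels, n_global=3, n_key=2, r_max=0.99))
```

In a full pipeline, each candidate subset returned by such a filter would then be scored by cross-validated PLS, and the subset with the lowest RMSECV retained, as the abstract describes.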
