PBoostGA: pseudo-boosting genetic algorithm for variable ranking and selection

Chun-Xia Zhang,Sang-Woon Kim,Jiang-She Zhang

doi:10.1007/s00180-016-0652-8

Abstract

Variable selection has consistently been a hot topic in linear regression models, especially when facing with high-dimensional data. Variable ranking, an advanced form of selection, is actually more fundamental since selection can be realized by thresholding once the variables are ranked suitably. In recent years, ensemble learning has gained a significant interest in the context of variable selection due to its great potential to improve selection accuracy and to reduce the risk of falsely including some unimportant variables. Motivated by the widespread success of boosting algorithms, a novel ensemble method PBoostGA is developed in this paper to implement variable ranking and selection in linear regression models. In PBoostGA, a weight distribution is maintained over the training set and genetic algorithm is adopted as its base learner. Initially, equal weight is assigned to each instance. According to the weight updating and ensemble member generating mechanism like AdaBoost.RT, a series of slightly different importance measures are sequentially produced for each variable. Finally, the candidate variables are ordered in the light of the average importance measure and some significant variables are then selected by a thresholding rule. Both simulation results and a real data illustration show the effectiveness of PBoostGA in comparison with some existing counterparts. In particular, PBoostGA has stronger ability to exclude redundant variables.

Full Text