Abstract

High dimensionality (the presence of too many features) plagues many datasets, including those mined from personality profiles. Feature selection reduces the number of features, and many strategies have been proposed for selecting the most important features from a larger group. Feature rankers produce a metric for each feature and return the top features for a given subset size; filter-based subset evaluation performs statistical analysis on whole subsets; and wrapper-based subset selection builds classification models on candidate feature subsets to decide which features matter most for model-building. While all three approaches have been discussed in the literature, relatively little work compares them directly with one another. In the present study, we do precisely this, considering feature ranking, filter-based subset evaluation, and wrapper-based subset selection (along with no feature selection as a baseline) on two datasets for predicting interaction with bots on Twitter. For the two subset-based techniques, we consider two search techniques (Best First and Greedy Stepwise) to build the subsets, while for feature ranking we use one ranker (ROC), chosen for its excellent performance in previous work. Six learners are used to build models with the selected features. We find that feature ranking consistently performs well, giving the best results for four of the six learners on both datasets. Moreover, all of the techniques other than feature ranking perform worse than no feature selection for four of the six learners. This leads us to recommend feature ranking over more complex subset evaluation techniques.
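
To make the feature-ranking approach concrete, the following is a minimal sketch of ROC-based ranking in the style described above: each feature is scored individually by the area under the ROC curve it achieves as a standalone discriminator, the features are ranked by that score, and a learner is built on the top-k subset. The scikit-learn workflow, the synthetic data, and the subset size k are illustrative assumptions, not the authors' actual setup.

```python
# Sketch of ROC (AUC) feature ranking, assuming a scikit-learn workflow.
# The dataset and the subset size k are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=50, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Score each feature by how well it alone separates the classes (AUC).
# An AUC of 0.5 is uninformative in either direction, so rank by |AUC - 0.5|.
aucs = np.array([roc_auc_score(y_train, X_train[:, j])
                 for j in range(X_train.shape[1])])
ranking = np.argsort(-np.abs(aucs - 0.5))

k = 10  # hypothetical subset size
selected = ranking[:k]

# Build a learner on the top-k ranked features only.
model = RandomForestClassifier(random_state=0)
model.fit(X_train[:, selected], y_train)
print("Test accuracy with top-%d features: %.3f"
      % (k, model.score(X_test[:, selected], y_test)))
```

Unlike the subset-based techniques, this ranking step evaluates each feature independently, which is why it avoids the combinatorial search (Best First or Greedy Stepwise) that filter- and wrapper-based subset evaluation require.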
