The challenge of obtaining large amounts of high-quality labeled data is compounded by the fact that data labeling is often subjective and requires significant human effort. In many cases, the quality of the labeled data depends entirely on the expertise and experience of human annotators, which makes it difficult to ensure labeling accuracy in large and dynamic datasets. Moreover, there may be a significant delay between the arrival of a new instance and its manual labeling. This paper explores the use of fully unsupervised feature selection algorithms in non-stationary data streams, where the importance of features may change over time. We introduce a novel feature selection algorithm called Online Fast FEature SELection (OFFESEL), which calculates feature importance scores in each incoming window based on the features' mean normalized values, without using any class labels. We evaluate OFFESEL on 17 benchmark data streams, both stationary and non-stationary, using popular online classifiers such as PerceptronMask, VFDT, Online Boosting, and Linear SVM. We compare OFFESEL to several other feature selection algorithms, including state-of-the-art supervised ones such as FIRES and ABFS, as well as popular unsupervised ones such as MCFS, LS, and Max Variance, which we adapted to data streams. Our results indicate that OFFESEL outperforms all supervised and unsupervised feature selection algorithms in terms of classification accuracy. Specifically, OFFESEL preserves the accuracy level of the supervised FIRES algorithm, which proved more accurate than ABFS in our experiments, while maintaining the accuracy level achieved by the unsupervised Max Variance algorithm. Moreover, OFFESEL requires even less computation time than Max Variance and shows high stability on stationary datasets. Overall, our study demonstrates the potential benefits of using unlabeled data for feature ranking and selection in dynamic data streams.
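
To make the idea of label-free, per-window feature scoring concrete, the following minimal Python sketch scores each feature in one window by the mean of its normalized values. It is an illustration only and not the OFFESEL implementation: the abstract does not state the exact scoring formula, so the use of min-max normalization within the window, the convention that a higher mean ranks higher, and the function names `window_feature_scores` and `select_top_k` are all assumptions made here.

```python
import numpy as np


def window_feature_scores(window):
    """Return one score per feature: the mean of its min-max-normalized values.

    Assumption: min-max normalization within the window. The abstract only
    says scores are based on "mean normalized values".
    """
    window = np.asarray(window, dtype=float)
    low = window.min(axis=0)
    span = window.max(axis=0) - low
    # Constant features have span 0; dividing by 1 instead leaves them at 0.
    safe_span = np.where(span > 0, span, 1.0)
    normalized = (window - low) / safe_span
    return normalized.mean(axis=0)


def select_top_k(window, k):
    """Return indices of the k highest-scoring features (assumed ranking rule)."""
    scores = window_feature_scores(window)
    return np.argsort(scores)[::-1][:k]


# One window of 3 instances and 3 features; no class labels are involved.
window = [[0, 10, 5],
          [1, 10, 5],
          [1, 10, 7]]
scores = window_feature_scores(window)
top2 = select_top_k(window, 2)
```

In a streaming setting, such a function would be called once per incoming window, so the selected feature subset can change as the data distribution drifts.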