Abstract
Dynamic features applications present new obstacles for the selection of streaming features. Such applications have two defining characteristics: a) features are processed sequentially while the number of instances is fixed; and b) the feature space does not exist in advance. For example, in a text classification task for spam detection, new features (e.g. words) are generated dynamically and must be mined as they arrive to filter out spam, rather than waiting until all features have been collected. Traditional feature selection methods, which are not designed for streaming features applications, cannot be used in such an environment because they require the full feature space in advance in order to statistically determine the representative features. Existing methods that do address feature selection in dynamic features applications require class labels to select the representative features. However, most real-life data is unlabeled, and manual labeling is costly. In this paper, an efficient unsupervised feature selection method is proposed for streaming features applications where the number of features increases while the number of instances remains fixed. In particular, Unsupervised Feature Selection for Dynamic Features (UFSSF) is developed to determine the representative streaming features without requiring prior knowledge of class labels or representative features. UFSSF extends k-means clustering to cumulatively determine whether a newly arrived feature should be selected as a representative streaming feature or discarded. Experimental results show that UFSSF compares favorably with other benchmark methods in both accuracy and execution time.
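The select-or-discard decision described in the abstract can be illustrated with a minimal sketch. This is not the UFSSF algorithm itself, whose details are not given here; it is an assumed online k-means variant in which each arriving feature is a vector over the fixed set of instances, and a feature is kept only if it sits closer to its cluster centroid than that cluster's current representative. The class name `StreamingFeatureSelector`, the method names, and the replacement rule are all hypothetical.

```python
import numpy as np


class StreamingFeatureSelector:
    """Illustrative online k-means over features (an assumption, not the paper's UFSSF)."""

    def __init__(self, k):
        self.k = k
        self.centroids = []        # one centroid per cluster, shape (n_instances,)
        self.counts = []           # number of features absorbed by each cluster
        self.representative = []   # (feature_id, standardized vector) per cluster

    @staticmethod
    def _standardize(values):
        # Put features on a common scale so distances are comparable.
        x = np.asarray(values, dtype=float)
        std = x.std()
        return (x - x.mean()) / std if std > 0 else x - x.mean()

    def observe(self, feature_id, values):
        """Process one arriving feature; return True if it is selected."""
        x = self._standardize(values)
        # The first k features seed the clusters and are selected by default.
        if len(self.centroids) < self.k:
            self.centroids.append(x.copy())
            self.counts.append(1)
            self.representative.append((feature_id, x))
            return True
        # Assign the feature to its nearest centroid and update that centroid
        # with an incremental mean, so no earlier feature needs to be revisited.
        j = int(np.argmin([np.linalg.norm(x - c) for c in self.centroids]))
        self.counts[j] += 1
        self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
        # Hypothetical rule: replace the representative only if the new
        # feature is strictly closer to the updated centroid.
        _, rep_vec = self.representative[j]
        new_dist = np.linalg.norm(x - self.centroids[j])
        old_dist = np.linalg.norm(rep_vec - self.centroids[j])
        if new_dist < old_dist:
            self.representative[j] = (feature_id, x)
            return True
        return False

    def selected(self):
        """Identifiers of the current representative features."""
        return [fid for fid, _ in self.representative]
```

Under these assumptions, a feature that duplicates an existing representative is discarded on arrival, and at most `k` features are retained at any time, regardless of how many features have streamed past.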
Highlights
High-dimensional data degrades the performance of machine learning algorithms in dynamic features applications
Mitra’s method [13] involves three similarity measures: Least Square Regression Error (LSRE), Pearson Correlation Coefficient (PCC), and Maximal Information Compression Index (MICI), while SPEC can work with the RBF kernel similarity measure
This paper develops an unsupervised feature selection method that reduces the dimensionality of streaming features applications, in which the feature space is dynamic
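The three pairwise measures named in the highlights can be sketched as follows. The text above only names them, so the formulas here are assumptions based on how these measures are commonly defined for feature similarity: PCC as the normalized covariance, LSRE as the residual variance var(y)(1 − ρ²) left after a linear regression of y on x, and MICI as the smallest eigenvalue of the 2×2 covariance matrix of the pair. The function names `pcc`, `lsre`, and `mici` are hypothetical.

```python
import numpy as np


def pcc(x, y):
    """Pearson correlation coefficient; returns 0.0 if either feature is constant."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sx, sy = x.std(), y.std()
    if sx == 0 or sy == 0:
        return 0.0
    return float(np.mean((x - x.mean()) * (y - y.mean())) / (sx * sy))


def lsre(x, y):
    """Least square regression error: variance of y not explained linearly by x."""
    return float(np.var(y) * (1.0 - pcc(x, y) ** 2))


def mici(x, y):
    """Maximal information compression index: smallest eigenvalue of cov(x, y)."""
    cov = np.cov(x, y, bias=True)             # 2x2 population covariance matrix
    return float(np.linalg.eigvalsh(cov)[0])  # eigvalsh returns ascending order
```

With these definitions, LSRE and MICI are both zero when two features are exactly linearly related, so a low value signals that one of the pair is redundant.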
Summary
High-dimensional data degrades the performance of machine learning algorithms in dynamic features applications. Feature selection methods have been applied to identify the representative features of data streams and thereby remove obstacles related to data dimensionality. However, current feature selection methods cannot be applied effectively to streaming features applications, where features arrive sequentially. In the streaming data category, the number of instances is dynamic while the number of features is fixed. The focus here is instead on the streaming features category, where the number of instances is fixed and the number of features is dynamic. These features are processed one by one upon their arrival. One real-world application that can be categorized as streaming features