In the era of data-driven decision-making, managing dynamic data streams characterised by evolving data distributions and high dimensionality presents a formidable challenge for online feature selection. This research addresses the challenge by optimising Online Feature Selection (OFS) to manage feature irrelevancy and redundancy, and by rigorously validating the proposed method on high-dimensional dynamic data streams. The research follows a structured methodology and introduces a novel method: Dynamic Particle Swarm Optimisation (PSO)-based Threshold Optimisation. Dynamic PSO is applied to three benchmark OFS methods, namely Online Streaming Feature Selection (OSFS), Fast-OSFS, and Scalable and Accurate Online Feature Selection for Big Data (SAOLA), so that the global best position is dynamically adjusted to exploit the best solution found in the streaming data. Two optimisation variants of the proposed method are presented: RedundantPSO, which optimises redundant features, and IrrelevantPSO, which optimises irrelevant features. Unlike traditional PSO-based feature selection, which uses a feature encoding representation, the proposed method is underpinned by two contributions: an adaptive threshold particle representation and an enhanced fitness function that minimises the mean absolute deviation of dependency among feature subsets. The adaptive threshold particle representation combines the feature encoding part with a novel component that defines a significance-level threshold ranging from 0.01 to 0.1. This component distinguishes the method by enabling the threshold value to be adjusted adaptively based on incoming features. In addition, Mean Absolute Deviation (MAD) is integrated into the PSO fitness evaluation to obtain a more accurate and reliable measure of fitness for feature selection.
During the experimental phase, we analysed various benchmark datasets exhibiting high redundancy and high relevancy. The analysis showed that selecting appropriate threshold values significantly improved model performance on high-redundancy datasets, highlighting the need for careful threshold selection. Integrating RedundantPSO with OSFS (OSFS+RedundantPSO) improved the accuracy of the OSFS method, achieving an average accuracy of 76.8% with occasional gains of up to 3.8% over the baseline OSFS accuracy, which made OSFS+RedundantPSO the top-performing combination. Fast-OSFS+RedundantPSO outperformed Fast-OSFS by a slight margin, reaching an average accuracy of 72.7%, while SAOLA+RedundantPSO achieved a 3.1% increase in average accuracy over SAOLA, reaching 74.0%. Notably, the threshold value found by the proposed method also helped identify the behaviour of a dataset, whether high relevancy or high redundancy, even in the absence of prior domain knowledge: a higher threshold signifies the evaluation of a more redundant feature space. In conclusion, the results demonstrate that the method enhances model accuracy, adapts to evolving data distributions, and optimises feature subsets with acceptable runtime. The research aims to advance data science in domains such as cybersecurity, finance and healthcare, while empowering end-users to make informed decisions as data stream conditions change.
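
The threshold search with an MAD-based fitness described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the PSO coefficients (`w`, `c1`, `c2`), the use of per-feature p-values and dependency scores as inputs, and the exact form of the fitness are all assumptions made for illustration. Only the threshold range of 0.01 to 0.1 and the idea of minimising a mean absolute deviation come from the abstract. The feature encoding part of the particle and the dynamic re-adjustment of the global best on streaming data are omitted.

```python
import random

# Significance-level range stated in the abstract.
ALPHA_MIN, ALPHA_MAX = 0.01, 0.1


def mean_absolute_deviation(values):
    """Mean absolute deviation of the values around their mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


def fitness(alpha, p_values, dependency):
    """Hypothetical fitness (assumption): MAD of the dependency scores of the
    features whose p-value passes the threshold alpha. Lower is better, and an
    empty subset is penalised with infinity."""
    selected = [d for p, d in zip(p_values, dependency) if p <= alpha]
    if not selected:
        return float("inf")
    return mean_absolute_deviation(selected)


def pso_threshold(p_values, dependency, n_particles=10, n_iters=30,
                  w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for a threshold in [ALPHA_MIN, ALPHA_MAX] with a basic
    global-best PSO. Each particle is a single threshold value."""
    rng = random.Random(seed)
    pos = [rng.uniform(ALPHA_MIN, ALPHA_MAX) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = list(pos)
    pbest_fit = [fitness(a, p_values, dependency) for a in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Standard velocity update: inertia + cognitive + social terms.
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            # Clamp the threshold to the allowed significance range.
            pos[i] = min(ALPHA_MAX, max(ALPHA_MIN, pos[i] + vel[i]))
            f = fitness(pos[i], p_values, dependency)
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i], f
            if f < gbest_fit:
                gbest, gbest_fit = pos[i], f
    return gbest, gbest_fit
```

Note that this simplified fitness is zero for any single-feature subset, so a practical fitness function would need additional terms (for example, classification accuracy or subset size) to avoid favouring trivially small subsets.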