Abstract

The proposed system addresses the challenges posed by large databases, class imbalance, heterogeneity, and high dimensionality through a novel classification approach built on progressive sampling. Sampling reduces processing cost and keeps the workload within memory constraints. Random forest regressor feature importance, computed with the Gini significance method, identifies the most informative attributes and reduces the feature set used for classification. The system supports diverse classifiers, including random forest, ensemble methods, support vector machines (SVM), k-nearest neighbors (KNN), and logistic regression, giving it the flexibility to handle different data types while maintaining high classification accuracy. By iteratively applying progressive sampling to the dataset restricted to the best features, the technique concentrates computational resources on the most informative subsets of the data and substantially reduces time complexity compared to processing the entire dataset. Results show that the system achieves over 85% accuracy using only 5-10% of the original data, delivering accurate predictions at a fraction of the data-processing cost. In summary, the proposed system combines progressive sampling with feature selection via random forest regressor feature importance (RFRFI-PS) and a range of classifiers to address the challenges of large databases, demonstrating promising gains in both accuracy and time complexity.
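
The following is a minimal sketch of the RFRFI-PS pipeline as the abstract describes it, assuming scikit-learn. The synthetic dataset, the top-k = 10 feature cutoff, the classifier hyperparameters, and the 5%-100% sampling schedule are illustrative assumptions, not the paper's settings; the regressor's impurity-based `feature_importances_` stands in for the Gini significance scores mentioned above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a large, high-dimensional dataset.
X, y = make_classification(n_samples=20000, n_features=50,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Step 1: rank features with a random forest regressor. Its
# impurity-based feature_importances_ stands in for the Gini
# significance scores described in the abstract.
ranker = RandomForestRegressor(n_estimators=100, random_state=0)
ranker.fit(X_train, y_train)
top_k = 10  # assumed cutoff; the abstract does not state one
best = np.argsort(ranker.feature_importances_)[::-1][:top_k]
X_tr, X_te = X_train[:, best], X_test[:, best]

# Step 2: progressive sampling -- train each classifier on growing
# fractions of the reduced training set. The full method would stop
# once accuracy plateaus rather than running the whole schedule.
classifiers = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
}
rng = np.random.default_rng(0)
for frac in (0.05, 0.10, 0.25, 0.50, 1.00):
    idx = rng.choice(len(X_tr), size=int(frac * len(X_tr)), replace=False)
    for name, clf in classifiers.items():
        clf.fit(X_tr[idx], y_train[idx])
        acc = accuracy_score(y_test, clf.predict(X_te))
        print(f"sample={frac:.0%}  {name}: accuracy={acc:.3f}")
```

On this synthetic data the smallest samples train in a fraction of the time needed for the full set, mirroring the time-complexity reduction the abstract reports; the 85% accuracy figure is the paper's reported result, not something this sketch reproduces.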
