Background: The dynamic vehicle routing problem (DVRP) is a complex optimization problem central to applications such as last-mile delivery. Our goal is to develop an application that makes real-time decisions to maximize overall performance while adapting to dynamically arriving orders. We consider a DVRP variant in which new customer requests arrive dynamically and each must be immediately accepted or rejected.

Methods: This study tackles the DVRP with reinforcement learning (RL), a machine learning paradigm in which an agent learns a decision policy from feedback on its actions. We present a detailed RL formulation and systematically investigate how individual state-space components affect algorithm performance. Our approach incrementally modifies the state space: analyzing the contribution of individual components, applying data transformation methods, and incorporating derived features.

Results: Our findings demonstrate that a carefully designed state space significantly improves RL performance on the DVRP. Notably, incorporating derived features and selectively applying feature transformations enhanced the model's decision-making capabilities. Combining all enhancements yielded a statistically significant improvement over the basic state formulation.

Conclusions: This research provides insights into RL modeling for DVRPs, highlighting the importance of state-space design. The proposed approach offers a flexible framework applicable to various DVRP variants, with potential for validation on real-world data.
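Since the abstract centers on how state-space design drives RL performance, the Python sketch below illustrates the distinction between a basic state vector and one enriched with scaled features and derived features. This is a minimal illustration only: all field names, the specific derived features (vehicle-to-customer distance, capacity slack), and the scaling constants are assumptions for exposition, not details taken from the paper.

```python
# A minimal sketch (not the authors' implementation) of how a DVRP
# acceptance/rejection state vector might be assembled. All attribute
# names, derived features, and scaling choices are illustrative
# assumptions, not details from the paper.
import math
from dataclasses import dataclass


@dataclass
class Request:
    x: float          # customer x-coordinate
    y: float          # customer y-coordinate
    demand: float     # requested load


@dataclass
class VehicleState:
    x: float                    # current vehicle x-coordinate
    y: float                    # current vehicle y-coordinate
    remaining_capacity: float   # load the vehicle can still carry
    remaining_time: float       # fraction of the planning horizon left


def basic_state(req: Request, veh: VehicleState) -> list[float]:
    """Basic state: raw concatenation of request and vehicle attributes."""
    return [req.x, req.y, req.demand,
            veh.x, veh.y, veh.remaining_capacity, veh.remaining_time]


def enhanced_state(req: Request, veh: VehicleState,
                   area_size: float = 100.0,
                   max_demand: float = 10.0) -> list[float]:
    """Enhanced state: selectively scaled raw features plus derived
    features (vehicle-to-customer distance, capacity slack)."""
    dist = math.hypot(req.x - veh.x, req.y - veh.y)   # derived feature
    slack = veh.remaining_capacity - req.demand        # derived feature
    # Min-max scaling applied selectively to spatial/load features.
    return [req.x / area_size, req.y / area_size,
            req.demand / max_demand,
            veh.x / area_size, veh.y / area_size,
            veh.remaining_capacity / max_demand,
            veh.remaining_time,
            dist / (area_size * math.sqrt(2)),         # normalized distance
            slack / max_demand]                        # normalized slack


if __name__ == "__main__":
    req = Request(x=30.0, y=40.0, demand=2.0)
    veh = VehicleState(x=0.0, y=0.0, remaining_capacity=5.0,
                       remaining_time=0.7)
    print("basic:   ", basic_state(req, veh))
    print("enhanced:", enhanced_state(req, veh))
```

Either vector could serve as the observation fed to an RL agent that outputs an accept/reject action; the paper's contribution, per the abstract, is measuring how such state-space choices change learning performance.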