Balanced X-ray Security Dataset and Enhanced YOLO for Contraband Detection
To address critical challenges in X-ray contraband detection—including severe class imbalance in existing datasets, scarcity of high-quality annotated data, and poor model adaptability to complex scenarios—this study first constructs a balanced X-ray contraband detection dataset. Derived from the SIXray and PIDray datasets, the balanced dataset comprises 13,728 images covering 12 different contraband categories. To resolve class imbalance, a Class-Specific Augmentation Framework (CSAF) with four physical transformations and random undersampling are adopted, ensuring approximately 1,500 samples per category for uniform class distribution. Two improved models (ASEA-Net and CSEC-Net) based on YOLOv11s are proposed for lightweight and high-precision contraband detection tasks. Experiments on the balanced dataset show that ASEA-Net achieves 95.78% accuracy and 93.55% mAP@50, outperforming YOLOv11s by 1.46% and 1.37% respectively with 13.37% fewer parameters; CSEC-Net reduces parameters by 39.91% and FLOPs by 40.38% compared to YOLOv11s, enabling deployment on resource-constrained edge devices. Both models exhibit strong performance in complex scenarios, validating the value of the balanced dataset and the effectiveness of the proposed models for X-ray contraband detection.
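The balancing recipe described above (undersample over-represented classes, top up under-represented ones toward roughly 1,500 samples each) can be sketched as follows. This is an illustration only: the `augment` placeholder stands in for the CSAF physical transformations, whose details the abstract does not give, and `TARGET_PER_CLASS` is taken from the stated figure.

```python
import random

TARGET_PER_CLASS = 1500  # per-category size stated in the abstract

def augment(sample):
    # Placeholder for a CSAF physical transform (e.g. flip/rotate);
    # here it just tags the copied sample.
    return (sample, "augmented")

def balance_classes(by_class, target=TARGET_PER_CLASS, rng=random.Random(0)):
    """Randomly undersample classes above `target`; pad classes below it
    with augmented copies of existing samples."""
    balanced = {}
    for label, samples in by_class.items():
        if len(samples) >= target:
            balanced[label] = rng.sample(samples, target)       # undersample
        else:
            extra = [augment(rng.choice(samples))
                     for _ in range(target - len(samples))]      # augment up
            balanced[label] = samples + extra
    return balanced
```

Applied to a toy split (4,000 "gun" images, 300 "knife" images), every class ends up with exactly 1,500 entries.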
- Research Article
4
- 10.1016/j.geodrs.2024.e00821
- Jun 15, 2024
- Geoderma Regional
Soil textural class modeling using digital soil mapping approaches: Effect of resampling strategies on imbalanced dataset predictions
- Research Article
30
- 10.1007/s10586-023-04170-z
- Oct 28, 2023
- Cluster Computing
Software defects are a critical issue in software development that can lead to system failures and cause significant financial losses. Predicting software defects is a vital aspect of ensuring software quality, and can significantly impact both saving time and reducing the overall cost of software testing. During the software defect prediction (SDP) process, automated tools attempt to predict defects in source code based on software metrics. Several SDP models have been proposed to identify and prevent defects before they occur. In recent years, recurrent neural network (RNN) techniques have gained attention for their ability to handle sequential data and learn complex patterns. Still, these techniques are not always suitable for predicting software defects due to the problem of imbalanced data. To deal with this problem, this study combines a bidirectional long short-term memory (Bi-LSTM) network with oversampling techniques. To establish the effectiveness and efficiency of the proposed model, experiments were conducted on benchmark datasets obtained from the PROMISE repository. The experimental results were compared and evaluated in terms of accuracy, precision, recall, F-measure, Matthews correlation coefficient (MCC), the area under the ROC curve (AUC), the area under the precision-recall curve (AUCPR), and mean square error (MSE). The average accuracy of the proposed model on the original and balanced datasets (using random oversampling and SMOTE) was 88%, 94%, and 92%, respectively: the proposed Bi-LSTM on the balanced datasets improves average accuracy by 6% and 4% compared to the original datasets. The average F-measure on the original and balanced datasets (using random oversampling and SMOTE) was 51%, 94%, and 92%, respectively, an improvement of 43% and 41% on the balanced datasets. The experimental results demonstrate that combining the Bi-LSTM network with oversampling techniques positively affects defect prediction performance in datasets with imbalanced class distributions.
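The random oversampling step credited with the accuracy gains above can be sketched in a few lines. This is a generic illustration of the technique, not the paper's PROMISE/Bi-LSTM pipeline.

```python
import random
from collections import Counter

def random_oversample(X, y, rng=random.Random(42)):
    """Duplicate randomly chosen minority-class rows until every class
    matches the majority-class count."""
    counts = Counter(y)
    majority = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(majority - n):
            i = rng.choice(idx)       # resample an existing minority row
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out
```

On a 4-vs-1 toy set, the output contains four rows of each class.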
- Research Article
9
- 10.17485/ijst/v15i17.2339
- May 5, 2022
- Indian Journal of Science and Technology
Background: Class imbalance is often discussed as a strenuous task in the realm of sentiment analysis. In an imbalanced classification, the few minority class instances are unable to provide sufficient information, so direct learning from an unbalanced dataset can produce unsatisfactory results. This work aims to address the problem of class imbalance. Methods: At the primary level this study uses the Synthetic Minority Oversampling Technique (SMOTE) for balancing the dataset and then proposes an ensemble model, named Ensemble Bagging Support Vector Machine (EBSVM), for opinion mining. To measure the performance of the approach, numerous analyses are conducted on both imbalanced and balanced datasets. The work then compares the effectiveness of the suggested model with three base classifiers (Naïve Bayes (NB), Decision Tree (DT), and Support Vector Machine (SVM)). Customer reviews of restaurants were chosen as the dataset for this work. Accuracy, precision, recall, and F-measure are used as evaluation metrics. Findings: According to the results, the suggested EBSVM model excels all other individual classifiers on both the imbalanced and the SMOTE-balanced dataset. The balanced EBSVM classifier improves on the imbalanced EBSVM classifier in terms of accuracy, and the precision, recall, and F-measure of the minority class improve in the balanced classifiers. Novelty: The performance of opinion mining classifiers for imbalanced and balanced datasets is evaluated in this paper. The work examines not only general opinions, but also specific aspects such as food, service, ambiance, quality, and price. Comparing the suggested model with existing classification algorithms in the literature, it was found to outperform the other models. Keywords: Bagging; Accuracy; Ensemble; Precision; Recall; F-measure
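SMOTE, used here at the primary level, generates synthetic minority points by interpolating between a minority sample and one of its nearest minority-class neighbours. A minimal pure-Python sketch of the idea (not the paper's implementation):

```python
import math
import random

def smote(minority, n_new, k=3, rng=random.Random(0)):
    """Generate n_new synthetic samples: each lies on the segment between
    a random minority point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic
```

Because each synthetic point is a convex combination of two real points, samples drawn from the unit square stay inside it.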
- Research Article
31
- 10.3390/info11110519
- Nov 5, 2020
- Information
In this work, we propose a combined sampling technique to improve the performance of imbalanced classification of university student depression data. In the experimental results, we found that combining random oversampling with the Tomek links undersampling method allowed generating a relatively balanced depression dataset without losing significant information. In this case, the random oversampling technique was used to sample the minority class and balance the number of samples between the classes. Then, the Tomek links technique was used to undersample the data by removing depression records considered less relevant and noisy. The relatively balanced dataset was classified by random forest. The results show that the overall accuracy in the prediction of adolescent depression data was 94.17%, outperforming the individual sampling techniques. Moreover, our proposed method was tested with another dataset for external validity; this dataset's predictive accuracy was found to be 93.33%.
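The Tomek links step described above removes majority-class points that form mutual nearest-neighbour pairs with the opposite class. A small illustrative sketch, assuming feature tuples in a metric space (not the paper's code):

```python
import math

def tomek_remove(X, y, majority_label):
    """Drop majority-class points that form Tomek links, i.e. pairs of
    mutual nearest neighbours with opposite labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    drop = set()
    for i in range(len(X)):
        j = nearest(i)
        if y[i] != y[j] and nearest(j) == i:   # mutual NN, opposite class
            drop.add(i if y[i] == majority_label else j)
    return ([x for k, x in enumerate(X) if k not in drop],
            [lab for k, lab in enumerate(y) if k not in drop])
```

In a toy set where a majority point at (0, 0) and a minority point at (0.4, 0) are each other's nearest neighbours, only the majority member of the link is removed.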
- Conference Article
3
- 10.1109/acmi53878.2021.9528132
- Jul 8, 2021
In a class-imbalanced data set, one class contains more instances than the other, a critical problem in data mining. Many approaches such as oversampling, undersampling, and cost-sensitive methods have been developed to mitigate the effects of class imbalance, but these methods suffer from various shortcomings. In the existing methods, researchers have hardly used normalization on the imbalanced data set to mitigate these effects. In this work, we implemented two state-of-the-art data balancing methods, Random Undersampling (RUS) and Random Oversampling (ROS), ensembled with the AdaBoost algorithm. We then investigated and compared the two methods with a recently developed approach called the Random Splitting data balancing (SplitBal) method, with and without applying normalization to the imbalanced data set. For normalization, three well-known techniques are used: min-max, z-score, and robust-scaling normalization. SplitBal is an ensemble method that first converts the imbalanced data set into several balanced data sets; multiple classification models are then built from these and combined by the max ensemble rule. The empirical analysis on fifteen imbalanced data sets shows that SplitBal with min-max normalization dominates the other data balancing methods considered in this work for the Random Forest classifier.
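The SplitBal idea, cutting the shuffled majority class into minority-sized chunks to form several balanced subsets, together with min-max normalization, can be sketched as follows (an illustration, not the paper's code; one model would then be trained per subset and their predictions ensembled by the max rule):

```python
import random

def minmax(rows):
    """Min-max normalise each feature column of a list of tuples to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(r, lo, hi)) for r in rows]

def split_bal(majority, minority, rng=random.Random(0)):
    """Shuffle the majority class and cut it into chunks of minority size,
    pairing each chunk with the full minority class."""
    maj = majority[:]
    rng.shuffle(maj)
    step = len(minority)
    return [(maj[i:i + step], minority) for i in range(0, len(maj), step)]
```

With 10 majority and 3 minority samples this yields four subsets (the last chunk is smaller), each containing the whole minority class.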
- Research Article
5
- 10.1007/s11227-024-06265-9
- Jun 5, 2024
- The Journal of Supercomputing
Code smells indicate potential symptoms or problems in software due to inefficient design or incomplete implementation. These problems can affect software quality in the long-term. Code smell detection is fundamental to improving software quality and maintainability, reducing software failure risk, and helping to refactor the code. Previous works have applied several prediction methods for code smell detection. However, many of them show that machine learning (ML) and deep learning (DL) techniques are not always suitable for code smell detection due to the problem of imbalanced data. So, data imbalance is the main challenge for ML and DL techniques in detecting code smells. To overcome these challenges, this study aims to present a method for detecting code smell based on DL algorithms (Bidirectional Long Short-Term Memory (Bi-LSTM) and Gated Recurrent Unit (GRU)) combined with data balancing techniques (random oversampling and Tomek links) to mitigate data imbalance issue. To establish the effectiveness of the proposed models, the experiments were conducted on four code smells datasets (God class, data Class, feature envy, and long method) extracted from 74 open-source systems. We compare and evaluate the performance of the models according to seven different performance measures accuracy, precision, recall, f-measure, Matthew’s correlation coefficient (MCC), the area under a receiver operating characteristic curve (AUC), the area under the precision–recall curve (AUCPR) and mean square error (MSE). 
After comparing the results obtained by the proposed models on the original and balanced data sets, we found out that the best accuracy of 98% was obtained for the Long method by using both models (Bi-LSTM and GRU) on the original datasets, the best accuracy of 100% was obtained for the long method by using both models (Bi-LSTM and GRU) on the balanced datasets (using random oversampling), and the best accuracy 99% was obtained for the long method by using Bi-LSTM model and 99% was obtained for the data class and Feature envy by using GRU model on the balanced datasets (using Tomek links). The results indicate that the use of data balancing techniques had a positive effect on the predictive accuracy of the models presented. The results show that the proposed models can detect the code smells more accurately and effectively.
- Research Article
- 10.58245/ipsi.tir.2402.08
- Jul 1, 2024
- IPSI Transactions on Internet Research
The main challenge in machine learning-based customer response models is the class imbalance problem, i.e., the small number of respondents compared to non-respondents. Aiming to overcome this issue, we tested the approach of preprocessing training data using a Support Vector Machine (SVM) trained on a balanced sample obtained by random undersampling (B-SVM), as well as on a balanced sample obtained by clustering-based undersampling (CB-SVM). Several classifiers were then tested on the resulting balanced dataset to compare their predictive performance. The results of this paper demonstrate that the approach effectively preprocesses the training data and, in turn, reduces noise and overcomes the class imbalance problem. Better predictive performance was achieved compared to standard training data balancing techniques such as undersampling and SMOTE. CB-SVM gives better sensitivity, while B-SVM gives a better ratio of sensitivity to specificity. Organizations can use this approach to balance training data automatically and to more simply and efficiently select the customers that should be targeted in their next direct marketing campaigns.
- Research Article
1
- 10.32629/jai.v7i4.1021
- Mar 4, 2024
- Journal of Autonomous Intelligence
<p class="Keywords">Early detection of gastric cancer through a Computer-Aided Detection (CAD) system has the potential to significantly reduce the mortality rate associated with this disease. This study aims to investigate the effects of class imbalance on the performance of machine learning classifiers in this context. Using a dataset of 145,787 screening records from NHS Liverpool Hospital, we employed stratified sampling to create balanced and unbalanced datasets and evaluated the performance of four machine learning algorithms—Logistic Regression, Support Vector Machine, Naive Bayes, and Multilayer Perceptron—under five different test conditions. The study’s novelty lies in its detailed examination of class imbalance in gastric cancer diagnosis, emphasizing the crucial role of balanced datasets in machine learning-based early detection systems. For the MLP model under 10-fold cross-validation, the Class 0 sensitivity (non-cancer cases) of the unbalanced dataset was 0.968, higher than the balanced dataset’s 0.902. However, the Class 1 sensitivity (cancer cases) and Positive Predictive Value (PPV) of the unbalanced dataset were much lower (0.383 and 0.527) than those of the balanced dataset (0.959 and 0.907), indicating a significant improvement in identifying true positive cases when using a balanced dataset. These findings highlight the negative effect of class imbalance on prediction accuracy for positive cancer cases and underscore the importance of addressing this imbalance for more reliable and accurate predictions in medical diagnosis and screening. This approach has the potential to improve patient outcomes and may contribute to strategies aimed at reducing the mortality rate associated with gastric cancer.</p>
- Research Article
- 10.1145/3700791
- Oct 29, 2024
- ACM Transactions on the Web
Addressing the challenge of toxic language in online discussions is crucial for the development of effective toxicity detection models. This pioneering work focuses on addressing imbalanced datasets in toxicity detection by introducing a novel approach to augmenting toxic language data. We create a balanced dataset by instruction fine-tuning Large Language Models (LLMs) using Reinforcement Learning with Human Feedback (RLHF). Recognizing the challenges in collecting sufficient toxic samples from social media platforms for building a balanced dataset, our methodology involves sentence-level text data augmentation through paraphrasing existing samples using optimized generative LLMs. Leveraging a generative LLM, we utilize Proximal Policy Optimization (PPO) as the RL algorithm to fine-tune the model further and align it with human feedback. In other words, we start by fine-tuning an LLM using an instruction dataset specifically tailored for the task of paraphrasing while maintaining semantic consistency. Next, we apply PPO and a reward function to further fine-tune (optimize) the instruction-tuned LLM. This RL process guides the model in generating toxic responses. We utilize the Google Perspective API as a toxicity evaluator to assess generated responses and assign rewards or penalties accordingly. This approach guides LLMs through PPO and the reward function, transforming minority class samples into augmented versions. The primary goal of our methodology is to create a balanced and diverse dataset to enhance the accuracy and performance of classifiers in identifying instances of the minority class. Using two publicly available toxic datasets, we compared various techniques with our proposed method for generating toxic samples, demonstrating that our approach outperforms all others in producing a higher number of toxic samples.
Starting with an initial 16,225 toxic prompts, our method successfully generated 122,951 toxic samples with a toxicity score exceeding 30%. Subsequently, we developed various classifiers using the generated balanced datasets and applied a cost-sensitive learning approach to the original imbalanced dataset. The findings highlight the superior performance of classifiers trained on data generated using our proposed method. These results highlight the importance of employing RL and a data-agnostic model as a reward mechanism for augmenting toxic data, thereby enhancing the robustness of toxicity detection models.
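The cost-sensitive learning baseline mentioned above typically weights each class inversely to its frequency, so errors on the minority class cost more. A common heuristic sketch (the exact weighting scheme is not given in the abstract):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return per-class weights proportional to total / (n_classes * count),
    so rare classes get weights above 1 and common classes below 1."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}
```

For a 90/10 binary split, the minority class gets weight 5.0 and the majority class roughly 0.56, which a classifier's loss can then use to penalize minority-class mistakes more heavily.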
- Research Article
- 10.30534/ijatcse/2024/011352024
- Oct 10, 2024
- International Journal of Advanced Trends in Computer Science and Engineering
The transformer architecture, first introduced in 2017 by researchers at Google, has revolutionized natural language processing in various tasks, including text classification. This architecture formed the basis of future models such as those used in hate speech detection in code-switched text. In this research, we conduct a comparative study of transformer-based models for hate speech detection in English-Kiswahili code-switched text. First, the models were compared as feature extractors using a traditional classifier and then as end-to-end classifiers. The three multilingual transformer-based models compared include mBERT, mDistilBERT and XLM-RoBERTa, using SVM as the traditional classifier for the extracted features. The HateSpeech_Kenya dataset, sourced from Kaggle, was utilized in this study. As a feature extractor, mBERT’s hidden states trained the highest-performing SVM with an accuracy of 0.5461 and a macro f1 score of 0.40. Among the three models evaluated, XLM-RoBERTa achieved the highest accuracy of 0.6069 and a macro f1 score of 0.49 on a balanced dataset. In contrast, mBERT achieved the highest accuracy of 0.7820 and a macro f1 score of 0.53 on an imbalanced dataset. The comparative study establishes that using transformer-based models as end-to-end classifiers generally performs better than using them as feature extractors with traditional classifiers. This is because directly training the models allows them to learn more task-specific features. Furthermore, the varying performance across balanced and imbalanced datasets highlights the need for careful model selection based on the dataset characteristics and specific task requirements.
- Book Chapter
3
- 10.1007/978-3-030-36365-9_8
- Jan 1, 2019
Imbalanced datasets typically occur in many real applications. Resampling is one of the effective solutions because it produces a balanced class distribution. The Synthetic Minority Over-sampling Technique (SMOTE), an over-sampling technique, is used in this study to deal with the imbalanced dataset by increasing the number of instances of the minority class. This technique decreases the imbalance percentage of the dataset by generating new synthetic samples, producing a balanced training dataset to replace the class-imbalanced one. The balanced datasets were obtained and used to train machine learning algorithms to diagnose the disease class. Through experiments on real-world datasets, the oral cancer dataset and the erythemato-squamous diseases dataset from the UCI machine learning repository, the over-sampling method showed better results in clinical disease classification.
- Book Chapter
- 10.4018/979-8-3373-2647-4.ch010
- May 9, 2025
Employee churn is a significant challenge for organizations, leading to substantial costs associated with recruiting, onboarding, and training new employees. High turnover rates can negatively impact overall productivity, employee morale, and organizational stability. Therefore, accurately predicting employee churn is crucial for companies to implement targeted retention strategies, minimize turnover, and reduce associated expenses. In this study, we leveraged machine learning techniques to predict employee churn using the "HR Analytics" dataset from Kaggle. One of the key challenges in churn prediction is class imbalance, where the number of employees who leave is significantly lower than those who stay. To address this, we applied two data-balancing techniques: Synthetic Minority Over-sampling Technique (SMOTE) and Random Over-Sampling (ROS). We then trained and evaluated four machine learning models (Logistic Regression, Random Forest, Decision Tree, and Extreme Gradient Boosting (XGBoost)) on the balanced datasets. The F1 scores for the SMOTE-balanced data were: Logistic Regression (0.5990), Random Forest (0.9753), Decision Tree (0.9319), and XGBoost (0.9634). Meanwhile, the ROS-balanced data produced F1 scores of: Logistic Regression (0.5978), Random Forest (0.9760), Decision Tree (0.9475), and XGBoost (0.9703). The results demonstrated that ROS yielded superior performance, particularly for the Random Forest and XGBoost models, leading us to select ROS for further hyperparameter tuning. Using RandomizedSearchCV for optimization, the Random Forest model achieved the highest F1 score of 0.9779. Finally, we deployed the optimized Random Forest model via a Flask API, enabling HR professionals to access a user-friendly web interface for real-time churn prediction.
This research highlights the effectiveness of machine learning in HR analytics and underscores the practical benefits of predictive modeling in workforce management, helping organizations proactively address employee retention challenges.
- Conference Article
12
- 10.1109/ecace.2019.8679382
- Feb 1, 2019
Software defect prediction is related to the testing area of the software industry. Several methods have been developed for predicting bugs in software source code. The objective of this study is to find the performance difference between imbalanced and balanced data sets, and between single classifiers and an aggregate classifier (voting). In this investigation, eight publicly available data sets were collected, and seven algorithms plus hard voting were used to compute precision, recall, and F1-score for software defect prediction. Of the collected data, two sets are almost balanced; for this investigation, these balanced data sets were converted into imbalanced sets matching the average non-defective-to-defective ratio of the other six data sets. The experimental results show that the performance on the two balanced data sets was lower than on the other six sets, and that after conversion their performance increased to match the other six. Another observation concerns the performance metrics: precision, recall, and F1-score for voting are 0.92, 0.84, and 0.87 respectively, better than any single classifier. This study shows that the imbalance of non-defective and defective classes has a big impact on software defect prediction and that voting is the best performer among the classifiers.
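The hard-voting rule used as the aggregate classifier above simply takes, for each sample, the majority label across the individual classifiers; a minimal sketch:

```python
from collections import Counter

def hard_vote(predictions):
    """predictions: one list of labels per classifier, aligned by sample.
    Returns the majority label for each sample (ties broken by the
    first-seen label, since Counter preserves insertion order)."""
    return [Counter(sample_preds).most_common(1)[0][0]
            for sample_preds in zip(*predictions)]
```

With three classifiers disagreeing on three samples, the majority label wins each time.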
- Research Article
- 10.56714/bjrs.50.2.16
- Dec 31, 2024
- Basrah Researches Sciences
Machine learning (ML) is increasingly indispensable in modern medicine, particularly for disease prediction and improving patient outcomes. This study applies ML techniques to predict thyroid disorders in diabetic patients, a critical task given the frequent co-occurrence and complex interplay between these conditions. Six ML classifiers, namely Random Forest (RF), Decision Tree (DT), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Logistic Regression (LR), and Naive Bayes (NB), were evaluated across three experiments on a local dataset: (1) a balanced dataset using Random Under-Sampling (RUS), (2) a subset of Type 2 diabetes (T2D) patients, and (3) a subset of Type 1 diabetes (T1D) patients. The Random Forest classifier consistently outperformed the other classifiers, achieving the highest accuracy (0.85) and F1-score (0.83) on the T2D-focused dataset and showing robust performance on the RUS-balanced dataset. These results highlight the suitability of Random Forest for deployment in clinical settings and underscore the importance of balancing techniques like RUS in improving predictive accuracy. However, challenges remain in predicting thyroid disorders among T1D patients due to the low prevalence of thyroid disorders in this group. The findings reinforce the potential of ML in advancing diagnostics and personalized care in diabetic populations.
- Research Article
1
- 10.3390/computers14070283
- Jul 17, 2025
- Computers
The Internet of Things (IoT) holds transformative potential in fields such as power grid optimization, defense networks, and healthcare. However, the constrained processing capacities and resource limitations of IoT networks make them especially susceptible to cyber threats. This study addresses the problem of detecting intrusions in IoT environments by evaluating the performance of deep learning (DL) models under different data and algorithmic conditions. We conducted a comparative analysis of three widely used DL models—Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), and Bidirectional LSTM (biLSTM)—across four benchmark IoT intrusion detection datasets: BoTIoT, CiCIoT, ToNIoT, and WUSTL-IIoT-2021. Each model was assessed under balanced and imbalanced dataset configurations and evaluated using three loss functions (cross-entropy, focal loss, and dual focal loss). By analyzing model efficacy across these datasets, we highlight the importance of generalizability and adaptability to varied data characteristics that are essential for real-world applications. The results demonstrate that the CNN trained using the cross-entropy loss function consistently outperforms the other models, particularly on balanced datasets. On the other hand, LSTM and biLSTM show strong potential in temporal modeling, but their performance is highly dependent on the characteristics of the dataset. By analyzing the performance of multiple DL models under diverse datasets, this research provides actionable insights for developing secure, interpretable IoT systems.
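Focal loss, one of the three loss functions compared above, down-weights well-classified examples so training concentrates on hard (often minority-class) cases. A sketch of the standard binary form from Lin et al.; the dual focal loss variant used in the study is not reproduced here.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction p = P(y=1).
    The (1 - pt)**gamma factor shrinks the loss of confident,
    correct predictions; alpha rebalances the two classes."""
    pt = p if y == 1 else 1.0 - p
    at = alpha if y == 1 else 1.0 - alpha
    return -at * (1.0 - pt) ** gamma * math.log(pt)
```

A confident correct prediction incurs far less loss than an uncertain one, and setting gamma to 0 recovers alpha-weighted cross-entropy.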