The Effect Of Chi-Square Feature Selection On The Naive Bayes Algorithm In Analyzing The Sentiment Of Gojek Application Reviews On Google Play Store
This study analyzes customer sentiment in Gojek application reviews to determine whether Chi-Square feature selection can improve the performance of a sentiment analysis model. It uses 12,000 Gojek reviews, each labeled positive, negative, or neutral based on the user's rating. Naive Bayes with and without Chi-Square feature selection is evaluated on accuracy, precision, recall, and F1 score. The best performance is obtained with alpha 0.5 combined with the top 2,000 Chi-Square features, which yields 86.96% accuracy, 87.84% precision, 86.96% recall, and an 85.29% F1 score on imbalanced data. SMOTE is also used to compensate for the low number of neutral reviews, but it produces lower accuracy. In conclusion, Chi-Square feature selection can improve the accuracy of the Naive Bayes algorithm on both imbalanced and balanced datasets.
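To make the feature-selection step concrete, the following is a minimal sketch of Chi-Square term scoring using the textbook 2x2 contingency form (term present or absent versus class is or is not the target). It is not the study's code: the function names `chi2_score` and `top_k_terms` are hypothetical, documents are assumed to be pre-tokenized, and library implementations may compute the statistic differently.

```python
def chi2_score(docs, labels, term, target):
    """Chi-square statistic of the 2x2 table: term presence vs. label == target."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        is_target = label == target
        if has_term and is_target:
            a += 1          # term present, target class
        elif has_term:
            b += 1          # term present, other class
        elif is_target:
            c += 1          # term absent, target class
        else:
            d += 1          # term absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A zero denominator means the term or the class never varies: score it 0.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def top_k_terms(docs, labels, k):
    """Rank terms by their best per-class score and keep the k highest."""
    vocab = set().union(*docs)
    classes = set(labels)
    scored = {
        t: max(chi2_score(docs, labels, t, c) for c in classes) for t in vocab
    }
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scored, key=lambda t: (-scored[t], t))[:k]
```

In a setup like the one described, `k` would play the role of the 2,000 retained features, and only the selected terms would be passed on to the classifier.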
- Research Article
12
- 10.3390/agronomy14123001
- Dec 17, 2024
- Agronomy
Crop diseases pose a significant threat to global food security, with both economic and environmental consequences. Early and accurate detection is essential for timely intervention and sustainable farming. This paper presents a review of machine learning (ML) and deep learning (DL) techniques for crop disease diagnosis, focusing on Support Vector Machines (SVMs), Random Forest (RF), k-Nearest Neighbors (KNNs), and deep models like VGG16, ResNet50, and DenseNet121. The review method includes an in-depth analysis of algorithm performance using key metrics such as accuracy, precision, recall, and F1 score across various datasets. We also highlight the data imbalances in commonly used datasets, particularly PlantVillage, and discuss the challenges posed by these imbalances. The research highlights critical insights regarding ML and DL models in crop disease detection. A primary challenge identified is the imbalance in the PlantVillage dataset, with a high number of healthy images and a strong bias toward certain disease categories like fungi, leaving other categories like mites and molds underrepresented. This imbalance complicates model generalization, indicating a need for preprocessing steps to enhance performance. This study also shows that combining Vision Transformers (ViTs) with Green Chromatic Coordinates and hybridizing these with SVM achieves high classification accuracy, emphasizing the value of advanced feature extraction techniques in improving model efficacy. In terms of comparative performance, DL architectures like ResNet50, VGG16, and convolutional neural network demonstrated robust accuracy (95–99%) across diverse datasets, underscoring their effectiveness in managing complex image data. Additionally, traditional ML models exhibited varied strengths; for instance, SVM performed better on balanced datasets, while RF excelled with imbalanced data. 
Preprocessing methods like K-means clustering, Fuzzy C-Means, and PCA, along with ensemble approaches, further improved model accuracy. Lastly, the study underscores that high-quality, well-labeled datasets, stakeholder involvement, and comprehensive evaluation metrics such as F1 score and precision are crucial for optimizing ML and DL models, making them more effective for real-world applications in sustainable agriculture.
- Research Article
- 10.11591/csit.v5i2.pp112-121
- Jul 1, 2024
- Computer Science and Information Technologies
Identifying the genus of fungi is known to facilitate the discovery of new medicinal compounds. Currently, isolation and identification are predominantly conducted in the laboratory using molecular samples. However, mastering this process requires specific skills, making it a challenging task. In addition, rapid and highly accurate identification of fungal microbes remains a persistent challenge. Here, we employ a deep learning technique to classify fungus images for both balanced and imbalanced datasets. This research used transfer learning with the InceptionV3 model to classify fungi from the genera Aspergillus, Cladosporium, and Fusarium. Two experiments were run, one on the balanced dataset and one on the imbalanced dataset. Model effectiveness was evaluated with standard metrics such as accuracy, precision, recall, and F1 score. The trendline of deviation was used to determine the optimal epoch in each experimental model. The evaluation results show that both experiments achieve good accuracy, precision, recall, and F1 score. A range of epochs could be identified from the accuracy and loss trendline curves in the balanced-dataset experiment, but not in the imbalanced-dataset experiment. Nevertheless, the validation results on the imbalanced dataset are still quite accurate, close to the accuracy on the balanced dataset.
- Research Article
- 10.30865/mib.v8i3.7886
- Jul 27, 2024
- JURNAL MEDIA INFORMATIKA BUDIDARMA
Floods are one of the natural disasters that frequently occur in Indonesia. The city of Samarinda is affected by floods every year, resulting in significant losses. The data used in this study comes from the Regional Disaster Management Agency (BPBD) and the Meteorology, Climatology, and Geophysics Agency (BMKG) for the years 2021-2023 in Samarinda. This data includes 11 attributes and 1095 records. Previous studies on data mining related to floods have been conducted. However, issues arise with high-dimensional data and data imbalance. High dimensionality leads to overfitting and reduced accuracy, while imbalanced data causes overfitting to the majority class and inaccurate representation. This study aims to improve the accuracy of the Naive Bayes algorithm in predicting high-dimensional and imbalanced flood data. The approach involves using the Chi-Square feature selection technique and oversampling with the Synthetic Minority Over-sampling Technique (SMOTE). Chi-Square is used to find optimal features for predicting floods and to enhance the accuracy of the Naive Bayes algorithm in predicting high-dimensional and imbalanced flood data. The validation method used is 10-fold cross-validation, and a confusion matrix model is employed to calculate accuracy values. The results of the study show that Chi-Square can identify four best features: average humidity (rh_avg), rainfall (rr), maximum wind direction (ddd_x), and most frequent wind direction (ddd_car). The use of the Naive Bayes algorithm with SMOTE achieved an accuracy of 71.58%. However, after applying Chi-Square feature selection, the accuracy dropped to 60.82%. This decline is attributed to the reduced number of minority classes after feature selection. Therefore, Chi-Square feature selection is not sufficiently effective in improving the accuracy of Naive Bayes on high-dimensional data.
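The oversampling idea behind SMOTE can be sketched as follows: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is an illustrative simplification under stated assumptions, not the study's implementation; the function name `smote_sketch`, the default `k=2`, and the fixed seed are all hypothetical choices.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric tuples (needs at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic
```

Because the new points are interpolations, they always stay inside the region already spanned by the minority class. Oversampling is normally applied to the training folds only, so that synthetic points never leak into the data used for validation.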
- Research Article
- 10.1515/jisys-2024-0406
- Dec 4, 2025
- Journal of Intelligent Systems
Problem: Data imbalance in medical datasets poses significant challenges for the performance of machine learning models, particularly in classifying Alzheimer’s disease (AD). Aim: This study aims to investigate the impact of the data ratio on model performance using both balanced and imbalanced datasets. Methods: We employed two distinct datasets: a balanced set of 34,000 images created through augmentation techniques and an inherently imbalanced set of 6,400 images, both comprising four classes. To evaluate model performance, we utilized three state-of-the-art models: fine-tuned vision transformer (FT-ViT), fine-tuned convolutional neural network (FT-CNN), and fine-tuned swin transformer (FT-Swin). Results: The FT-ViT model achieved an impressive 99% accuracy on the imbalanced dataset and 96% on the balanced dataset. The FT-CNN model attained 97% accuracy on the imbalanced dataset and 90% on the balanced dataset, while the FT-Swin model exhibited a performance disparity, achieving 79% accuracy on the balanced dataset and 90% on the imbalanced dataset. Conclusion: Our findings demonstrate that careful model selection, fine-tuning, and hyperparameter optimization can lead to high performance on imbalanced datasets without relying solely on artificial balancing methods. This approach offers promising implications for AD classification and potentially other medical imaging applications facing similar data imbalance challenges.
- Conference Article
13
- 10.1109/citsm47753.2019.8965332
- Nov 1, 2019
The main problem in using the Naive Bayes algorithm for sentiment analysis is its sensitivity to feature selection. Chi-Square feature selection can eliminate features that have little influence. This study aimed to determine the effect of Chi-Square feature selection on the performance of the Naive Bayes algorithm in analyzing document sentiment. Data were taken from Corpus v1.0 Indonesian Movie Review, with 700 training documents and 30 test documents. Testing was done by analyzing document sentiment with and without Chi-Square feature selection. The results were then evaluated by accuracy, precision, and recall. Without feature selection, the analysis obtained 73.33% accuracy, 100.00% precision, and 65.21% recall. With Chi-Square feature selection (significance level $\alpha$ = 0.1), it obtained 93.33% accuracy, 93.33% precision, and 93.33% recall. These results show that Chi-Square feature selection affects the performance of the Naive Bayes algorithm in analyzing document sentiment.
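As a worked check of how these metrics relate, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are not reported in the abstract: they are one hypothetical assignment that reproduces the stated percentages, assuming a binary task over the 30 test documents. With rounding, the first case gives a recall of 65.22%, which matches the reported 65.21% if that figure was truncated.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return accuracy, precision, recall, f1


# Hypothetical counts consistent with the figures reported without
# feature selection (assumed, not taken from the paper).
without_selection = classification_metrics(tp=15, fp=0, fn=8, tn=7)

# Hypothetical counts consistent with the figures reported with
# Chi-Square feature selection (also assumed).
with_selection = classification_metrics(tp=14, fp=1, fn=1, tn=14)
```

The first case shows why precision alone can mislead: a model with no false positives still misses 8 of 23 positive documents.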
- Research Article
47
- 10.3390/electronics12132856
- Jun 28, 2023
- Electronics
Parkinson’s disease is the second-most common cause of death and disability as well as the most prevalent neurological disorder. In the last 15 years, the number of cases of PD has doubled. The accurate detection of PD in the early stages is one of the most challenging tasks to ensure individuals can continue to live with as little interference as possible. Yet there are not enough trained neurologists around the world to detect Parkinson’s disease in its early stages. Machine learning methods based on Artificial intelligence have acquired a lot of popularity over the past few decades in medical disease detection. However, these methods do not provide an accurate and timely diagnosis. The overall detection accuracy of machine learning-related models is inadequate. This study collected data from 31 male and female patients, including 195 voices. Approximately six recordings were created per patient, with the length of each recording extending from 1 to 36 s. These voices were recorded in a soundproof studio using an Industrial Acoustics Company (IAC) AKG-C420 head-mounted microphone. The data set was collected to investigate the diagnostic significance of speech and voice abnormalities caused by Parkinson’s disease. An imbalanced dataset is the main contributor of model overfitting and generalization errors, and hence one class has the majority of samples and the other class has minority samples. This problem is addressed in this study by utilizing the three sampling techniques. After balancing the datasets, each class has the same number of samples, which has proven valuable in improving the model’s performance and reducing the overfitting problem. Four performance metrics such as accuracy, precision, recall and f1 score are used to evaluate the effectiveness of the proposed hybrid model. 
Experiments demonstrated that the proposed model achieved 100% accuracy, recall and f1 score using the balanced dataset with the random oversampling technique and 100% precision, 97% recall, 99% AUC score and 91% f1 score with the SMOTE technique.
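For comparison with SMOTE, random oversampling simply duplicates existing minority samples until every class matches the size of the largest one. The following is a minimal sketch of that idea, not the study's code; the function name `random_oversample` and the fixed seed are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until all classes are equally large."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            # Draw with replacement from the original samples of this class.
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels
```

Because duplicated samples are exact copies, applying this before a train/test split can place the same recording in both sets and inflate scores, so it is usually applied to the training portion only.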
- Research Article
- 10.2174/0113892010366485250415101928
- Apr 21, 2025
- Current Pharmaceutical Biotechnology
Hemophilia 'A' (HA) is a genetic blood disorder characterized by a deficiency of Factor VIII (FVIII), with treatment often triggering the development of neutralizing antibodies (inhibitors) to FVIII. Predicting the development of these inhibitors is crucial for clinical applications but presents significant computational challenges due to data imbalance, skewed data, and inadequate data sanitization. This study aimed to develop a machine-learning/AI approach to find biomarkers and predict the development of inhibitors to Factor VIII in patients with Hemophilia 'A,' addressing the challenges associated with data imbalance and enhancing prediction accuracy. The data were sanitized and encoded for prediction, and the Random Over-sampling (ROS) technique was employed to resolve data imbalance in the CHAMP dataset. Several machine-learning classification models, including Random Forest, XGBoost, CatBoost, Logistic Regression, Gradient Boosting, and LightGBM, were utilized. Hyperparameters were tuned using GridSearchCV optimization with a stratified k-fold approach. The performance of the models was evaluated based on accuracy, precision, recall, and F1 scores. The Random Forest model was further analyzed using an explainable AI (XAI) tool known as SHAP (SHapley Additive exPlanations) to identify the variables influencing model performance. The Random Forest model outperformed other classifiers, achieving a mean accuracy of 97.37%, along with closely aligned precision, recall, and F1 scores. The XAI tool SHAP facilitated the ranking of variables Clinical Severity, Variant Type, Exon, HGVS cDNA, hg19 Coordinates, and others according to their impact on the model's predictions. Additionally, the study identified biomarkers associated with FVIII inhibition. This study presents a breakthrough in the early prediction of inhibitor development in Hemophilia 'A' patients, paving the way for personalized and effective treatment programs. 
The integration of the preprocessing pipeline, Random Forest model, and SHAP analysis offers a novel solution for guiding treatment strategies for HA patients, which could significantly enhance the development of targeted and effective therapies.
- Research Article
- 10.62411/jais.v9i1.9695
- Apr 21, 2025
- Journal of Applied Intelligent System
Spam email is a problem that disturbs and harms the recipient. Machine learning is widely used to combat email spam because of its ability to classify emails as spam or non-spam. In this research, the Naïve Bayes algorithm is combined with Chi-Squared feature selection to classify spam emails, with the aim of increasing classification accuracy. The feature selection method directs the model's attention to features that are related to the target variable. In this study, Chi-Squared feature selection uses K = 2500 and achieves an accuracy of 98.83%, an increase in model performance compared to previous research. The Naïve Bayes model with Chi-Squared feature selection is thus shown to provide better performance.
- Conference Article
2
- 10.1109/icmew.2019.0-112
- Jul 1, 2019
Predicting surgical complications can improve shared decision making by surgeons and patients. Recently, the use of machine learning algorithms for predicting complications has gained much attention. In this study, we used the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) database to compare the performance of five machine learning algorithms for predicting complications during spine surgery. The database included 173,449 patients who underwent spine surgery. To thoroughly evaluate and compare the proposed machine learning algorithms, the dataset was balanced and the algorithms were applied to both the balanced and imbalanced datasets. The results indicated no significant difference between the AUCs of the machine learning models for the imbalanced and balanced datasets. However, when the F1 score was used as the metric, the models trained with the balanced dataset significantly outperformed those trained with the imbalanced dataset.
- Research Article
3
- 10.1051/bioconf/20249700076
- Jan 1, 2024
- BIO Web of Conferences
The motivation behind this study stems from identifying contemporary challenges associated with prosecuting electronic financial crimes. It highlights ongoing efforts to identify and address credit card fraud, as there are many credit card fraud issues in the financial industry. Traditional methods are no longer able to keep up with modern methods of tracking the behavior of credit card users and detecting suspicious cases. Artificial intelligence technology offers promising solutions to quickly detect and prevent future fraud by credit card users. Datasets used to detect financial anomalies are affected by imbalances in financial transactions, and this study aims to address the imbalance of financial fraud datasets using adversarial algorithm techniques and compare them with the most commonly used methods in the scientific literature. The results showed that the function of the adversarial algorithm is consistent in several ways, including allowing researchers and interested parties to determine data growth rates, which helps bring the dataset closer to real-time data from financial markets and banks. This study proposes a hybrid machine learning model consisting of three machine learning algorithms (decision trees, logistic regression, and Naive Bayes) and calculates performance metrics such as accuracy, specificity, precision, and F1 score. Experimental results reveal varying degrees of accuracy in fraud detection. Model testing using the SMOTE method recorded an accuracy of 98.1% and an F-score of 98.3%. The oversampling and undersampling methods showed similar performance, recording accuracies of 94.3% and 95.3% and F-scores of 94.7% and 95.1%, respectively. Finally, the GAN method excelled, achieving a test score and accuracy of 99.9%, as well as exceptional precision, recall, and F1 score. 
As a result, we conclude that the GAN method is able to balance the data set, which in turn is reflected in the performance of the model in training and the accuracy of predictions when tested. Historical transaction analysis identifies behavioral patterns and adapts to evolving fraud techniques. This approach enhances transaction security and protects against potential financial losses due to fraud. This contribution allows financial institutions and companies to proactively combat fraudulent activities.
- Research Article
10
- 10.1088/1757-899x/546/5/052059
- Jun 1, 2019
- IOP Conference Series: Materials Science and Engineering
Diabetes mellitus, commonly referred to as diabetes, is a metabolic disorder characterized by high blood sugar levels, in which the pancreas does not produce insulin effectively. Diabetes can lead to serious complications such as blindness, kidney failure, and heart attacks. Early detection is needed so that patients can prevent the disease from becoming more severe. Because medical data are often non-normal and large, some researchers use classification methods to predict symptoms or diagnose patients. In this study, Learning Vector Quantization (LVQ) is used to classify the diabetes dataset, with Chi-Square for feature selection. The experimental results show that the best accuracy is achieved when 80% and 90% of the data are used for training, and that the performance measures (precision, recall, and F1 score) are highest when the model contains all the features in the dataset.
- Research Article
- 10.46647/ijetms.2023.v07i04.090
- Jan 1, 2023
- International Journal of Engineering Technology and Management Sciences
This research study investigates the detection of partisan bias in political social media posts through the application of the Naive Bayes algorithm. The CrowdFlower Political Social Media Posts dataset is utilized, comprising a collection of labelled posts from diverse political affiliations. The primary objective of this research is to develop an automated system that can effectively classify political posts based on their partisan biases. The study employs data pre-processing techniques, feature extraction methods, and the Naive Bayes algorithm to evaluate the performance of this approach. The findings of this research showcase the potential for accurate detection of partisan bias, contributing to a deeper understanding of political discourse on social media platforms. In order to achieve the research objectives, the study begins by exploring the prevalence of partisan bias in political discussions on social media and the subsequent influence on public opinion. A comprehensive review of text classification algorithms is conducted, highlighting the effectiveness and suitability of the Naive Bayes algorithm for this particular task. The research methodology encompasses multiple stages, including data pre-processing to standardize the text data, feature extraction using the bag-of-words approach, and training a classification model with the Naive Bayes algorithm. The model's performance is evaluated using various metrics such as accuracy, precision, recall, and F1 score.
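The pipeline described here (bag-of-words features followed by Naive Bayes training) can be illustrated with a small multinomial Naive Bayes classifier. This is a minimal sketch under stated assumptions, not the study's implementation: the class name `TinyMultinomialNB` is hypothetical, documents are assumed to be pre-tokenized lists of words, and `alpha` is an additive smoothing constant that must be greater than zero.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over bag-of-words counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # must be > 0 so unseen terms never give log(0)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)  # bag-of-words counts
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_class, best_score = None, -math.inf
        for cls, n_cls in self.class_counts.items():
            total = sum(self.term_counts[cls].values())
            score = math.log(n_cls / n_docs)  # log prior
            for t in tokens:
                if t not in self.vocab:
                    continue  # ignore terms never seen in training
                smoothed = self.term_counts[cls][t] + self.alpha
                score += math.log(smoothed / (total + self.alpha * len(self.vocab)))
            if score > best_score:
                best_class, best_score = cls, score
        return best_class
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once documents contain more than a handful of words.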
- Research Article
1
- 10.1007/s10489-017-1049-2
- Aug 19, 2017
- Applied Intelligence
Multi-class contour preserving classification is a contour conservancy technique that synthesizes two types of vectors; fundamental multi-class outpost vectors (FMCOVs) and additional multi-class outpost vectors (AMCOVs), at the judging border between classes of data to improve the classification accuracy of the feed-forward neural network. However, the number of both new vectors is tremendous, resulting in a significantly prolonged training time. Reduced multi-class contour preserving classification provides three practical methods to lessen the number of FMCOVs and AMCOVs. Nevertheless, the three reduced multi-class outpost vector methods are serial and therefore have limited applicability on modern machines with multiple CPU cores or processors. This paper presents the methodologies and the frameworks of the three parallel reduced multi-class outpost vector methods that can effectively utilize thread-level parallelism and process-level parallelism to (1) substantially lessen the number of FMCOVs and AMCOVs, (2) efficiently increase the speedups in execution times to be proportional to the number of available CPU cores or processors, and (3) significantly increase the classification performance (accuracy, precision, recall, and F1 score) of the feed-forward neural network. The experiments carried out on the balanced and imbalanced real-world multi-class data sets downloaded from the UCI machine learning repository confirmed the reduction performance, the speedups, and the classification performance aforementioned.
- Research Article
1
- 10.70393/616a6e73.333530
- Dec 3, 2025
- Academic Journal of Natural Science
Currently, credit card fraud detection is a unique problem in the financial sector, with both institutions and consumers facing increasingly significant losses. Despite the growing application of machine learning (ML) techniques in this domain, existing methods often struggle with issues such as high false-positive rates, imbalanced data, and the complexity of evolving fraud patterns. This paper investigates the comparative performance of various machine learning models in credit card fraud detection, focusing on traditional models (such as Support Vector Machine and Decision Tree), ensemble methods (Random Forest, XGBoost), and deep learning models (Multilayer Perceptron, Artificial Neural Networks). Three distinct datasets, including both balanced and imbalanced sets, are used to evaluate these models. The results indicate that ensemble models like Random Forest and XGBoost demonstrate superior performance, particularly in terms of accuracy, precision, recall, and F1 score, when compared to other models. However, models such as Support Vector Machine and Artificial Neural Networks exhibit lower recall in imbalanced datasets, suggesting potential limitations in their application to real-world fraud detection scenarios. This study also identifies key challenges, such as the difficulty in adapting to dynamic fraud strategies and the need for real-time monitoring. Future research directions are proposed, including the integration of deep learning architectures and adaptive learning mechanisms to enhance the detection system’s real-time response and accuracy. The findings provide a robust foundation for further development of credit card fraud detection systems and offer practical insights for financial institutions seeking to mitigate fraud-related risks.