Utilization of synthetic minority oversampling technique for improving potato yield prediction using remote sensing data and machine learning algorithms with small sample size of yield data
- # Synthetic Minority Oversampling Technique Algorithm
- # Machine Learning Algorithms
- # Synthetic Data
- # Machine Learning
- # Random Forest Regression
- # Nearest Neighbor
- # Performance Of Machine Learning Algorithms
- # Random Forest Regression Algorithm
- # Support Vector Regression
- # Synthetic Minority Oversampling Technique
- Research Article
- 10.53560/ppasa(60-4)820
- Dec 12, 2023
- Proceedings of the Pakistan Academy of Sciences: A. Physical and Computational Sciences
The issue of precise crop prediction has gained worldwide attention amid food security concerns. In this study, the efficacies of different machine learning (ML) algorithms, i.e., multiple linear regression (MLR), decision tree regression (DTR), random forest regression (RFR), and support vector regression (SVR), are integrated to predict wheat productivity. The performance of the ML algorithms is then measured to identify the optimized model. The updated dataset is collected from the Crop Reporting Service for various agronomical constraints. Randomized data partitions, hyper-parametric tuning, complexity analysis, cross-validation measures, learning curves, evaluation metrics and prediction errors are used to get the optimized model. The ML models are trained on 75% of the dataset and tested on the remaining 25%. RFR achieved the highest R2 value of 0.90 for the training model, followed by DTR, MLR, and SVR. In the testing model, RFR also achieved an R2 value of 0.74, followed by MLR, DTR, and SVR. The lowest prediction error (P.E) is found for RFR, followed by DTR, MLR, and SVR. K-Fold cross-validation measures also indicate that RFR is the optimized model when compared with DTR, MLR and SVR.
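The 75%/25% randomized partition and the K-Fold cross-validation mentioned in this abstract can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's code; the function names, the fixed seed, and the choice of contiguous folds are assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and return (train, test) lists.

    test_fraction=0.25 mirrors the 75%/25% partition in the abstract;
    the seed is a hypothetical choice for reproducibility.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds.

    The first n_samples % k folds receive one extra sample so that
    every index appears in exactly one validation fold.
    """
    start = 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size
```

Each model would be fitted on the training indices of a fold and scored on its validation indices, and the scores averaged across folds.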
- Research Article
64
- 10.1155/2019/7816154
- Jan 1, 2019
- Mathematical Problems in Engineering
Investors trade stocks according to forecasts of stock price trends. In recent years, many researchers have focused on adopting machine learning (ML) algorithms to predict stock price trends. However, their studies were carried out on small stock datasets with limited features and short backtesting periods, did not consider transaction cost, and their experimental results lack statistical significance tests. In this paper, on large-scale stock datasets, we systematically evaluate various ML algorithms and observe the daily trading performance of stocks with and without transaction cost. In particular, we use two large datasets of 424 S&P 500 index component stocks (SPICS) and 185 CSI 300 index component stocks (CSICS) from 2010 to 2017 and compare six traditional ML algorithms and six advanced deep neural network (DNN) models on these two datasets, respectively. The experimental results demonstrate that traditional ML algorithms perform better on most of the directional evaluation indicators. Unexpectedly, the performance of some traditional ML algorithms is not much worse than that of the best DNN models when transaction cost is not considered. Moreover, the trading performance of all ML algorithms is sensitive to changes in transaction cost. Compared with the traditional ML algorithms, DNN models perform better when transaction cost is considered. Meanwhile, transparent transaction cost and implicit transaction cost affect trading performance differently. Our conclusions are significant for choosing the best algorithm for stock trading in different markets.
- Research Article
4
- 10.2139/ssrn.3705288
- Jan 1, 2020
- SSRN Electronic Journal
Comparative Performance of Machine Learning Algorithms in Modelling Daylight in Indoor Spaces
- Research Article
24
- 10.1007/s10706-021-01867-z
- Jun 3, 2021
- Geotechnical and Geological Engineering
One of the main challenges that deep mining faces is the occurrence of rockburst phenomena. Rockburst prediction with the use of machine learning (ML) is currently gaining attention, as its prognosis capability in many cases outperforms widely used empirical approaches. However, the data required for conducting any analysis are limited, and their recorded instances are imbalanced across rockburst intensities. These issues, combined with the multiparametric nature of the phenomenon, can deteriorate the performance of the ML algorithms. This study focuses on enhancing the prediction performance of ML algorithms by utilizing the oversampling technique Synthetic Minority Oversampling TEchnique (SMOTE). Five ML algorithms, namely Decision Trees, Naïve Bayes, K-Nearest Neighbor, Random Forest and Logistic Regression, were used in a series of parametric analyses considering different combinations of input parameters, such as the maximum tangential stress, the uniaxial compressive and tensile strength, the stress coefficient, two brittleness coefficients and the elastic energy index. All models kept their hyperparameters fixed and were trained with the initial dataset, to which synthetic instances were added gradually, aiming at the attainment of a balanced dataset and its further expansion, until the number of synthetic instances reached the number of real data. An assessment of the SMOTE technique is given and its performance is evaluated through the different strategies adopted. The results indicate that SMOTE has a considerable positive effect on the accuracy of the overall classification and especially on the within-class classification accuracy, even after the balancing of the dataset.
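The core idea of SMOTE, as used in this abstract, is to create a synthetic minority instance by interpolating between a real minority sample and one of its nearest minority neighbours. The sketch below is a simplified standard-library illustration of that interpolation step, not the study's implementation; the function name `smote`, the default `k=5`, and the seed are assumptions.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points from a list of minority-class tuples.

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j]))
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic
```

Calling `smote` repeatedly with a growing `n_synthetic` would correspond to the gradual addition of synthetic instances described above, for example until the synthetic count matches the number of real samples.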
- Research Article
49
- 10.1186/s40807-023-00078-9
- Jun 19, 2023
- Sustainable Energy Research
Globally, the construction industry is experiencing an increase in energy demand, which has significant environmental and economic repercussions. To address these issues, it is now possible for buildings, vehicles, and renewable energy sources to collaborate and function as an advanced, integrated, and environmentally favorable system that meets the high energy demands of contemporary buildings. To attain maximum efficiency, however, it is necessary to create reliable energy demand forecasting models. In this research, by introducing the energy model of a neighbourhood of buildings with solar panels and electric vehicles, the final balance of energy production and consumption is predicted for each building and for the whole neighbourhood as a micro grid. DesignBuilder is used to model the neighbourhood buildings, and the K-Nearest Neighbor (KNN), Support Vector Regression (SVR), Adaptive Boosting (AdaBoost), and Deep Neural Network (DNN) machine learning algorithms are used to predict the final energy balance. A comparative analysis of the performance of the KNN, SVR, AdaBoost, and DNN algorithms was conducted to determine which algorithm is the most effective in predicting energy balance. Finally, the Root Mean Square Error (RMSE) was used to validate the prediction models. The results show that the KNN, SVR, AdaBoost, and DNN algorithms had RMSE values of 0.56, 0.92, 0.95, and 0.53, respectively. Among these algorithms, DNN and KNN gave more accurate results than the others, and as a result an accurate forecast of the neighbourhood energy balance was made. This study takes a novel approach by developing a model that takes into account an integrated system of houses, solar cells, and electric consumption for each building in a neighbourhood, which can help to optimize energy consumption and reduce environmental impact.
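For reference, the RMSE used to validate these models is the square root of the mean squared difference between observed and predicted values. The sketch below shows that calculation and ranks the four RMSE values reported in the abstract; the function name and the dictionary `reported_rmse` are illustrative names, not taken from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Square Error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# RMSE values as reported in the abstract above.
reported_rmse = {"KNN": 0.56, "SVR": 0.92, "AdaBoost": 0.95, "DNN": 0.53}

# A lower RMSE means a closer fit, so sort in ascending order.
ranking = sorted(reported_rmse, key=reported_rmse.get)
```

With the reported values, `ranking` places DNN first and KNN second, which matches the abstract's statement that these two were the most accurate.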
- Research Article
62
- 10.3390/su12114471
- Jun 1, 2020
- Sustainability
The performance of machine learning (ML) algorithms depends on the nature of the problem at hand. ML-based modeling, therefore, should employ suitable algorithms where optimum results are desired. The purpose of the current study was to explore the potential applications of ML algorithms in modeling daylight in indoor spaces and ultimately identify the optimum algorithm. We thus developed and compared the performance of four common ML algorithms (generalized linear models, deep neural networks, random forest, and gradient boosting models) in predicting the distribution of indoor daylight illuminances. We found that deep neural networks, which showed a coefficient of determination (R2) of 0.99, outperformed the other algorithms. Additionally, we explored the use of long short-term memory to forecast the distribution of daylight at a particular future time. Our results show that long short-term memory is accurate and reliable (R2 = 0.92). Our findings provide a basis for discussions on ML algorithms’ use in modeling daylight in indoor spaces, which may ultimately result in efficient tools for estimating daylight performance in the primary stages of building design and daylight control schemes for energy efficiency.
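The coefficient of determination quoted in this abstract is conventionally defined as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that standard definition follows; the function name is an assumption, and the study may have computed R2 with a library routine instead.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return 1 - ss_res / ss_tot
```

A perfect prediction gives 1.0, and a model that always predicts the mean of the observations gives 0.0.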
- Research Article
12
- 10.1016/j.saa.2024.124979
- Aug 13, 2024
- Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy
Fast identification of overlapping fluorescence spectra of oil species based on LDA and two-dimensional convolutional neural network
- Research Article
31
- 10.1016/j.csite.2024.104124
- Feb 12, 2024
- Case Studies in Thermal Engineering
Interactive effects of hyperparameter optimization techniques and data characteristics on the performance of machine learning algorithms for building energy metamodeling
- Research Article
1
- 10.24059/olj.v29i1.4390
- Mar 1, 2025
- Online Learning
Predicting learner performance with precision is critical within educational systems, offering a basis for tailored interventions and instruction. The advent of big data analytics presents an opportunity to employ Machine Learning (ML) techniques to this end. Real-world data availability is often hampered by privacy concerns, prompting a shift towards synthetic data generation. This study presents an empirical comparison of real, synthetic, and mixed (real + synthetic) data sets in forecasting learner performance, deploying an array of regression-based ML algorithms, including Random Forest, Gradient Boosting, XGBoost, K-Nearest Neighbor, and Support Vector Regression. Our methodology encompasses the generation of synthetic data via a generative model, followed by the application of these algorithms to each data set. The models are evaluated using precision metrics to assess their predictive accuracy. The study shows that synthetic data can rival real data in predictive capabilities, with combined data sets achieving up to 87.76% accuracy, underscoring the efficacy of hybrid data approaches. These insights advocate for the integration of synthetic data as a practical substitute in scenarios with limited access to real data, fostering advancements in educational technology and ML.
- Research Article
3
- 10.21597/jist.1222764
- Jun 1, 2023
- Iğdır Üniversitesi Fen Bilimleri Enstitüsü Dergisi
Cervical cancer is one of the cancers that can be treated most successfully when diagnosed early. This study aims to detect and classify the disease with data mining methods on the digitized data set obtained from the pap-smear test. A two-stage architecture is proposed for the diagnosis of cervical cancer. In the first stage of the study, missing data were extracted from the dataset used, and in the second stage, a new dataset was obtained by using the Synthetic Minority Oversampling Technique (SMOTE) algorithm to balance the target classes in the dataset. By applying the majority voting (MV) method to the dataset used in the study, the structure with 4 target variables was reduced to a single target variable. On the two data sets, the Artificial Neural Network (ANN), Support Vector Machines (SVM), Decision Trees (DT), Random Forest (RF), and K-Nearest Neighbors (KNN) data mining algorithms were used for the diagnosis of cervical cancer. The results obtained from the original dataset and the dataset produced with SMOTE were compared. ANN was the best method according to classification success and F-score, and the majority-voted target variable in the balanced data group produced with the SMOTE algorithm gave the most successful result. The experimental results showed that using the MV and SMOTE algorithms together increased the classification success from 93% to 99%.
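The majority voting step, which collapses several binary target variables into one, can be illustrated with the short sketch below. The abstract does not say how ties among the 4 targets were resolved, so the `tie_value` parameter (defaulting to the positive class) is purely an assumption of this example, as is the function name.

```python
def majority_vote(labels, tie_value=1):
    """Reduce a sequence of binary labels (0/1) to a single label.

    tie_value is a hypothetical tie-breaking rule; the source does not
    specify what happens when positives and negatives are equal.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == negatives:
        return tie_value
    return 1 if positives > negatives else 0


def reduce_targets(rows, tie_value=1):
    """Apply majority_vote to each row of per-sample target labels."""
    return [majority_vote(row, tie_value) for row in rows]
```

For example, a sample with target labels `[1, 1, 1, 0]` is reduced to the single label 1, and `[0, 0, 0, 1]` to 0.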
- Research Article
31
- 10.1016/j.ceramint.2022.06.156
- Jun 16, 2022
- Ceramics International
Exploration of the oxidation and ablation resistance of ultra-high-temperature ceramic coatings using machine learning
- Conference Article
3
- 10.1109/ipccc51483.2021.9679418
- Oct 29, 2021
Wi-Fi fingerprinting techniques for indoor positioning systems (IPS) have been extensively studied due to their high precision and reliability. However, the offline site surveys needed to collect updated fingerprints are costly, laborious and time-consuming. Significant efforts have been made to reduce the time-consuming site surveys, such as the use of interpolation techniques and Generative Adversarial Network (GAN) deep learning approaches. A drawback of using a GAN is the difficulty of determining training sufficiency, whereas for interpolation, the accuracy of the generated fingerprints can be inadequate. In this paper, a novel fingerprint map construction technique based on the Synthetic Minority Over-sampling Technique (SMOTE) algorithm is proposed to generate synthetic fingerprints in areas that are difficult to reach or are not regularly visited during offline site surveys. Such areas lead to an imbalanced dataset, in which certain regions are populated with more data points while others are underpopulated. To simulate this situation, a dataset was first modified to create an imbalanced dataset with a minority class, and it was then rebalanced using the SMOTE algorithm. Experimental results show that the proposed scheme can achieve similar accuracy and Root Mean Square Error (RMSE) to the original dataset without SMOTE being applied. Although the accuracy deteriorates as more synthetic data is produced, it remains within an acceptable range of 0.64%. As a result, we can overcome the imbalanced dataset problem for IPS and build a fingerprint database with fewer data points, using SMOTE-generated synthetic data to reduce the cost of data collection.
- Research Article
- 10.52756/ijerr.2024.v45spl.005
- Nov 30, 2024
- International Journal of Experimental Research and Review
Analyzing user interface (UI) bugs is an important step taken by testers and developers to assess the usability of the software product. UI bug classification helps in understanding the nature and cause of software failures. Manually classifying thousands of bugs is an inefficient and tedious job for both testers and developers. The objective of this research is to develop a classification model for UI-related bugs using supervised Machine Learning (ML) algorithms and Natural Language Processing (NLP) techniques, and to assess the effect of different sampling and feature vectorization techniques on the performance of ML algorithms. Classification is based upon the ‘Summary’ feature of the bug report and utilizes six classifiers, i.e., Gaussian Naïve Bayes (GNB), Multinomial Naïve Bayes (MNB), Logistic Regression (LR), Support Vector Machines (SVM), Random Forest (RF) and Gradient Boosting (GB). The dataset obtained is vectorized using two NLP vectorization techniques, i.e., Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). ML models are trained after vectorization and data balancing. The models' hyperparameter tuning (HT) has also been done using the grid search approach to improve their efficacy. This work provides a comparative performance analysis of ML techniques using Accuracy, Precision, Recall and F1 Score. Performance results showed that a UI bug classification model can be built by training a tuned SVM classifier using TF-IDF and SMOTE (Synthetic Minority Oversampling Technique). The SVM classifier provided the highest performance measure with Accuracy: 0.88, Precision: 0.86, Recall: 0.85 and F1: 0.85. Results also indicated that the performance of ML algorithms with TF-IDF is better than with BoW in most cases. This work provides classification of bugs that are related to only the user interface. 
Also, the effect of two different feature extraction techniques and sampling techniques on algorithms were analyzed, adding novelty to the research work.
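To make the two vectorization techniques concrete, the sketch below builds Bag of Words counts and TF-IDF weights for a list of short texts. It uses the textbook definition idf = log(N / df) with whitespace tokenization; this is an assumption for illustration, and library implementations (which the study may have used) often apply smoothing and normalization that give different numbers. The function names and the sample bug summaries are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    return vocab, [[c[word] for word in vocab] for c in counts]


def tf_idf(docs):
    """Return (vocabulary, TF-IDF matrix) using idf = log(N / df)."""
    vocab, bow = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of docs in which each word appears.
    df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weights = []
    for row in bow:
        total = sum(row) or 1  # guard against empty documents
        weights.append([(row[j] / total) * idf[j] for j in range(len(vocab))])
    return vocab, weights


# Hypothetical bug-report summaries, invented for this example.
summaries = ["button not clickable", "button overlaps text"]
vocab, weights = tf_idf(summaries)
```

A word that appears in every summary (here "button") receives a TF-IDF weight of zero, whereas BoW would still count it; that difference is one intuition for why the two vectorizations can lead to different classifier performance.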
- Research Article
2
- 10.3389/fendo.2025.1486350
- Mar 20, 2025
- Frontiers in Endocrinology
Medication adherence plays a crucial role in determining the health outcomes of patients, particularly those with chronic conditions like type 2 diabetes. Despite its significance, there is limited evidence regarding the use of machine learning (ML) algorithms to predict medication adherence within the Ethiopian population. The primary objective of this study was to develop and evaluate ML models designed to classify and monitor medication adherence levels among patients with type 2 diabetes in Ethiopia, to improve patient care and health outcomes. Using a random sampling technique in a cross-sectional study, we obtained data from 403 patients with type 2 diabetes at the University of Gondar Comprehensive Specialized Hospital (UoGCSH), excluding 13 subjects who were unable to respond and 6 with incomplete data from an initial cohort of 422. Medication adherence was assessed using the General Medication Adherence Scale (GMAS), an eleven-item Likert scale questionnaire. The responses served as features to train and test machine learning (ML) models. To address data imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied. The dataset was split using stratified K-fold cross-validation to preserve the distribution of adherence levels. Eight widely used ML algorithms were employed to develop the models, and their performance was evaluated using metrics such as accuracy, precision, recall, and F1 score. The best-performing model was subsequently deployed for further analysis. Out of 422 enrolled patients, 403 data samples were collected, with 11 features extracted from each respondent. To mitigate potential class imbalance, the dataset was increased to 620 samples using the Synthetic Minority Over-sampling Technique (SMOTE). 
Machine learning models including Logistic Regression (LR), Support Vector Machine (SVM), K Nearest Neighbor (KNN), Decision Tree (DT), Random Forest (RF), Gradient Boost Classifier (GBC), Multilayer Perceptron (MLP), and 1D Convolutional Neural Network (1DCNN) were developed and evaluated. Although the performance differences among the models were subtle (within a range of 0.001), the SVM classifier outperformed the others, achieving a recall of 0.9979 and an AUC of 0.9998. Consequently, the SVM model was selected for deployment to monitor and detect patients' medication adherence levels, enabling timely interventions to improve patient outcomes. This study highlights a variety of machine learning (ML) models that can be effectively used to monitor and classify medication adherence in diabetic patients in Ethiopia. However, to fully realize the potential impact of digital health applications, further studies that include patients from diverse settings are necessary. Such research could enhance the generalizability of these models and provide insights into the broader applicability of digital tools for improving medication adherence and patient outcomes in varying healthcare contexts.
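The stratified K-fold split mentioned in this abstract keeps the proportion of each adherence level roughly equal across folds. One simple way to achieve that is to deal the indices of each class round-robin into the folds, as sketched below. This is an illustrative standard-library version without shuffling, not the study's code; the function name and default `k` are assumptions.

```python
from collections import defaultdict


def stratified_k_fold(labels, k=5):
    """Return k lists of sample indices with similar class proportions.

    Indices of each class are dealt round-robin into the folds, so the
    per-class counts of any two folds differ by at most one.
    """
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in indices_by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds
```

Each fold then serves once as the validation set while the remaining folds are used for training.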
- Research Article
1
- 10.1108/k-01-2024-0188
- Oct 28, 2024
- Kybernetes
Purpose: This study aims to tackle the critical issue of detecting stock market manipulation, which undermines the integrity and stability of financial markets globally. Even enhanced with machine learning, traditional statistical methods often struggle to analyze high-frequency trading data effectively due to inherent noise and the limited availability of publicly known manipulation cases. This leads to poor model generalization and a tendency toward over-fitting. Focusing on China's securities market, our study introduces an innovative approach that employs deep learning-based high-frequency jump tests to overcome these challenges and to develop a more effective method for identifying manipulative activities. Design/methodology/approach: We employed the “Jump Variation – Time-of-Day” (JV-TOD) non-parametric technique for jump tests on high-frequency data, coupled with the synthetic minority over-sampling technique (SMOTE) algorithm for re-balancing sample data. Our approach trains a deep neural network (DNN) on refined data to enhance its ability to identify manipulation patterns accurately. Findings: Our results show that the deep neural network model, calibrated with high-frequency price jump data, identifies manipulation behavior more specifically and accurately than traditional models. The model achieved an accuracy rate of 94.64%, an F1-score of 95.26% and a recall rate of 95.88%, significantly outperforming traditional models. These results demonstrate the effectiveness of our approach in mitigating over-fitting and improving the robustness of market manipulation detection. Practical implications: The proposed model provides regulatory entities and financial institutions with a more efficient tool to monitor and counteract market manipulation, thereby improving market fairness and investor protection. Originality/value: By integrating the JV-TOD jump test with deep learning, this study proposed a new approach to market manipulation detection. 
The innovation is in its capacity to detect subtle manipulation signals that traditional methods typically overlook. Our model, which is trained on jump test data enhanced by the SMOTE algorithm, excels at learning complex manipulation patterns. This enhances both detection accuracy and robustness. In contrast to existing methods that are challenged by the noisy and intricate nature of high-frequency data, our approach shows enhanced performance in identifying nuanced market manipulations, offering a more effective and reliable method for detecting market manipulation.
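The accuracy, recall and F1-score figures reported in this abstract follow from the counts of a binary confusion matrix. The sketch below shows the standard definitions; the function name, the dictionary keys, and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which is consistent with a reported F1-score of 95.26% alongside a recall of 95.88%.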