STUDENT GRADUATION TIME PREDICTION USING LOGISTIC REGRESSION, DECISION TREE, SUPPORT VECTOR MACHINE, AND ADABOOST ENSEMBLE LEARNING
Universities in Indonesia are working hard to improve their students' graduation rates, which are considered a measure of success and quality for accreditation. This study analyzes the effectiveness of machine learning algorithms (logistic regression, Support Vector Machine (SVM), Decision Tree (DT), and ensemble learning with AdaBoost) in predicting whether Computer Science students will graduate on time. The data used for this analysis consists of student records from 2015 to 2019 and includes 14 variables. To understand the relationships between these variables, a two-dimensional visualization called a heatmap was employed. The research findings indicate that the Support Vector Machine (SVM) and AdaBoost Decision Tree (DT) algorithms perform better than the other algorithms. The Decision Tree and AdaBoost (DT) models achieved F1-scores of 0.76 and 0.82, respectively. This research contributes to enhancing education management by facilitating decision making to ensure timely graduation for students.
- Research Article
9
- 10.1016/j.foodcont.2024.110604
- May 29, 2024
- Food Control
Establishing the traceability of meat products has been a major focus of food science in recent decades. In this context, recent advances in food nutritional biomarker identification and improvements in statistical technology have allowed for more accurate identification and classification of food products. Moreover, artificial intelligence has now provided a new opportunity for optimizing existing methods to identify animal products. This study presents a comparative analysis of the effectiveness of different machine learning algorithms based on raw data from analyses of organoleptic, sensory and nutritional meat traits to differentiate categories of commercial lamb from an indigenous Spanish breed (Mallorquina breed) obtained from the following production systems: suckling lambs; light lambs from grazing; and light lambs from grazing supplemented with grain. Six machine learning algorithms were evaluated: Artificial Neural Network (ANN), Decision Tree, K-Nearest Neighbours (KNN), Naive Bayes, Multinomial Logistic Regression, and Support Vector Machine (SVM). For each algorithm, we tested three datasets, namely organoleptic traits and sensorial traits (CIELAB colour, water holding capacity, Warner-Bratzler shear force, volatile compounds and trained tasters), and nutritional traits (proximate composition and fatty acid profile). We also tested a combination of all three datasets. All the data were combined into a dataset with 144 variables resulting from the meat characterization, which included 11,232 event records. The ANN algorithm stood out for its high score with each of the three datasets used. In fact, we obtained an overall accuracy of 0.88, 0.83, and 0.88 for the organoleptic-sensory, nutritional, and combined datasets, respectively. 
The effectiveness of using the SVM algorithm to assign categories of lambs according to its production system performed better with nutritional traits and the full characterization, with performances equal to those obtained with ANN. The KNN algorithm showed the worst performance, with overall accuracies of 0.54 or lower for each of the datasets used. The results of this study demonstrate that machine learning is a useful tool for classifying commercial lamb carcasses. In fact, the ANN and SVM algorithms could be proposed as tools for differentiating categories of lamb production based on the organoleptic, sensory and nutritional characteristics of Mediterranean light lambs' meat. However, in order to improve the traceability methods of lamb meat production systems as a guarantee for consumers and to improve the learning processes used by these algorithms, more studies along these lines with other lamb breeds are required.
- Research Article
- 10.30853/phil20240264
- Jun 13, 2024
- Philology. Theory and Practice
The aim of the study is to determine the optimal classifier for identifying an emotional state based on the results of a comparative analysis of the effectiveness of various machine learning algorithms using a combination of prosodic and spectral features. The scientific novelty consists in the application of ML algorithms to the recognition of emotionally marked speech of North Caucasian bilinguals in the problem of binary classification of the presence or absence of an accent, with the determination of the optimal combination of universal prosodic and spectral features. During the study, an experimental corpus of speech of representatives of three ethnic groups (Russians, Kabardians and Armenians) was created with an annotation of the degree of accent; prosodic (94 features) and spectral (74 features) characteristics were extracted from speech signals; and a comparative analysis of the effectiveness of machine learning algorithms (logistic regression, k-nearest neighbors, support vector machines, decision trees) in the problem of binary classification of the presence/absence of emphasis was carried out. The results of the study showed that at the syllabic level, the most effective is the decision tree model with combined features, and at the phrasal level, the k-nearest neighbor model with prosodic features. Universal prosodic features that form the basis of the "language model of emotions" were identified, as well as typological differences in their implementation, reflecting the influence of the native language on the emotional speech of bilinguals.
- Research Article
- 10.18137/cardiometry.2022.25.872877
- Feb 14, 2023
- CARDIOMETRY
Aim: The major goal of this research is to improve the accuracy of the Decision Tree (DT) and Support vector machine (SVM) algorithms and compare their efficiency in detecting breast cancer tumors. Materials and Methods: This work depends on the data obtained from the UCI Machine Learning Repository and used to acquire the data sets for the research of Innovative breast cancer prediction using machine learning algorithms. The sample size of breast cancer prediction involves two groups: Decision tree (N=20) and Support vector machine (N=20) according to clincalc.com by keeping 0.05 alpha error-threshold, 95% confidence interval, enrollment ratio as 0:1, and 80% G power. The accuracy, sensitivity, and precision are calculated using MATLAB software. Result: The accuracy of the DT is 83.83% (p<0.001) while the accuracy rate of the Support vector machine is 97.50%. The Decision tree outcomes have a sensitivity and precision rate of 87.46% (p<0.001) and 84.13% (p<0.001) respectively, whereas the Support vector machine sensitivity and the precision rate are 95.83% and 100% respectively. Conclusion: Support vector machine algorithm performed significantly better with improved accuracy of 97.50% for breast cancer prediction.
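The accuracy, sensitivity, and precision reported above are all derived from a binary confusion matrix. A minimal Python sketch of those three definitions follows; the study itself used MATLAB, and the function name and the counts below are hypothetical illustrations, not data from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, sensitivity, and precision from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. The caller must supply non-degenerate counts
    (at least one actual positive and one predicted positive).
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),   # share of actual positives that were found
        "precision": tp / (tp + fp),     # share of predicted positives that are correct
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=45, fp=5, fn=5, tn=45)
print(metrics)
```

With these illustrative counts all three measures equal 0.9; in practice they diverge whenever false positives and false negatives are unbalanced.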
- Research Article
- 10.47065/bits.v6i1.5278
- Jun 23, 2024
- Building of Informatics, Technology and Science (BITS)
The number of stock exchange investors in Indonesia reached 5.34 million by the end of December 2023. This figure is dominated by millennial generation investors, indicating a growing confidence in the fundamentals and economic prospects of the Indonesian capital market. However, the lack of financial literacy among this generation often results in ineffective and high-risk investments. Many millennials choose stocks based on short-term trends or recommendations that lack analysis. To address this issue, a more structured approach to stock selection is required. One method that can be employed is the classification of a company's performance based on its performance using various financial indicators and ratios. As the performance of a company affects the movement of its stock value, this research will compare Support Vector Machine and Decision Tree with the One Against All approach in classifying company performance. The features used for the classification of company performance consist of three financial ratios: profitability (ROA), liquidity (CR), and leverage (DER). The labels or targets in the classification are divided into three categories: normal, good, and unfavorable. This research will consider evaluations such as accuracy, cross validation, and confusion matrix. The results of the Support Vector Machine (SVM) algorithm demonstrated an accuracy of 86.67%, while the Decision Tree (DT) algorithm exhibited an accuracy of 93.33%. Consequently, the DT algorithm produced more accurate results than the SVM algorithm in classification.
- Book Chapter
2
- 10.3233/apc220028
- Nov 3, 2022
The iris dataset will be classified using the support vector machine and decision tree algorithms. The flower dataset is used to identify the pattern and classify it. The dataset has 150 rows and 5 attributes, which contains 50 samples from each species. There are three species in this dataset. Iris flower classification can be performed using support vector machines and decision tree algorithms. SVM stands for Support Vector Machine, and is a supervised machine learning technique that can be used for classification and regression. The Decision Tree algorithm is a simple approach mainly used for classification and prediction. The sample size has been determined to be 20 for both the groups using G Power 80%. The Support Vector Machine algorithm provides a mean accuracy of 98.09% when compared to the Decision Tree algorithm, with a mean accuracy of 95.55%. A statistically insignificant difference was observed between the Decision Tree and the Support Vector Machine, p = 0.92 (> 0.05) based on 2-tailed analysis. In the classification of Iris flowers, the Support Vector Machine outperformed the Decision Tree Algorithm.
- Research Article
38
- 10.1155/2021/7194728
- Jul 19, 2021
- Mathematical Problems in Engineering
In ordinary credit card datasets, there are far fewer fraudulent transactions than ordinary transactions. In dealing with the credit card imbalance problem, the ideal solution must have low bias and low variance. The paper aims to provide an in-depth experimental investigation of the effect of using a hybrid data-point approach to resolve the class misclassification problem in imbalanced credit card datasets. The goal of the research was to use a novel technique to manage unbalanced datasets to improve the effectiveness of machine learning algorithms in detecting fraud or anomalous patterns in huge volumes of financial transaction records where the class distribution was imbalanced. The paper proposed using random forest and a hybrid data-point approach combining feature selection with Near Miss-based undersampling technique. We assessed the proposed method on two imbalanced credit card datasets, namely, the European Credit Card dataset and the UCI Credit Card dataset. The experimental results were reported using performance matrices. We compared the classification results of logistic regression, support vector machine, decision tree, and random forest before and after using our approach. The findings showed that the proposed approach improved the predictive accuracy of the logistic regression, support vector machine, decision tree, and random forest algorithms in credit card datasets. Furthermore, we found that, out of the four algorithms, the random forest produced the best results.
- Conference Article
1
- 10.56952/arma-2022-0031
- Jun 26, 2022
ABSTRACT: Full characterization of the reservoir formations using a limited number of data is a promising process that can be granted through the integration of machine learning algorithms. Under certain circumstances, the measurement of reservoir properties could be an expensive operation and using empirical correlations to estimate these properties may not solve the problem. In this work, an artificial neural network algorithm was developed using MATLAB software to predict the porosity, the volume of shale and water saturation logs of well F14 from the Volve Field. Through multiple estimations of mean squared error, we induce that the most optimal number of hidden neurons within this input dataset is 10. The test results show that the correlations for porosity, shale volume and water saturation are around 0.997, 0.998 and 0.866, respectively. This indicates the perfect matching of the predictions with the actual data. Besides, supervised classification of the geological layers was done using decision trees and support vector machine algorithms. The optimum number of branches that construct the decision tree is found to be 20. The best quality of fitting was obtained using decision trees algorithm with observed accuracy and actual accuracy of 89.6% and 61.2%, respectively. 1. INTRODUCTION Nowadays, the collection and processing of a large set of data for multiple investigations related to addressing the industrial problems represent the most challenging issues, and applying conventional analyses may not be appropriate for extracting useful information due to the time-consuming at each operation and the higher complexity of the process. For this purpose, a lot of research was devoted to handling these problems through the integration of data mining as a major concept for the treatment and the interpretation of a variety of results in a more accurate way (Sharma and Sharma, 2018; Angra and Ahuja, 2017; Das, Dey, et al., 2015). 
Thus, machine learning has gotten increasing attention, especially in the field of petroleum engineering. This technique lies in finding the correlations and the rules that can best describe the behavior of the outputs in line with the expected change in inputs properties. However, many algorithms were developed for general purposes and then applied to certain studies related to oil production enhancement and several domains in petroleum engineering (Khan, Alnuaim, et al., 2019; Hegde and Gray, 2017). This includes the prediction of wellbore logs and the composition of earth formation, which is more promising in terms of reducing the number of dispenses that can be generated by the extensive measures that cover a large scale of investigation, while it is efficient and sufficient that the full identification of the formation can be done at a narrower range of depth. Furthermore, assuming a general model that can be parametrized under the most common change in properties of nearby wells in a particular region is highly recommended as an alternative method that can limit the implantation of real logs measurements in the non-exploitable area (Saputro et al., 2016; Anifowose et al., 2017, Ala Eddine et al., 2022). To do this, it is necessary to use specific algorithms that can be valid to perform these types of predictions. Several researchers reported that the utilization of Artificial Neural Network (ANN) in the prediction of wireline logs has resulted in a good fit with what can be measured using the field’s equipment (Mohaghegh, 2000; Gharib, Elsakka, et al., 2018; Baneshi, Behzadijo, et al., 2013). This has granted a new optimized method for the characterization of formations in a more efficient and accessible way. However, the extracted correlation from these approaches is needed for a better understanding of the plausible contribution of each input to the overall wireline logs. 
The purpose of the present study is to investigate the correlation between particular logs and the distribution of porosity, shale volume, and water saturation using ANN, based on a comparison between the obtained results and what has been recorded so far using real-world instruments. The paper also includes a study of the variation in the composition of the formation as a function of depth by applying a classification method with two predicted classes of rocks, sandstone and carbonate. To this end, support vector machine (SVM) and decision tree (DT) algorithms were developed separately to find the algorithm that gives the higher prediction accuracy.
- Research Article
1
- 10.5755/j01.erem.79.4.33913
- Dec 22, 2023
- Environmental Research, Engineering and Management
Air pollution, particularly fine particulate matter with a diameter of 2.5 micrometers or less (PM2.5), is a significant public health concern in many regions worldwide, including the northeastern region of Thailand. This study investigates the correlation between PM2.5 concentrations and meteorological spatial datasets such as surface relative humidity (SRH), surface wind speed (SPD), visibility (Vis), surface temperature (ST), and aerosol optical thickness (AOT) in the region. GIS techniques and the inverse distance weighting technique were used to create spatial maps of the meteorological datasets and ground station PM2.5 measurements. Pearson correlation analysis was performed to examine the relationship between PM2.5 and the meteorological datasets. Decision tree and support vector machine (SVM) algorithms were employed to estimate PM2.5 concentrations based on the spatial datasets. The results showed that Vis and ST have a moderate positive linear relationship with PM2.5, while AOT has a moderate negative linear relationship. SRH and SPD have weak relationships with PM2.5. The decision tree and SVM algorithms demonstrated a strong positive correlation between estimated and measured PM2.5 concentrations. The study shows that machine learning algorithms can be effective tools for estimating PM2.5 concentration based on AOT data, and feature selection can improve model performance. Ensemble learning could be employed to further improve model performance, particularly in regions with high spatial variability. Overall, the study provides a promising approach for estimating PM2.5 concentration using machine learning algorithms and AOT data.
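The inverse distance weighting step mentioned above can be sketched in a few lines of Python. This is a minimal illustration assuming planar coordinates and a power parameter of 2; the abstract does not state the power used or the GIS software, and the function name and station values are hypothetical.

```python
import math


def idw_estimate(stations, target, power=2.0):
    """Estimate a value at `target` by inverse distance weighting.

    stations: list of ((x, y), value) pairs, e.g. ground PM2.5 measurements.
    target:   (x, y) location to interpolate.
    power:    distance exponent (2.0 is an assumed, commonly used default).
    """
    if not stations:
        raise ValueError("at least one station is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in stations:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # target coincides with a station: use its measurement
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical stations: two measurements placed two units apart.
stations = [((0.0, 0.0), 30.0), ((2.0, 0.0), 50.0)]
midpoint = idw_estimate(stations, (1.0, 0.0))
print(midpoint)
```

At the midpoint both stations carry equal weight, so the estimate is their plain average; moving the target toward either station pulls the estimate toward that station's value.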
- Research Article
9
- 10.1080/15397734.2020.1763184
- May 18, 2020
- Mechanics Based Design of Structures and Machines
In this work, a new attempt has been made using machine learning algorithms for assessing failure mode of austempered ductile iron perforated plates. This aims at providing some insights into these problems by comparing the performance of machine learning models which are part of artificial intelligence. The ballistic performance could be assessed by k-nearest neighbors (KNN), support vector machine (SVM), logistic regression, and decision tree (DT) algorithms. Precision of KNN, SVM, logistic regression and DT models is found to be 0.75, 0.75, 0.8, and 1, respectively. F1 score of KNN, SVM, logistic regression and DT models is found to be 0.86, 0.86, 0.89, and 1, respectively for smooth bulge formation. Eventually, the DT model is established and the optimal prediction model is derived by fine-tuning the parameters.
- Research Article
6
- 10.1002/hsr2.2266
- Jul 1, 2024
- Health science reports
Death due to COVID-19 is one of the biggest health challenges in the world. There are many models that can predict death due to COVID-19. This study aimed to fit and compare Decision Tree (DT), Support Vector Machine (SVM), and AdaBoost models to predict death due to COVID-19. To describe the variables, mean (SD) and frequency (%) were reported. To determine the relationship between the variables and death caused by COVID-19, the chi-square test was performed with a significance level of 0.05. The DT, SVM, and AdaBoost models for predicting death due to COVID-19 were compared in terms of sensitivity, specificity, accuracy, and the area under the receiver operating characteristic (ROC) curve, using R software with the psych, caTools, random over-sampling examples, rpart, and rpartplot packages. Out of the total of 23,054 patients studied, 10,935 cases (46.5%) were women, and 12,569 cases (53.5%) were men. Additionally, the mean age of the patients was 54.9 ± 21.0 years. There was a statistically significant relationship between the death of COVID-19 patients and gender, fever, cough, muscle pain, smell and taste, abdominal pain, nausea and vomiting, diarrhea, anorexia, dizziness, chest pain, intubation, cancer, diabetes, chronic blood disease, violation of immunity, pregnancy, dialysis, and chronic lung disease (p < 0.05). The results showed that the sensitivity, specificity, accuracy and the area under the receiver operating characteristic curve were respectively 0.60, 0.68, 0.71, and 0.75 in the DT model, 0.54, 0.62, 0.63, and 0.71 in the SVM model, and 0.59, 0.65, 0.69 and 0.74 in the AdaBoost model. The results showed that DT had a high predictive power compared to other data mining models. Therefore, it is suggested that researchers in different fields use DT to predict the studied variables. It is also suggested to use other approaches such as random forest or XGBoost to improve the accuracy in future studies.
- Research Article
- 10.1080/19479832.2025.2498349
- Apr 29, 2025
- International Journal of Image and Data Fusion
Machine learning (ML) algorithms are used to estimate vegetation cover in humid environments. However, their effectiveness in arid and semiarid environments remains limited, particularly in rugged mountainous areas. This study aimed to evaluate the effectiveness of ML algorithms in estimating and classifying montane vegetation cover and monitoring spatiotemporal changes in vegetation cover from 2005 to 2020 using high-resolution satellite images (SPOT-5/6/7). All data were processed and analysed using the maximum likelihood classifier (MLC), random forest (RF), and support vector machine (SVM) algorithms, and accuracy assessment. The results of the ML algorithms did not significantly differ because the highest agreement percentage between the RF and SVM algorithms was 90.27%. The vegetation coverage of the study area was 37.93% according to the SVM algorithm. Approximately 21.13% of the region’s total area was covered with scattered low green plants and the vegetation cover was highest in 2005 and lowest in 2010. The SVM algorithm classified the vegetation cover types with high efficiency and divided them into forests, agricultural lands with limited spread, shrubs, and grasses, covering 27%. This research highlights the necessity of remote sensing and ML techniques in monitoring montane vegetation cover and evaluating their effectiveness in improving accuracy.
- Conference Article
4
- 10.1109/icscds56580.2023.10104831
- Mar 23, 2023
Banks make the majority of their income from loans. A lot of individuals apply for loans, and it is difficult to choose the real candidate who will repay the loan. A lot of misunderstandings may occur when selecting the real applicant when the process is done manually. As a result, a loan prediction system based on machine learning is developed, in which the system will automatically identify the qualified candidates. This is beneficial to both the bank personnel and the applicant. The loan approval process will be greatly shortened. The loan data is predicted by using the hybrid model of Naive Bayes (NB) and Decision Tree (DT) algorithms. First, the dataset is given to the three classification algorithms, Support Vector Machine (SVM), NB, and DT, and the prediction is done with these three algorithms. The accuracy of each of these three is used to assess performance. The creation of the hybrid model increases accuracy. The dataset is given to NB for training and the prediction of NB is given to the DT algorithm for training. Test data are sent to the model for prediction after training. The model is evaluated, and the performance is measured in terms of different metrics from sklearn metrics. This prediction of loan range is useful for bank staff to give the loan amount accordingly. The NB algorithm checks for equality and independence of all the features in the dataset. In the DT algorithm, the tree is constructed based on the information gain value. The attribute with the highest information gain value is placed as the root node, and the other nodes are also constructed based on information gain value. The proposed hybrid model predicts yes or no, and based on the prediction, it is specified whether the loan is to be sanctioned or denied for the applicant.
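The root-node rule described in the abstract, placing the attribute with the highest information gain at the top of the tree, can be illustrated with a small Python sketch. It assumes categorical attributes and entropy-based gain; the attribute names and the four example rows are hypothetical and are not taken from the paper's loan dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def choose_root(rows, labels):
    """Return the attribute with the highest information gain."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))


# Hypothetical loan records: credit_history separates the classes perfectly,
# while married carries no information about the outcome.
rows = [
    {"credit_history": "good", "married": "yes"},
    {"credit_history": "good", "married": "no"},
    {"credit_history": "bad", "married": "yes"},
    {"credit_history": "bad", "married": "no"},
]
labels = ["yes", "yes", "no", "no"]
root = choose_root(rows, labels)
print(root)
```

In this toy data the split on `credit_history` removes all uncertainty (gain of 1 bit) and the split on `married` removes none, so `credit_history` is selected as the root.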
- Research Article
2
- 10.1111/tgis.13265
- Oct 15, 2024
- Transactions in GIS
ABSTRACT: The data required for sustainable forest planning is provided by traditional forest inventories, which are labor, time, and cost‐intensive. Providing this data quickly, reliably, and accurately is crucial for planners and researchers. The objective of this study was to predict stand basal area (BA), stand volume (V), and quadratic mean diameter (dq) by leveraging vegetation indices (VIs) and reflectance (R) derived from Landsat 8 OLI and Sentinel 2 satellite images, along with topographic (T) data obtained from ALOS‐PALSAR satellite imagery. Forest inventory data for a total of 250 sample plots were used for modeling in the study. Stand parameters were estimated using support vector machines (SVM), multiple linear regression (MLR), decision tree (DT), and random forest (RF) algorithms. In modeling V, BA, and dq, both individual and combinations of R, VIs, and T values obtained from satellite imagery were used as independent variables. Using the generated datasets, each of the stand parameters was modeled separately with MLR, SVM, RF, and DT algorithms, and the success of the models was compared to determine the modeling technique and dataset with the highest success for the relevant parameter. The results showed that for each stand parameter, the highest model success was achieved in the combined dataset, which was created by combining all datasets. However, in terms of modeling techniques, the highest success for each stand parameter was achieved with different modeling techniques. The highest success for V is obtained in the model using the SVM method (R² = 0.78; RMSE = 0.28 m³/ha), the RF method yielded the highest model performance for BA (R² = 0.70; RMSE = 2.53 m²/ha), and finally, the highest success for dq was obtained in the DT method (R² = 0.74; RMSE = 0.02 cm). In general, the datasets obtained from Sentinel 2 images showed higher model success than the datasets obtained from Landsat 8 OLI images.
- Research Article
- 10.24014/ijaidm.v6i1.19966
- Apr 11, 2023
- Indonesian Journal of Artificial Intelligence and Data Mining
The spread of COVID-19 in Indonesia has caused many negative impacts. Therefore, the government is taking vaccination measures to suppress the spread of COVID-19. Public response to vaccinations on Twitter has been mixed, with some supporting it and some not. The data for this study comes from the Twitter feed of the drone portal Emprit Academy (dea). Classification is performed using the SVM, decision tree and Naive Bayes algorithms. The purpose of this study is to inform the public about whether opinion on vaccination against COVID-19 is inclined toward positive, neutral, or negative. Moreover, this study compares the accuracy of the three algorithms used, namely Naive Bayes (NB), Support Vector Machine (SVM) and Decision Tree, with validation performed using the K-Fold Cross-Validation method, AdaBoost feature selection, and the TF-IDF Transformer feature extraction test. The result obtained from this study is that, with the 90:10 data split, accuracy reached 82.86% for the SVM algorithm, 81.43% for Naive Bayes, and 78.57% for the decision tree.
- Research Article
- 10.30871/jaic.v9i6.9806
- Dec 6, 2025
- Journal of Applied Informatics and Computing
The widespread adoption of digital wallets like LinkAja in Indonesia has led to a surge in user-generated reviews, which are valuable for assessing service quality. This study compares the classification performance of Support Vector Machine (SVM) and Decision Tree algorithms on user reviews from the LinkAja application. A total of 7,000 reviews were gathered through web scraping and processed with standard text cleaning, tokenization, stopword removal, and stemming, resulting in 6,261 usable entries. These were divided into training and testing sets in a 70:30 ratio. The performance of each algorithm was evaluated both before and after the application of the Synthetic Minority Oversampling Technique (SMOTE) to address class imbalance. Prior to SMOTE, SVM recorded an accuracy of 77.97%, precision of 0.74, recall of 0.33, and F1 score of 0.45, while Decision Tree reached 72.01% accuracy, 0.50 precision, 0.62 recall, and 0.55 F1 score. After SMOTE, SVM accuracy slightly improved to 78.29%, with notable increases in recall (0.74) and F1 score (0.60); Decision Tree also saw an accuracy rise to 74.56% but experienced a slight decline in F1 score to 0.52. These findings demonstrate that SVM, particularly when used with SMOTE, offers better overall performance and class balance in classifying reviews with imbalanced sentiment distribution, making it more suitable than Decision Tree for this application.
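The F1 scores quoted above are the harmonic mean of precision and recall, which is why the low pre-SMOTE recall of SVM drags its F1 down despite good precision. A minimal Python sketch of that relationship, using only the rounded pre-SMOTE precision and recall values from the abstract (the function name is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Pre-SMOTE precision and recall as reported in the abstract (rounded values).
svm_before = f1_score(0.74, 0.33)
dt_before = f1_score(0.50, 0.62)
print(round(svm_before, 2), round(dt_before, 2))
```

Recomputing from the rounded inputs gives roughly 0.46 for SVM and 0.55 for Decision Tree. The small gap from the reported SVM value of 0.45 is within what rounding of precision and recall to two decimals can produce.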