Optimizing Text Classification Using AdaBoost Ensemble Techniques with Decision Tree Algorithms
This study presents an optimized text classification framework combining AdaBoost ensemble techniques with Decision Tree algorithms (ID3, C4.5, CART) to address critical challenges in small dataset scenarios (n=795 Indonesian-language reviews). Employing rigorous five-fold stratified cross-validation (random seed=42), we implemented a comprehensive preprocessing pipeline including case normalization, language-specific stemming, and TF-IDF feature extraction. The ensemble model utilized 50 AdaBoost iterations with a learning rate of 1.0, evaluated through multiple performance metrics while accounting for class imbalance effects. Key results demonstrate significant performance enhancements, with the C4.5+AdaBoost configuration achieving 96.72% accuracy (±0.88), representing a 10.6 percentage point improvement over the base C4.5 algorithm. The ensemble approach particularly improved minority class identification, boosting positive sentiment classification F1-scores by 0.28 points while maintaining exceptional neutral sentiment detection (F1-score 0.99±0.00). Comparative analysis revealed consistent advantages across all Decision Tree variants, with accuracy improvements of 18.6% for ID3, 10.6% for C4.5, and 14.2% for CART, alongside reduced performance variance (62-73% decrease). While these findings validate AdaBoost's effectiveness for enhancing Decision Tree stability in small-scale text classification, the study acknowledges limitations regarding sample size constraints and language specificity. The research contributes practical methodologies for sentiment analysis applications while emphasizing the need for validation on larger, more diverse datasets. Future work should explore comparative benchmarking against transformer architectures, advanced feature representation techniques, and multilingual generalization testing. This work provides a reproducible framework for developing robust, ensemble-based text classification systems in resource-constrained scenarios.
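The boosting mechanism summarized above (50 iterations, learning rate 1.0) can be illustrated in miniature. The following sketch is a generic discrete AdaBoost loop over one-feature decision stumps, not the study's implementation; it only shows how misclassified samples gain weight between rounds.

```python
import math

def best_stump(X, y, w):
    """Find the threshold stump minimising weighted error on one feature."""
    best = None
    for thr in sorted(set(X)):
        for sign in (1, -1):
            pred = [sign if x >= thr else -sign for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(X, y, n_rounds=50, learning_rate=1.0):
    """Discrete AdaBoost for labels in {-1, +1} on a single feature."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, thr, sign = best_stump(X, y, w)
        err = max(err, 1e-10)
        if err >= 0.5:               # weak learner no better than chance
            break
        alpha = learning_rate * 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, sign))
        # Re-weight: misclassified samples gain weight for the next round.
        w = [wi * math.exp(-alpha * yi * (sign if x >= thr else -sign))
             for wi, x, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * (sign if x >= thr else -sign)
                for alpha, thr, sign in ensemble)
    return 1 if score >= 0 else -1

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [-1, -1, -1, 1, 1, 1]
model = adaboost(X, y, n_rounds=50, learning_rate=1.0)
print([predict(model, x) for x in X])  # separable toy data: matches y
```

In the study's setting the weak learner would be a full ID3/C4.5/CART tree over TF-IDF features rather than a single-feature stump, but the reweighting loop is the same.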
- Book Chapter
- 10.1007/978-3-030-03991-2_41
- Jan 1, 2018
Test and evaluation is a process used to determine whether a product/system satisfies its performance specifications across its entire operating regime. The operating regime is typically defined using factors such as types of terrains/sea-states/altitudes, weather conditions, operating speeds, etc., and involves multiple performance metrics. With each test being expensive to conduct and with multiple factors and performance metrics under consideration, the design of a test and evaluation schedule is far from trivial. Design of experiments (DOE) continues to be the most prevalent approach to deriving test plans, although there is significant opportunity to improve this practice through optimization. In this paper, we introduce a surrogate-assisted optimization approach to uncover the performance envelope with a small number of tests. The approach relies on principles of decomposition to deal with multiple performance metrics and employs bi-directional search along each reference vector to identify the best and worst performance simultaneously. To limit the number of tests, the search is guided by multiple surrogate models. At every iteration the approach delivers a test plan involving at most \(K_T\) tests, and the information acquired is used to generate future test plans. In order to evaluate the performance of the proposed approach, a set of scalable test functions with various Pareto front characteristics and objective-space bias is introduced. The performance of the approach is quantitatively assessed and compared with two popular DOE strategies, namely Latin Hypercube Sampling (LHS) and Full Factorial Design (FFD). Further, we also demonstrate its practical use on a simulated catapult system.
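Of the two DOE baselines, Latin Hypercube Sampling is simple to sketch. The function below is a minimal generic implementation (one jittered sample per equal-width stratum in every dimension), not code from the paper.

```python
import random

def latin_hypercube(n_samples, n_dims, rng=None):
    """Draw n_samples points in [0,1)^n_dims with exactly one sample
    per equal-width stratum along every dimension."""
    rng = rng or random.Random()
    sample = [[0.0] * n_dims for _ in range(n_samples)]
    for d in range(n_dims):
        # One jittered point per stratum, then shuffled across samples.
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(strata)
        for i in range(n_samples):
            sample[i][d] = strata[i]
    return sample

pts = latin_hypercube(5, 2, random.Random(42))
# Each dimension has exactly one point in each of the 5 strata.
for d in range(2):
    print(sorted(int(p[d] * 5) for p in pts))  # [0, 1, 2, 3, 4]
```

Full Factorial Design, by contrast, would enumerate every combination of factor levels, which is why its test count grows exponentially with the number of factors.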
- Research Article
44
- 10.3390/rs15010016
- Dec 21, 2022
- Remote Sensing
Sembilang National Park, one of the best and largest mangrove areas in Indonesia, is very vulnerable to disturbance by community activities. Changes in the dynamic condition of mangrove forests in Sembilang National Park must therefore be monitored quickly and easily. One way to monitor mangrove forests is to use remote sensing technology. Recently, machine-learning classification techniques have been widely used to classify mangrove forests. This study aims to investigate the ability of the decision tree (DT) and random forest (RF) machine-learning algorithms to determine the mangrove forest distribution in Sembilang National Park. The satellite data used are Landsat-7 ETM+ acquired on 30 June 2002 and Landsat-8 OLI acquired on 9 September 2019, as well as supporting data such as SPOT 6/7 imagery acquired in 2020–2021, the MERIT DEM, and an existing mangrove map. Pre-processing includes radiometric and atmospheric corrections performed using the semi-automatic classification plugin in Quantum GIS. We applied the decision tree and random forest algorithms to classify the mangrove forest. In the DT algorithm, threshold analysis is carried out to obtain the most optimal threshold value for distinguishing mangrove and non-mangrove objects. The DT and RF algorithms involve several important parameters, namely the normalized difference moisture index (NDMI), normalized difference soil index (NDSI), near-infrared (NIR) band, and digital elevation model (DEM) data. The DT and RF classification results from the Landsat-7 ETM+ and Landsat-8 OLI images show similar mangrove spatial distributions. The DT classification algorithm with the parameter combination NDMI + NDSI + DEM is very effective for classifying the Landsat-7 ETM+ image, while the combination NDMI + NIR is very effective for classifying the Landsat-8 OLI image.
The RF classification algorithm with the image (6 bands) as input, the number of trees = 100, the number of predictor variables (mtry) set to the square root of the number of variables (√k), and the minimum node size = 6 provides the highest overall accuracy for the Landsat-7 ETM+ image, while combining the image (7 bands) + NDMI + NDSI + DEM with the number of trees = 100, mtry = all variables (k), and the minimum node size = 6 provides the highest overall accuracy for the Landsat-8 OLI image. The overall classification accuracy is higher when using the RF algorithm (99.12%) instead of DT (92.82%) for the Landsat-7 ETM+ image, but slightly higher when using the DT algorithm (98.34%) instead of the RF algorithm (97.79%) for the Landsat-8 OLI image. Overall, the RF classification algorithm outperforms DT because all RF classification model parameters provide higher producer accuracy in mapping mangrove forests. This development of the classification method should support mangrove monitoring and rehabilitation programs more quickly and easily, particularly in Indonesia.
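The DT side of this workflow amounts to thresholding spectral indices and elevation. The sketch below uses a common NDMI formulation, (NIR − SWIR)/(NIR + SWIR), with illustrative threshold values; the paper's calibrated thresholds and exact band combinations are not reproduced here.

```python
def ndmi(nir, swir):
    """Normalized difference moisture index for one pixel."""
    denom = nir + swir
    return 0.0 if denom == 0 else (nir - swir) / denom

def classify_pixel(nir, swir, elevation_m,
                   ndmi_threshold=0.2, max_elevation_m=10.0):
    """Toy two-rule decision tree: moist vegetation at low elevation is
    labelled mangrove. Thresholds here are illustrative stand-ins, not
    the values calibrated in the study."""
    if ndmi(nir, swir) >= ndmi_threshold and elevation_m <= max_elevation_m:
        return "mangrove"
    return "non-mangrove"

print(classify_pixel(nir=0.45, swir=0.20, elevation_m=3.0))  # mangrove
print(classify_pixel(nir=0.30, swir=0.35, elevation_m=3.0))  # non-mangrove
```

The DEM term captures the same intuition as in the study: mangroves occur in low-lying coastal terrain, so elevation helps separate them from spectrally similar inland vegetation.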
- Research Article
26
- 10.3390/ijgi9050329
- May 19, 2020
- ISPRS International Journal of Geo-Information
Decision tree (DT) algorithms are important non-parametric tools used for land cover classification. While different DTs have been applied to Landsat land cover classification, their individual classification accuracies and performance have not been compared, especially their effectiveness in producing accurate thresholds for developing rulesets for object-based land cover classification. Here, the focus was on comparing the performance of five DT algorithms: Tree, C5.0, Rpart, Ipred, and Party. These DT algorithms were used to classify ten land cover classes using Landsat 8 images of the Copperbelt Province of Zambia. Classification was done using object-based image analysis (OBIA) through the development of rulesets with thresholds defined by the DTs. The performance of the DT algorithms was assessed based on: (1) DT accuracy through cross-validation; (2) land cover classification accuracy of thematic maps; and (3) other structural properties such as the sizes of the tree diagrams and variable selection abilities. The results indicate that only the rulesets developed from DT algorithms with simple structures and a minimum number of variables produced high land cover classification accuracies (overall accuracy > 88%). Thus, algorithms such as Tree and Rpart produced higher classification accuracies than the C5.0 and Party DT algorithms, which involve many variables in classification. This high accuracy is attributed to the Tree and Rpart DTs' ability to minimize overfitting and handle noise in the data during training. The study produced new insights on the formal selection of DT algorithms for OBIA ruleset development. Therefore, the Tree and Rpart algorithms could be used for developing rulesets because they produce high land cover classification accuracies and have simple structures.
As an avenue of future studies, the performance of DT algorithms can be compared with contemporary machine-learning classifiers (e.g., Random Forest and Support Vector Machine).
- Research Article
1
- 10.1186/s13677-023-00542-3
- Nov 21, 2023
- Journal of Cloud Computing
In practical data mining, a wide range of classification algorithms is employed for prediction tasks. However, selecting the best algorithm poses a challenging task for machine learning practitioners and experts, primarily due to the inherent variability in the characteristics of classification problems, referred to as datasets, and the unpredictable performance of these algorithms. Dataset characteristics are quantified in terms of meta-features, while classifier performance is evaluated using various performance metrics. The assessment of classifiers through empirical methods across multiple classification datasets, while considering multiple performance metrics, presents a computationally expensive and time-consuming obstacle in the pursuit of selecting the optimal algorithm. Furthermore, the scarcity of sufficient training data, denoted by dimensions representing the number of datasets and the feature space described by meta-feature perspectives, adds further complexity to the process of algorithm selection using classical machine learning methods. This research paper presents an integrated framework called eML-CBR that combines edge-ML and case-based reasoning methodologies to accurately address the algorithm selection problem. It adapts a multi-level, multi-view case-based reasoning methodology that considers data from diverse feature dimensions, evaluates algorithms from multiple performance aspects, and distributes computations to both cloud edges and centralized nodes. On the edge, the first-level reasoning employs machine learning methods to recommend a family of classification algorithms, while at the second level, it recommends a list of the top-k algorithms within that family. This list is further refined by an algorithm conflict resolver module.
The eML-CBR framework offers a suite of contributions, including integrated algorithm selection, multi-view meta-feature extraction, innovative performance criteria, improved algorithm recommendation, data scarcity mitigation through incremental learning, and an open-source CBR module, reshaping research paradigms. The CBR module, trained on 100 datasets and tested with 52 datasets using 9 decision tree algorithms, achieved an accuracy of 94% for correct classifier recommendations within the top k=3 algorithms, making it highly suitable for practical classification applications.
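The case-based retrieval step (matching a new dataset to past cases by meta-features and returning the top-k algorithms) can be sketched as a nearest-case lookup. The case base, meta-feature values, and algorithm names below are hypothetical placeholders, not eML-CBR's actual case structure.

```python
import math

# Hypothetical case base: meta-feature vector -> ranked algorithm list.
# Meta-features here (illustrative): n_instances, n_attributes, class entropy.
CASE_BASE = [
    ([100.0, 5.0, 0.2], ["C4.5", "CART", "RandomTree"]),
    ([5000.0, 40.0, 0.7], ["CART", "REPTree", "C4.5"]),
    ([300.0, 8.0, 0.3], ["C4.5", "RandomTree", "CART"]),
]

def recommend_top_k(meta_features, k=3):
    """Return the top-k algorithms from the most similar past case,
    using Euclidean distance over the meta-feature vectors."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best_case = min(CASE_BASE, key=lambda case: dist(case[0], meta_features))
    return best_case[1][:k]

print(recommend_top_k([250.0, 7.0, 0.25], k=3))
```

A real deployment would normalise the meta-features before computing distances and aggregate over several nearest cases rather than one; this sketch shows only the retrieve-and-reuse shape of the reasoning.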
- Research Article
106
- 10.1016/j.eswa.2011.01.042
- Jan 24, 2011
- Expert Systems with Applications
AdaBoost ensemble for financial distress prediction: An empirical comparison with data from Chinese listed companies
- Research Article
73
- 10.1109/tsmcb.2008.923529
- Oct 1, 2008
- IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)
Traditional decision tree algorithms face the problem of having sharp decision boundaries, which are hardly found in any real-life classification problem. A fuzzy supervised learning in Quest (SLIQ) decision tree (FS-DT) algorithm is proposed in this paper. It is aimed at constructing a fuzzy decision boundary instead of a crisp decision boundary. The size of the constructed decision tree is another very important parameter in decision tree algorithms: a large, deep decision tree results in incomprehensible induction rules. The proposed FS-DT algorithm modifies the SLIQ decision tree algorithm to construct a fuzzy binary decision tree of significantly reduced size. The performance of the FS-DT algorithm is compared with SLIQ using several real-life datasets taken from the UCI Machine Learning Repository. The FS-DT algorithm outperforms its crisp counterpart in terms of classification accuracy. FS-DT also results in more than 70% reduction in size of the decision tree compared to SLIQ.
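The contrast between a crisp and a fuzzy decision boundary can be shown with a single split. The linear transition band below is a generic fuzzification sketch, not the specific membership functions used by FS-DT.

```python
def crisp_left(x, threshold):
    """Crisp split: full membership on one side of the threshold, none
    on the other."""
    return 1.0 if x < threshold else 0.0

def fuzzy_left(x, threshold, width):
    """Fuzzy split: membership in the 'left' branch decays linearly
    across a transition band of the given width around the threshold."""
    lo, hi = threshold - width / 2, threshold + width / 2
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / width

# Near the boundary the fuzzy split assigns partial membership to both
# branches, instead of flipping abruptly at the threshold.
for x in (3.0, 4.9, 5.0, 5.1, 7.0):
    print(x, crisp_left(x, 5.0), round(fuzzy_left(x, 5.0, 1.0), 2))
```

In a fuzzy tree, a sample near a split point contributes to both subtrees with these membership weights, which is what softens the sharp boundaries the abstract describes.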
- Research Article
18
- 10.1007/s13246-021-00970-y
- Jan 12, 2021
- Physical and engineering sciences in medicine
The present paper proposes a smart framework for detection of epileptic seizures using the concepts of IoT technologies, cloud computing and machine learning. This framework processes the acquired scalp EEG signals by Fast Walsh Hadamard transform. Then, the transformed frequency-domain signals are examined using higher-order spectral analysis to extract amplitude and entropy-based statistical features. The extracted features have been selected by means of correlation-based feature selection algorithm to achieve more real-time classification with reduced complexity and delay. Finally, the samples containing selected features have been fed to ensemble machine learning techniques for classification into several classes of EEG states, viz. normal, interictal and ictal. The employed techniques include Dagging, Bagging, Stacking, MultiBoost AB and AdaBoost M1 algorithms in integration with C4.5 decision tree algorithm as the base classifier. The results of the ensemble techniques are also compared with standalone C4.5 decision tree and SVM algorithms. The performance analysis through simulation results reveals that the ensemble of AdaBoost M1 and C4.5 decision tree algorithms with higher-order spectral features is an adequate technique for automated detection of epileptic seizures in real-time. This technique achieves 100% classification accuracy, sensitivity and specificity values with optimally small classification time.
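The first processing step, the Fast Walsh-Hadamard transform, can be written in a few lines. This is the standard unnormalised in-place butterfly formulation, independent of the toolchain used in the paper.

```python
def fwht(signal):
    """Fast Walsh-Hadamard transform (unnormalised; length must be a
    power of two). Returns a new list, computed with the O(n log n)
    butterfly scheme."""
    a = list(signal)
    n = len(a)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, h * 2):
            for i in range(start, start + h):
                x, y = a[i], a[i + h]
                a[i], a[i + h] = x + y, x - y  # butterfly: sum and difference
        h *= 2
    return a

print(fwht([1, 0, 1, 0, 0, 1, 1, 0]))  # [4, 2, 0, -2, 0, 2, 0, 2]
```

The transform is an involution up to scale (applying it twice returns the input times n), which makes it cheap to invert; in the paper's pipeline its coefficients feed the higher-order spectral feature extraction.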
- Research Article
63
- 10.1109/tevc.2013.2291813
- Dec 1, 2014
- IEEE Transactions on Evolutionary Computation
Decision-tree induction algorithms are widely used in machine learning applications in which the goal is to extract knowledge from data and present it in a graphically intuitive way. The most successful strategy for inducing decision trees is the greedy top-down recursive approach, which has been continuously improved by researchers over the past 40 years. In this paper, we propose a paradigm shift in the research of decision trees: instead of proposing a new manually designed method for inducing decision trees, we propose automatically designing decision-tree induction algorithms tailored to a specific type of classification data set (or application domain). Following recent breakthroughs in the automatic design of machine learning algorithms, we propose a hyper-heuristic evolutionary algorithm called hyper-heuristic evolutionary algorithm for designing decision-tree algorithms (HEAD-DT) that evolves design components of top-down decision-tree induction algorithms. By the end of the evolution, we expect HEAD-DT to generate a new and possibly better decision-tree algorithm for a given application domain. We perform extensive experiments in 35 real-world microarray gene expression data sets to assess the performance of HEAD-DT, and compare it with very well known decision-tree algorithms such as C4.5, CART, and REPTree. Results show that HEAD-DT is capable of generating algorithms that significantly outperform the baseline manually designed decision-tree algorithms regarding predictive accuracy and F-measure.
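The hyper-heuristic idea of evolving the design components of an induction algorithm, rather than a single tree, can be sketched with a toy genetic loop. The design space and fitness function below are illustrative stand-ins only; HEAD-DT's actual building blocks and its fitness (cross-validated performance on meta-training sets) are far richer.

```python
import random

# Hypothetical design space for a top-down induction algorithm.
DESIGN_SPACE = {
    "split_criterion": ["info_gain", "gain_ratio", "gini"],
    "min_samples_split": [2, 5, 10, 20],
    "pruning": ["none", "reduced_error", "pessimistic"],
}

def toy_fitness(design):
    """Stand-in for cross-validated accuracy on a meta-training set."""
    score = {"gain_ratio": 3, "info_gain": 2, "gini": 1}[design["split_criterion"]]
    score += {"reduced_error": 2, "pessimistic": 1, "none": 0}[design["pruning"]]
    score -= abs(design["min_samples_split"] - 5) * 0.1
    return score

def evolve(generations=30, pop_size=10, rng=None):
    """Elitist evolution over algorithm designs: keep the best half,
    refill with one-component mutations of the survivors."""
    rng = rng or random.Random(0)
    sample = lambda: {k: rng.choice(v) for k, v in DESIGN_SPACE.items()}
    pop = [sample() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=toy_fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        for parent in survivors:
            child = dict(parent)
            gene = rng.choice(list(DESIGN_SPACE))  # mutate one component
            child[gene] = rng.choice(DESIGN_SPACE[gene])
            children.append(child)
        pop = survivors + children
    return max(pop, key=toy_fitness)

best = evolve()
print(best)  # best induction-algorithm design found under the toy fitness
```

The output of the loop is not a tree but a recipe for building trees, mirroring the paradigm shift the abstract describes.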
- Book Chapter
4
- 10.3233/apc220031
- Nov 3, 2022
The aim of this work is to predict coronary heart disease with a novel Decision Tree (DT) algorithm in comparison with k-Nearest Neighbor (KNN), using the Cleveland dataset. Coronary disease forecasting is performed by applying the Decision Tree (N=20) and k-Nearest Neighbor (N=20) algorithms. The Decision Tree algorithm uses a tree structure to make decisions; k-Nearest Neighbor is a simple approach to solving regression and classification problems. The Cleveland heart dataset is utilized for identification and prediction. The data consist of 76 attributes; however, only 14 features that help in diagnosing whether a patient is healthy or affected are selected. The accuracy of cardiovascular risk prediction is 68.9% using k-NN and 81.9% using the Decision Tree. There is a statistically significant difference between DT and k-NN, with p=0.035 (p<0.05). The Decision Tree algorithm appears to perform significantly better than the k-Nearest Neighbor algorithm for heart disease prediction.
- Conference Article
2
- 10.1109/iconstem56934.2023.10142280
- Apr 6, 2023
As a direct outcome of this research, the accuracy of house price projections is expected to be enhanced by using a novel decision tree algorithm rather than linear regression (LR). The Decision Tree algorithm with N=10 iterations is used to generate the prediction. The sample size is determined with a G*Power calculator, with a cutoff of 80% decided upon as the minimum requirement for sufficient analytical power. Group 1 uses the Linear Regression method and Group 2 the novel Decision Tree algorithm. The pre-test power analysis used a 95% confidence interval with 80% power, an alpha value of 0.05, and a beta value of 0.2, with a total of twenty participants in the study. The accuracy of the novel Decision Tree (DT) algorithm was 90%, while the accuracy of the Linear Regression algorithm was 80%. Statistical analysis carried out with SPSS showed that the difference in accuracy was not significant: p=0.618 (p>0.05). The innovative Decision Tree algorithm nevertheless outperforms the Linear Regression approach when it comes to estimating the future value of real estate.
- Research Article
- 10.1109/jbhi.2025.3539710
- Oct 1, 2025
- IEEE journal of biomedical and health informatics
Intrinsically disordered regions (IDRs) of proteins are crucial for a wide range of biological functions, with molecular recognition features (MoRFs) being of particular significance in protein interactions and cellular regulation. However, the identification of MoRFs has been a significant challenge in computational biology owing to their disorder-to-order transition properties. Currently, only a limited number of experimentally validated MoRFs are known, which has prompted the development of computational methods for predicting MoRFs from protein chains. Considering the limitations of existing MoRF predictors regarding prediction accuracy and adaptability to diverse protein sequence lengths, this study introduces Trans-MoRFs, a novel MoRF predictor based on the transformer architecture, for identifying MoRFs within IDRs of proteins. Trans-MoRFs employ the self-attention mechanism of the transformer to efficiently capture the interactions of distant residues in protein sequences. They demonstrate stability and high efficiency in dealing with protein sequences of different lengths and perform well on both short and long sequences. On multiple benchmark datasets, the model attained a mean area under the curve score of 0.94, which is higher than those of all existing models, and significantly outperformed existing combined and single MoRF prediction tools on multiple performance metrics. Trans-MoRFs have excellent accuracy and a wide range of applications for predicting MoRFs and other functionally important fragments in the disordered regions of proteins. They offer significant assistance in comprehending protein functions, precisely pinpointing functional segments within disordered protein regions and facilitating the discovery of novel drug targets.
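The self-attention mechanism credited here with capturing distant-residue interactions can be sketched as textbook scaled dot-product attention over toy embeddings; this is not the Trans-MoRFs architecture itself.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: every position attends to every
    other position, so residue i can weight a distant residue j
    directly, regardless of the gap between them."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Three toy residue embeddings; similar embeddings attend to each other
# most strongly, independent of sequence distance.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attended = self_attention(emb, emb, emb)
print([round(v, 3) for v in attended[0]])
```

A real transformer adds learned query/key/value projections, multiple heads, and positional information; the point of the sketch is only that the attention weights link all residue pairs in one step.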
- Research Article
- 10.32736/sisfokom.v13i1.1943
- Feb 12, 2024
- Jurnal Sisfokom (Sistem Informasi dan Komputer)
Graduating on time is what every student wants to accomplish in college. Students of Prof. Dr. Hamka Muhammadiyah University are among those who have this dream. Based on 2020 graduate data from the Tracer Study, 60% said the university had a high enough impact on improving competence. This indicates that the university needs to evaluate improvements in academic quality. Often, students have difficulty finding information about the important factors that support achieving timely graduation. A prediction analysis is needed to provide information about the student's graduation study period. For this analysis, data mining is implemented using the classification function of the decision tree (C4.5) algorithm with RapidMiner tools. The methodology for implementing data mining follows the stages of Knowledge Discovery in Databases (KDD), beginning with data collection, preprocessing, transformation, data mining, and evaluation. The research findings consist of a visualization and decision tree rules that reveal GPA as the most influential factor in determining a student's study period. In addition, 170 students (54.5%) graduated on time (in four years or less) and 142 students (45.6%) did not graduate on time (more than four years). Testing the performance of the decision tree (C4.5) using a confusion matrix in RapidMiner resulted in accuracy reaching 83.87%, with precision of 87.50% and recall of 91.18%. This provides evidence that the decision tree (C4.5) algorithm performs well enough to provide valuable information for predicting student graduation, in order to increase the number of students who graduate within the right study period.
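The reported metrics follow directly from the four cells of a binary confusion matrix. The counts below are illustrative, not the study's actual matrix; they merely show the arithmetic behind accuracy, precision, and recall.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from binary confusion-matrix
    counts (positive class = the outcome being predicted)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many were right
    recall = tp / (tp + fn)      # of actual positives, how many were found
    return accuracy, precision, recall

# Illustrative counts (not the study's matrix); positive = "graduated on time".
acc, prec, rec = binary_metrics(tp=155, fp=22, fn=15, tn=120)
print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%}")
```

Reporting all three together matters here because the two classes (on time vs. late) are close in size, so accuracy alone would hide which class the model misses.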
- Research Article
- 10.18137/cardiometry.2022.25.15841589
- Feb 14, 2023
- CARDIOMETRY
Aim: The main aim of this research is to detect heart plaque using the Decision Tree algorithm with improved accuracy, comparing it with the Least Squares Support Vector Machine. Materials and Methods: The Decision Tree and Least Squares Support Vector Machine algorithms are the two groups compared in this study. Each group has 20 samples, and calculations utilized a pretest power of 0.08 with a 95% confidence interval. The G power for the samples is estimated using ClinCalc, with alpha, power, and enrollment ratio as parameters. The data are split into two groups: a training dataset (n = 489 [70%]) and a test dataset (n = 277 [30%]). Results: The accuracy obtained was 68.13% for the Decision Tree and 67.3% for the Least Squares Support Vector Machine technique. Since p (2-tailed) < 0.05 in the SPSS statistical analysis, a significant difference exists between the two groups. Conclusion: The Decision Tree algorithm is found to be significantly better than the Least Squares Support Vector Machine algorithm for heart plaque disease detection on the dataset considered.
- Research Article
4
- 10.1149/10701.14097ecst
- Apr 24, 2022
- ECS Transactions
The aim of this study is to predict the temperature for the next three days using a machine learning algorithm, and the objective of this research is to improve the accuracy of temperature prediction. Materials and methods: The study used 36 samples in two groups of algorithms, with a g-power value of 80%. The Sliding Window Algorithm (SLA) machine-learning baseline achieved 89% accuracy for temperature prediction, and this study seeks better accuracy with the Decision Tree (DT) algorithm. Results: This study found 92% accuracy for temperature prediction using the DT algorithm. Conclusion: This study concludes that the DT algorithm is significantly better than the SLA for temperature prediction.
- Conference Article
3
- 10.2118/197762-ms
- Nov 11, 2019
Bottom hole pressures are a valuable source of information for reservoir surveillance and management and are at the heart of reservoir engineering. Real-time pressure measurements record pressure data at 5-second intervals, resulting in an enormous accumulation of data. The size and volume of the accumulated data limit the capability of existing analysis software to load and interpret it. This paper presents an improved methodology for data quality checking and data optimization in determining reservoir pressure depletion via an Autoregressive Integrated Moving Average (ARIMA) and Decision Tree model. The dataset was gathered from a representative reservoir in the Malay Basin. The ARIMA algorithm presented was designed for quick and efficient data quality checking. The Decision Tree model, on the other hand, was utilized to select the maximum buildup pressure for the reservoir depletion point via well status parameters. The maximum pressures were selected from buildup data when the decision tree conditions were met. Compared with classical methods, the algorithm obtained around 90% similarity. The resulting data can then be fully optimized for reserve reporting and forecasting studies, i.e., analysis and numerical simulation. The paper also reports on the advantages of applying the ARIMA-Decision Tree algorithm in pressure surveillance, revealing a few key advantages, namely minimizing the need for well intervention and providing an optimized workflow for reservoir engineers to view, utilize, and detect reservoir depletion data. The ARIMA-Decision Tree algorithm is targeted to be installed and integrated in the field historian for better overall data analysis and visualization. The results produced by the ARIMA-Decision Tree algorithm, which consist of reservoir pressure depletion data, will then improve more advanced analyses such as simulation and forecasting in terms of overall speed and accuracy.
In conclusion, this paper presents the importance and application of incorporating a Big Data Analytics algorithm in reservoir management and reporting. In future work, deliverability calculations can be incorporated into the model to identify and rectify any abnormal reservoir behavior.
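The decision-tree step of selecting the maximum buildup pressure once well-status conditions are met reduces to a rule filter over the gauge records. The record fields and the shut-in condition below are assumptions for illustration, not the paper's actual parameters.

```python
def depletion_points(records, min_shut_in_hours=6.0):
    """From high-frequency gauge records, keep the maximum buildup
    pressure of each qualifying shut-in period. This is a rule-based
    stand-in for the paper's decision-tree conditions; the field names
    and the 6-hour minimum are illustrative assumptions."""
    points, current_max = [], None
    for rec in records:
        shut_in = (rec["status"] == "shut-in"
                   and rec["shut_in_hours"] >= min_shut_in_hours)
        if shut_in:
            if current_max is None or rec["pressure_psi"] > current_max["pressure_psi"]:
                current_max = rec
        elif current_max is not None:
            points.append(current_max)   # shut-in period ended
            current_max = None
    if current_max is not None:
        points.append(current_max)
    return points

records = [
    {"status": "flowing", "shut_in_hours": 0.0, "pressure_psi": 2400.0},
    {"status": "shut-in", "shut_in_hours": 8.0, "pressure_psi": 2710.0},
    {"status": "shut-in", "shut_in_hours": 9.0, "pressure_psi": 2725.0},
    {"status": "flowing", "shut_in_hours": 0.0, "pressure_psi": 2380.0},
    {"status": "shut-in", "shut_in_hours": 7.0, "pressure_psi": 2650.0},
]
print([p["pressure_psi"] for p in depletion_points(records)])  # [2725.0, 2650.0]
```

The sequence of selected maxima is what traces the reservoir depletion trend over time, which the ARIMA side of the workflow then quality-checks and forecasts.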