Diffuse glioma tumor subtype classification using CliRad (clinical + radiomics) features
Abstract
Background: Diffuse gliomas such as glioblastoma, astrocytoma, and oligodendroglioma are defined by complex heterogeneity in their imaging and molecular patterns. Traditional diagnosis relies heavily on invasive biopsies and human interpretation of MRI scans, which are subjective and severely limited in capturing the complete volumetric and biological heterogeneity of the tumor. Artificial Intelligence (AI) can potentially enhance glioma subtype classification through the analysis and interpretation of medical imaging data. Machine learning, a branch of AI, is key to discovering patterns and characteristics in the data that enable predictive models for prognosis and diagnosis. AI can process MRI scans to derive tumor shape, size, and texture information, and can mine large datasets to define risk factors and features. This approach may improve the speed and accuracy of glioma subtype diagnosis and facilitate individualized treatment planning.
Aim: In the current study, we evaluated the performance of multiple machine learning classifiers in distinguishing between glioblastoma, astrocytoma, and oligodendroglioma based on radiomics and clinical features. The goals of the study were to identify effective feature selection techniques that enhance the accuracy and reliability of classification models, and to develop a methodology that can be adapted to imaging data collected at our institute, supporting future diagnostic workflows.
Methods: A total of 729 radiomic features were extracted from TCIA for the experimental analysis. A radiomic signature of significant features was created for every sample using the XGBoost decision tree, XGBoost random forest, CatBoost, and LightGBM tree-based feature selection algorithms. The experiments were conducted by training 13 different models.
The hyperparameters of each model were tuned, and performance measures including accuracy, precision, recall, F1-score, and the AUC-ROC curve were compared.
Results: Feature selection with the XGBoost decision tree, XGBoost random forest, CatBoost, and LightGBM tree-based classifiers was used to select the top 21, 37, 71, and 82 features, respectively, based on feature importance scores. The experimental results show that the feature subset produced by the XGBoost decision tree method gave the best performance. A total of 13 classification models were trained and tested, with the CatBoost classifier achieving the best ten-fold validation accuracy of 95.2%, test accuracy of 77%, macro F1-score of 0.716, and AUC of 0.92.
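The tree-based importance ranking described in the Methods can be sketched as follows. This is a minimal illustration only: scikit-learn's GradientBoostingClassifier stands in for the XGBoost/CatBoost/LightGBM rankers used in the study, and the data are synthetic, not the TCIA radiomics features.

```python
# Sketch: select a top-k feature subset by tree-based importance score.
# GradientBoostingClassifier is a stand-in for the study's boosted-tree
# rankers; the dataset is synthetic, not the 729 TCIA radiomic features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

k = 21  # keep the top-k features, mirroring the XGBoost-DT subset size
top_idx = np.argsort(model.feature_importances_)[::-1][:k]
X_selected = X[:, top_idx]
print(X_selected.shape)  # (200, 21)
```

The reduced matrix `X_selected` would then be passed to each of the downstream classifiers for training and comparison.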
- Research Article
- 10.59324/stss.2026.3(1).10
- Jan 1, 2026
- Scientia. Technology, Science and Society
Digital infrastructure has become more complex than ever: with the fast pace of data exchange and the constant development of advanced cybercrime in the digital realm, managing network traffic securely and efficiently has become more challenging than ever before. Conventional rule-based and statistical techniques cannot be expected to keep abreast of changing network characteristics, resulting in false or slow anomaly detection. To overcome this drawback, the current study employs machine learning methods to advance the quality and effectiveness of network anomaly detection as an analytical basis for smart network traffic management. To analyze the effects of various feature selection techniques on the performance of machine learning classifiers, the study uses the NSL-KDD dataset, which consists of labelled samples of both normal and attack traffic. Three feature selection algorithms, Chi-Square (CHI), Correlation-based Selection (CORR), and Feature Importance (FI), were implemented separately, and the resulting feature subsets were used to train three classifiers: K-Nearest Neighbors (KNN), Decision Tree (DT), and Random Forest (RF). Accuracy, F1-score, and training time were the performance indicators used to test the models. The experimental results showed that the FI feature subset performed best in terms of detection performance, and that the Random Forest and Decision Tree classifiers outperformed the KNN model. The optimal combination of the RF classifier with FI feature selection obtained an F1-score of 1.0, which demonstrates a strong ability to differentiate between normal and anomalous network traffic. The findings attest to the importance of feature selection in optimizing machine learning based anomaly detection systems.
The study makes a practical contribution to the implementation of artificial intelligence methods in the management of network traffic, which will be essential in making more flexible, precise, and intelligent network monitoring systems.
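One of the pipelines compared above, Chi-Square selection feeding a Random Forest, can be sketched roughly as follows. Synthetic data stands in for NSL-KDD, and the feature count `k=15` is illustrative, not the study's setting.

```python
# Sketch of a CHI + RF anomaly-detection pipeline. Synthetic two-class data
# stands in for the NSL-KDD normal/attack traffic samples.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=500, n_features=40, n_informative=10,
                           random_state=0)
X = X - X.min(axis=0)  # chi2 requires non-negative feature values

selector = SelectKBest(chi2, k=15).fit(X, y)  # keep 15 highest-scoring features
X_sel = selector.transform(X)

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.2,
                                          stratify=y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
f1 = f1_score(y_te, rf.predict(X_te))
print(round(f1, 3))
```

Swapping `chi2` for a correlation ranking or a model-derived importance score reproduces the CORR and FI variants of the comparison.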
- Research Article
2
- 10.1016/j.ibmed.2024.100151
- Jan 1, 2024
- Intelligence-Based Medicine
Efficient Feature Selection for Classification of Immunotherapy and Medical Treatments Utilising Random Forest and Decision Trees
- Research Article
2
- 10.1080/19361610.2022.2067459
- Apr 27, 2022
- Journal of Applied Security Research
Malware is the term used to describe any malicious software or code that is harmful to systems. New malicious programs appear every day. Machine learning is now being used to classify malware according to its characteristics, because most new malware contains patterns similar to old ones. This paper proposes two feature selection methods based on Genetic Programming (GP) for predicting malware: Genetic Programming-Mean (GPM) and Genetic Programming-Mean Plus (GPMP). The two proposed methods were compared with three widely used, state-of-the-art feature selection techniques: filter-based, wrapper-based, and Chi-square. The results demonstrate that the proposed techniques outperform these state-of-the-art methods in terms of accuracy and F-score. The results also revealed that the proposed methods require less computation time, and hence perform better, than filter-based and wrapper-based feature selection. The proposed methods were evaluated using four datasets, and two classifiers were used to evaluate them: Random Forest and Decision Tree. The Random Forest classifier outperformed the Decision Tree classifier on indicators such as F1-score, recall, and precision. The analysis of results using Random Forest and Decision Tree shows that the proposed method is highly efficient.
- Book Chapter
1
- 10.1007/978-981-16-6605-6_39
- Jan 1, 2022
Global interconnectivity of the Internet infrastructure has increased enormously because of the day-by-day growth of data. As most communication depends on data, data generation and use have become very high, and maintaining these data is the challenging part. For data security, the intrusion detection system (IDS) has become a vital component. Many previously proposed machine learning IDS models relied on a single shallow technique, which was not effective at identifying intrusions with unique patterns. This paper introduces a big data-based hybrid hierarchical model (BDHHM) built on the Apache Spark framework. BDHHM is a hybrid of two hierarchical models, k-means and random forest, and hence increases the detection rate of intrusion attacks. Improved k-means and random forest trees are implemented in this work to identify the unique patterns. The model is also compared with deep learning models such as fully connected networks (FC), CNN, and RNN. This model is effective compared to other models [14], with a true positive rate (TPR) of 96.16%, an accuracy of 95.3%, and a false positive rate of 9.1%.
Keywords: Intrusion detection system (IDS), big data-based hybrid hierarchical model (BDHHM), k-means, random forest tree
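The hybrid hierarchical idea, clustering traffic first and then classifying within each cluster, can be approximated in a few lines. This is a loose sketch only: plain scikit-learn k-means and Random Forest stand in for the paper's improved k-means and Spark-based implementation, and the data are synthetic.

```python
# Sketch of a k-means + per-cluster Random Forest hybrid, loosely following
# the BDHHM idea. Synthetic data stands in for real network traffic; the
# paper's improved k-means and Apache Spark framework are not reproduced.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# One forest per cluster; a sample is routed to its cluster's model.
forests = {c: RandomForestClassifier(random_state=0)
              .fit(X[km.labels_ == c], y[km.labels_ == c])
           for c in range(3)}

def predict(x):
    c = km.predict(x.reshape(1, -1))[0]
    return forests[c].predict(x.reshape(1, -1))[0]

acc = np.mean([predict(X[i]) == y[i] for i in range(len(X))])
print(acc)  # training-set accuracy of the hybrid
```

The clustering stage narrows each forest's input distribution, which is one plausible reading of how the hierarchy helps detect cluster-specific attack patterns.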
- Conference Article
3
- 10.1109/csit56902.2022.10000450
- Nov 10, 2022
The paper proposes a new voting technology for random forest trees: the Positional Approach to the Voting Function Formation (PAVFF). In contrast to existing forms of organizing the voting of random forest trees, the paper proposes to change the subjects of voting, using individual finite elements of the tree as voters, with weights determined in accordance with the competences of the new voting units. Each forest tree in the voting function is represented by its individual branch (voting unit) with the corresponding competence level assigned at the stage of tree finite element verification. Furthermore, we propose different mechanisms for organizing the received units in the voting process. The effectiveness of the new mechanism is shown on the problem of differentiating drug-sensitive and drug-resistant forms of tuberculosis. The task feature space is formed by the textural characteristics of ROIs (regions of interest) on the patient's lung CT scan. The initial feature space was composed of the elements of several textural characteristic matrices. From over half a million input features, a few optimal ensembles were selected to form random forest trees. We used intra- and inter-class variance selection techniques for this purpose, with the final selection made by a genetic algorithm using a combined correlation criterion. After verification of the voting units (finite elements of trees), three variants of voting by competence were formed: by the most competent unit, by weighted average participation of all participants, and by group voting with coefficient revaluation by the Group Method of Data Handling. The results were compared with conventional voting by random forest trees. A 5% improvement in classification quality is shown.
- Research Article
- 10.61173/fvzhe382
- Oct 29, 2024
- Science and Technology of Engineering, Chemistry and Environmental Protection
The primary objective of this study is to evaluate and compare the performance of three machine learning models, Random Forest, XGBoost, and Decision Tree, in the context of fruit and vegetable image classification. This research aims to identify which model best handles the challenges associated with imbalanced datasets and complex data structures. The ultimate goal is to contribute to the development of more efficient and accurate automated systems for agricultural applications, thereby improving productivity and reducing operational costs in the industry. This study utilized a dataset of 3,825 images covering 36 fruit and vegetable classes. Images were resized, normalized, and augmented to enhance diversity. The three models were trained on this dataset, and performance was evaluated using accuracy, precision, recall, and F1-score to assess classification effectiveness and the handling of class imbalances. The evaluation revealed that XGBoost outperformed Random Forest and Decision Tree in fruit and vegetable image classification, achieving the highest accuracy of 96.66%. XGBoost demonstrated superior handling of class imbalances and complex data structures, reflected in its precision and recall scores across the various classes. Random Forest also performed well, closely following XGBoost, while Decision Tree exhibited more variability in its results, indicating potential overfitting in certain classes. In conclusion, this study highlights the effectiveness of ensemble methods, particularly XGBoost, in agricultural image classification tasks. These findings suggest that XGBoost is a robust model for similar applications, offering improved accuracy and reliability.
- Research Article
1
- 10.1186/s12902-025-01873-9
- Mar 27, 2025
- BMC Endocrine Disorders
Background: Hyperglycemic crisis is one of the most common and severe complications of diabetes mellitus and is associated with a high mortality rate. Emergency admissions due to hyperglycemic crisis remain prevalent and challenging. This study aimed to develop and validate predictive models for in-hospital mortality risk among patients with hyperglycemic crisis admitted to the emergency department using various machine learning (ML) methods.
Methods: A multi-center retrospective study was conducted across six large general adult hospitals in Chongqing, western China. Patients diagnosed with hyperglycemic crisis were identified using an electronic medical record (EMR) database. Demographics, comorbidities, clinical characteristics, laboratory results, complications, and therapeutic interventions were extracted from the medical records to construct the prognostic prediction model. Seven machine learning algorithms, including support vector machines (SVM), random forest (RF), recursive partitioning and regression trees (RPART), extreme gradient boosting with dart booster (XGBoost), multivariate adaptive regression splines (MARS), neural network (NNET), and adaptive boosting (AdaBoost), were compared with logistic regression (LR) for predicting the risk of in-hospital mortality in patients with hyperglycemic crisis. Stratified random sampling was used to split the data into training (80%) and validation (20%) sets. Ten-fold cross-validation was performed on the training set to optimize model hyperparameters. The sensitivity, specificity, positive and negative predictive values, area under the curve (AUC), and accuracy of all models were computed for comparative analysis.
Results: A total of 1668 patients were eligible for the present study. The in-hospital mortality rate was 7.3% (121/1668). In the training set, feature importance scores were calculated for each of the eight models, and the top 10 significant features were identified.
In the validation set, all models except MARS demonstrated good predictive capability, with AUC values exceeding 0.9 and F1-scores between 0.632 and 0.81. Six machine learning models outperformed the logistic regression reference; the exception was again the MARS model. Among the selected models, RPART, RF, and SVM achieved the best performance (AUC values of 0.970, 0.968, and 0.968; F1-scores of 0.652, 0.762, and 0.762, respectively). Feature importance analysis identified novel predictors including mechanical ventilation, age, Charlson Comorbidity Index, blood gas index, first 24-hour insulin dosage, and first 24-hour fluid intake.
Conclusion: Most machine learning algorithms exhibited excellent performance in predicting in-hospital mortality among patients with hyperglycemic crisis, the exception being the MARS model, and the best was the RPART model. These algorithms identified overlapping but distinct sets of up to 10 predictors. Early identification of high-risk patients using these models could support clinical decision-making and potentially improve the prognosis of hyperglycemic crisis patients.
Clinical trial number: Not applicable.
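The validation design described in the Methods, a stratified 80/20 split followed by ten-fold cross-validation on the training portion, can be sketched as follows. Synthetic imbalanced data and a single Random Forest stand in for the study's cohort and model set.

```python
# Sketch of the study's evaluation design: stratified 80/20 split, then
# ten-fold CV on the training set. Synthetic data with ~10% positives
# mimics (very roughly) the imbalanced mortality outcome.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=400, n_features=15,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X_tr, y_tr, cv=10, scoring="roc_auc")
print(len(scores))  # one AUC per fold
```

Hyperparameter choices would be compared on these fold scores, with the held-out validation set reserved for the final model comparison.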
- Research Article
1
- 10.1002/spy2.494
- Feb 5, 2025
- SECURITY AND PRIVACY
Abstract: In today's era of increased smart device usage, the prevalence of malicious attacks targeting mobile devices has risen significantly. Malware developers aim to infect as many mobile devices as possible, capitalizing on the vulnerabilities present in different operating systems that often host sensitive programs such as educational and banking applications. To combat this issue, this article introduces a mobile botnet detection system that leverages machine learning algorithms such as random forest, k-nearest neighbor, logistic regression, and decision trees. Additionally, the system incorporates the golden ratio evolutionary algorithm for dimension reduction and feature selection. The proposed method's performance is evaluated using detection accuracy, precision, F1-score, and recall as evaluation criteria on the NSL-KDD, Drebin, and ISCX datasets. The results are compared with evaluation values obtained without applying the golden ratio feature selection algorithm. The findings demonstrate that the proposed method, particularly when combined with the random forest algorithm, achieves the highest accuracy on the target dataset. Moreover, the decision tree and k-nearest neighbor algorithms also exhibit superior detection accuracy within the proposed method's framework. By utilizing the proposed method, the training time is reduced through feature selection, enabling swift mobile botnet identification.
- Conference Article
3
- 10.1109/icears53579.2022.9752308
- Mar 16, 2022
Feature selection is the selection of relevant features according to feature scores. It is a standard process for eliminating irrelevant features, reducing dimensionality, and improving classification accuracy. This paper proposes an FSMDAD model for selecting the top ten features using the Chi-Square, Extra Tree, ANOVA, and Mutual Information feature selection methods. The most influential features were found to be the total time between two packets in the forward direction (FwdIATTotal) and the duration of the flow in microseconds (FlowDuration). Finally, a series of iterations was performed by integrating the above feature selection methods with machine learning classifiers (random forest and decision tree). Random Forest and Decision Tree give the strongest results with the Extra Tree method. Since the Extra Tree selector performs best, the best features for different types of DDoS attacks were derived using it.
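Ranking a top-10 subset under several scoring functions, as in the setup above, can be sketched like this. Synthetic data stands in for the DDoS flow features; the four scorers here (Chi-Square, ANOVA F-test, mutual information, and an Extra-Trees importance ranking) mirror those named in the abstract, but the specific rankings are illustrative only.

```python
# Sketch: top-10 feature rankings from four scoring methods, loosely
# following the FSMDAD setup. Synthetic data, illustrative results only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=30, n_informative=6,
                           random_state=0)
Xp = X - X.min(axis=0)  # chi2 needs non-negative values

rankings = {
    "chi2": np.argsort(chi2(Xp, y)[0])[::-1][:10],
    "anova": np.argsort(f_classif(X, y)[0])[::-1][:10],
    "mutual_info": np.argsort(mutual_info_classif(X, y,
                                                  random_state=0))[::-1][:10],
    "extra_trees": np.argsort(ExtraTreesClassifier(random_state=0)
                              .fit(X, y).feature_importances_)[::-1][:10],
}
# Features picked by every method are the strongest candidates (may be empty).
consensus = set.intersection(*(set(r.tolist()) for r in rankings.values()))
print(all(len(r) == 10 for r in rankings.values()))
```

Each top-10 subset would then be paired with the random forest and decision tree classifiers to find the best-performing combination.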
- Research Article
2
- 10.48185/jaai.v3i2.601
- Dec 31, 2022
- Journal of Applied Artificial Intelligence
Enrollment in courses is a key performance indicator in educational systems for maintaining academic and financial viability. Today, many factors, comprising demographic and individual features such as age, gender, academic background, financial capability, and academic degree of choice, contribute to student attrition rates at various higher education institutions. In this study, we developed prediction models for students' attrition rate in pursuing a computer science degree, as well as for those who have a high chance of dropping out before graduation, using machine learning methodologies. This approach can assist higher education institutions in creating effective interventions to lower attrition rates and raise the likelihood that students will succeed academically. Student data from 2015 to 2022 were collected from the Federal University Lokoja (FUL), Nigeria. The data were preprocessed using existing WEKA machine learning libraries, with the data converted into the Attribute-Relation File Format (ARFF). Resampling techniques were then used to partition the data into training and testing sets, and correlation-based feature selection was applied to develop the students' attrition model and identify students at risk of attrition. Random Forest and decision tree machine learning algorithms were used to predict student attrition. The results showed that Random Forest achieved 79.45% accuracy, while the accuracy of Random Tree stood at 78.09%. This is an improvement over previous results, where accuracies of 66.14% and 57.48% were recorded for Random Forest and Random Tree respectively. This improvement was due to the techniques demonstrated in this study, and applying these techniques to the classification model is recommended to improve its performance.
- Research Article
44
- 10.3233/ida-150789
- Nov 3, 2015
- Intelligent Data Analysis
Decision tree is a simple and effective method, and it can be supplemented with ensemble methods to improve its performance. Random Forest and Rotation Forest are two approaches currently perceived as classics. They can build more accurate and diverse classifiers than Bagging and Boosting by introducing diversity, namely a randomly chosen subset of features or a rotated feature space. However, the splitting criteria used for constructing each tree in Random Forest and Rotation Forest are the Gini index and information gain ratio respectively, both of which are skew-sensitive. When learning from highly imbalanced datasets, class imbalance impedes their ability to learn the minority class concept. The Hellinger distance decision tree (HDDT), proposed by Chawla, is skew-insensitive. In particular, bagged unpruned HDDT has proven to be an effective way to deal with highly imbalanced problems. Nevertheless, the bootstrap sampling used in Bagging can lead to ensembles of low diversity compared to Random Forest and Rotation Forest. In order to combine the skew-insensitivity of HDDT with the diversity of Random Forest and Rotation Forest, we use Hellinger distance as the splitting criterion for building each tree in Random Forest and Rotation Forest respectively. An experimental framework is applied across a wide range of highly imbalanced datasets to investigate the effectiveness of Hellinger distance, information gain ratio, and the Gini index as splitting criteria in ensembles of decision trees, including Bagging, Boosting, Random Forest, and Rotation Forest. In addition, Balanced Random Forest is also included in the experiment, since it is designed to tackle the class imbalance problem.
The experimental results, contrasted through nonparametric statistical tests, demonstrate that using Hellinger distance as the splitting criterion for building the individual decision trees in a forest can improve the performance of Random Forest and Rotation Forest for highly imbalanced classification.
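For a binary-class node and a binary split, the Hellinger criterion compares how the two classes distribute across the branches. A minimal sketch of the computation (following the Cieslak and Chawla HDDT formulation, with illustrative counts):

```python
# Minimal sketch of the Hellinger-distance splitting criterion for a
# binary split of a binary-class node. Counts are illustrative only.
import math

def hellinger_split(l_pos, l_neg, r_pos, r_neg):
    """Hellinger distance between the per-class distributions of samples
    over the left/right branches of a candidate split."""
    t_pos, t_neg = l_pos + r_pos, l_neg + r_neg
    return math.sqrt(
        (math.sqrt(l_pos / t_pos) - math.sqrt(l_neg / t_neg)) ** 2
        + (math.sqrt(r_pos / t_pos) - math.sqrt(r_neg / t_neg)) ** 2
    )

# A split that cleanly separates the classes scores the maximum, sqrt(2),
# regardless of how imbalanced the classes are; this is the
# skew-insensitivity the abstract refers to.
print(round(hellinger_split(99, 0, 0, 1), 4))   # 1.4142 despite 99:1 imbalance
print(round(hellinger_split(50, 50, 49, 51), 4))  # near zero: useless split
```

Because the criterion depends only on within-class branch proportions, not on class priors, a rare minority class does not dampen the score of a good split, unlike the Gini index or information gain ratio.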
- Conference Article
5
- 10.1109/icacat.2018.8933787
- Dec 1, 2018
Email is necessary and essential for communication in today's life, and the number of internet users keeps increasing. Spam mail is a major problem that researchers work to analyze and reduce. Spam emails arrive in bulk; they contain trojans, viruses, and malware, and cause phishing attacks. Problems arise when large numbers of unwanted mails come from unknown sites and each received email must be classified as spam or ham. This paper classifies incoming emails as spam or ham using different classification techniques, in order to identify spam mail and remove it. A Naive Bayes classifier is applied using the concept of posterior probability, along with the decision tree algorithms Random Tree, REPTree, Random Forest, and the J48 decision tree classifier. For the identification of spam mail, the UCI Spambase dataset is used. It is a benchmark dataset containing 58 attributes and 4601 instances. Weka software is used for the analysis and implementation of the results, with its classification algorithms applied in the classification phase. This work plays an important role in removing viruses, trojans, malware, phishing websites, and fraudulent attempts in emails. Feature selection is applied to the dataset for the training set and cross-validation; the CfsSubsetEval method with best-first search is used for feature selection. For the classification of spam mail, two tests are used under the classifier option in the Weka tool: cross-validation and the training set. For the training set, the same data are used for training and testing; for cross-validation, the training data are segmented into a number of folds. Finally, using the training set, Random Tree gives the best result for the classification of spam mail.
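The Naive Bayes side of the comparison, posterior-probability classification of spam versus ham, can be sketched in a few lines. A tiny toy corpus stands in for the UCI Spambase data (which is numeric word-frequency data, not raw text), and scikit-learn stands in for Weka.

```python
# Sketch: Naive Bayes spam/ham classification via posterior probability.
# Toy corpus and scikit-learn stand in for the Spambase data and Weka.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

mails = ["win free money now", "meeting agenda attached",
         "free prize claim now", "project report draft"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(mails)        # word-count features
nb = MultinomialNB().fit(X, labels)  # learns P(word | class) with smoothing

pred = nb.predict(vec.transform(["claim your free money"]))[0]
print(pred)  # 1: spam-associated words dominate the posterior
```

The tree-based classifiers in the comparison would be trained on the same count features, which is essentially what Weka does with the Spambase attributes.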
- Research Article
- 10.1016/j.cmpb.2025.109170
- Feb 1, 2026
- Computer methods and programs in biomedicine
A stable feature selection method based on majority voting and SHAP for high-dimensional metabolomics data.
- Research Article
4
- 10.3390/f14091864
- Sep 13, 2023
- Forests
An accurate and efficient estimation of eucalyptus plantation areas is of paramount significance for forestry resource management and ecological environment monitoring. Currently, combining multidimensional optical and SAR images with machine learning has become an important method for eucalyptus plantation classification, but there are still some challenges in feature selection. This study proposes a feature selection method that combines multi-temporal Sentinel-1 and Sentinel-2 data with SLPSO (social learning particle swarm optimization) and RFE (Recursive Feature Elimination), which reduces the impact of information redundancy and improves classification accuracy. Specifically, this paper first fuses multi-temporal Sentinel-1 and Sentinel-2 data, and then carries out feature selection by combining SLPSO and RFE to mitigate the effects of information redundancy. Next, based on features such as the spectrum, red-edge indices, texture characteristics, vegetation indices, and backscatter coefficients, the study employs the Simple Non-Iterative Clustering (SNIC) object-oriented method and three different types of machine-learning models: Random Forest (RF), Classification and Regression Trees (CART), and Support Vector Machines (SVM) for the extraction of eucalyptus plantation areas. Each model uses a supervised-learning method, with labeled training data guiding the classification of eucalyptus plantation regions. Lastly, to validate the efficacy of selecting multi-temporal data and the performance of the SLPSO–RFE model in classification, a comparative analysis is undertaken against the classification results derived from single-temporal data and the ReliefF–RFE feature selection scheme. The findings reveal that employing SLPSO–RFE for feature selection significantly elevates the classification precision of eucalyptus plantations across all three classifiers. The overall accuracy rates were noted at 95.48% for SVM, 96% for CART, and 97.97% for RF. 
When contrasted with classification outcomes from multi-temporal data and ReliefF–RFE, the overall accuracy for the three models saw increases of 10%, 8%, and 8.54%, respectively. The accuracy enhancement was even more pronounced when compared with results from single-temporal data and ReliefF–RFE, at increments of 15.25%, 13.58%, and 14.54%, respectively. The insights from this research carry profound theoretical implications and practical applications, particularly in identifying and extracting eucalyptus plantations leveraging multi-temporal data and feature selection.
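The RFE half of the SLPSO–RFE scheme can be sketched as follows. The SLPSO stage and the Sentinel-1/Sentinel-2 feature stack themselves are not reproduced; synthetic data and a Random Forest ranker stand in, and the subset size is illustrative.

```python
# Sketch of Recursive Feature Elimination (RFE) with a Random Forest
# ranker, illustrating the RFE stage of the SLPSO-RFE scheme. Synthetic
# features stand in for the spectral/texture/backscatter stack.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                           random_state=0)
# Repeatedly drop the 2 least important features until 8 remain.
rfe = RFE(RandomForestClassifier(random_state=0),
          n_features_to_select=8, step=2).fit(X, y)
print(int(rfe.support_.sum()))  # 8 features kept
```

In the full scheme, SLPSO would search over candidate subsets first, with RFE refining the result before it is handed to the RF, CART, and SVM classifiers.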
- Research Article
69
- 10.1007/s42454-020-00006-y
- Apr 6, 2020
- Human-Intelligent Systems Integration
Thyroid disease is spreading rapidly among women after the age of 30 years. It is therefore necessary to examine thyroid datasets for predicting the disease at an early stage, so that precautions can be taken against the dangerous condition of thyroid cancer. A decision tree is used to extract hidden patterns from the stored datasets. The objective of this research paper is to examine the thyroid disease dataset using decision tree, random forest, and classification and regression tree (CART) classifiers, and then to enhance the results using the bagging ensemble technique. The proposed experiment was done on 3710 instances and 29 features of thyroid patients. The overall prediction depends on the target variable, which is divided into sick and negative classes. The accuracy of the prediction was calculated for different num-fold and seed values. The results obtained by the individual classification algorithms, decision tree, random forest, and extra tree, give accuracies of 98%, 99%, and 93%, respectively. We then developed a bagging ensemble method combining the three basic tree classifiers and applied it to the same dataset, which gives a better accuracy of 100% with seed value 35 and num-fold value 10. This proposed ensemble method can be used for better prediction of thyroid disease.
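An ensemble over the three tree learners named above can be sketched as follows. This is a loose stand-in, not the paper's exact Weka-style bagging setup with seed and num-fold tuning: a soft-voting combination of scikit-learn's tree models on synthetic data with the same feature count.

```python
# Sketch: combining three tree classifiers into one ensemble, loosely
# following the paper's setup. Soft voting over scikit-learn tree models
# stands in for the Weka bagging configuration; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=29, random_state=35)
ensemble = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("et", ExtraTreesClassifier(random_state=0))],
    voting="soft")  # average the three models' class probabilities
scores = cross_val_score(ensemble, X, y, cv=10)  # ten folds, as in the study
print(len(scores))
```

Averaging probabilities across heterogeneous trees smooths out the weakest individual learner, which is one plausible mechanism for the accuracy gain the abstract reports.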