Support Vector Machine for Accurate Classification of Diabetes Risk Levels
This research explores the application of Support Vector Machines (SVM) for accurately classifying diabetes risk levels based on a publicly available dataset containing 768 instances and 9 attributes, including glucose levels, BMI, blood pressure, and insulin levels. The model's systematic development process involved data preprocessing, feature selection, and hyperparameter optimization to ensure robust performance. Results indicate an overall accuracy of 76%, with high precision and recall for the non-diabetic risk class but relatively lower performance for the diabetic risk class, highlighting the challenges posed by class imbalance and overlapping data features. To address these issues, future research should incorporate advanced resampling techniques, refined feature engineering, and alternative machine learning models such as Random Forest or XGBoost. This research underscores the potential of SVM as a valuable tool for early diabetes detection, offering healthcare professionals a reliable means to identify at-risk individuals and personalize intervention strategies. By bridging theoretical advancements and practical applications, the research contributes to enhancing predictive analytics in medical diagnostics, paving the way for improved patient outcomes and efficient public health management.
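The development process described above (preprocessing, feature selection, hyperparameter optimization) can be sketched as a scikit-learn pipeline. This is a minimal illustration, not the authors' implementation: the synthetic data below merely stands in for the 768-instance Pima-style dataset, and the grid values are assumptions.

```python
# Hedged sketch: scaling + univariate feature selection + RBF-SVM,
# tuned with cross-validated grid search. Synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the 768-row, 8-predictor diabetes dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           weights=[0.65, 0.35], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),         # SVMs are sensitive to feature scale
    ("select", SelectKBest(f_classif)),  # simple univariate feature selection
    ("svm", SVC(kernel="rbf")),
])
grid = GridSearchCV(pipe, {"select__k": [4, 6, 8],
                           "svm__C": [0.1, 1, 10],
                           "svm__gamma": ["scale", 0.1]}, cv=5)
grid.fit(X_tr, y_tr)
acc = grid.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```

Wrapping all three steps in one pipeline keeps the feature selection inside the cross-validation folds, which avoids the optimistic bias of selecting features on the full dataset first.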
- Conference Article
1
- 10.1109/raics.2011.6069364
- Sep 1, 2011
The objective of this paper is to compare the performance of Hierarchical Soft (max-min) Decision Trees and Support Vector Machines (SVM) in optimizing fuzzy outputs for the classification of epilepsy risk levels from EEG (electroencephalogram) signals. The fuzzy pre-classifier classifies the risk levels of epilepsy based on parameters extracted from the patient's EEG signals, such as energy, variance, peaks, sharp and spike waves, duration, events, and covariance. Hierarchical Soft Decision Tree (HDT) post-classifiers with four types of max-min criteria and Support Vector Machine (SVM) are applied to the classified data to identify the optimized risk level (singleton) that characterizes the patient's risk level. The efficacy of these methods is compared using benchmark parameters such as Performance Index (PI) and Quality Value (QV).
- Conference Article
2
- 10.1109/cibec.2008.4786062
- Dec 1, 2008
The objective of this paper is to design, simulate, and synthesize a simple, suitable, and reliable VLSI fuzzy processor for classification of diabetic epilepsy risk levels. The performance of three different fuzzy techniques is analyzed and compared. In designing the fuzzy processor, the cerebral blood flow (CBF), EEG signal features, and aggregation operators are taken as parameters. The classification of risk level is based on clinical data and observation. Three different fuzzy techniques with minimal rule sets, such as a two-input heterogeneous fuzzy technique and single input rule models (SIRM), are analyzed. A parallel architecture with independent functional units is incorporated in the design; these functional units process data simultaneously, enhancing the processing speed. The SIRM fuzzy systems with Bell input-Bell output and Bell input-Triangle output are simulated and synthesized for various values of cerebral blood flow using VHDL. The simulated and synthesized field-programmable gate array (FPGA) fuzzy processor closely follows the MATLAB version.
- Research Article
- 10.3233/jifs-233511
- Mar 5, 2024
- Journal of Intelligent & Fuzzy Systems
Contemporary real-world datasets often suffer from class imbalance as well as high dimensionality. For combating class imbalance, data resampling is a commonly used approach, whereas for tackling high dimensionality, feature selection is used. These problems have been studied extensively as independent problems in the literature, but the possible synergy between them is still not clear. This paper studies the effects of addressing both issues in conjunction by using a combination of resampling and feature selection techniques on binary-class imbalanced classification. In particular, the primary goal of this study is to prioritize the sequence or pipeline of using these techniques and to analyze the performance of the two opposite pipelines that apply feature selection before or after resampling, i.e., F + S or S + F. For this, a comprehensive empirical study is carried out by conducting a total of 34,560 tests on 30 publicly available datasets, using a combination of 12 resampling techniques for class imbalance and 12 feature selection methods and evaluating the performance on 4 different classifiers. Through the experiments we conclude that there is no specific pipeline that proves better than the other, and both pipelines should be considered for obtaining the best classification results on high-dimensional imbalanced data. Additionally, when using Decision Tree (DT) or Random Forest (RF) as the base learner, the predominance of S + F over F + S is observed, whereas in the case of Support Vector Machine (SVM) and Logistic Regression (LR), F + S outperforms S + F in most cases.
According to the mean ranking obtained from the Friedman test, the best combinations of resampling and feature selection techniques for DT, SVM, LR, and RF are SMOTE + RFE (Synthetic Minority Oversampling Technique and Recursive Feature Elimination), Least Absolute Shrinkage and Selection Operator (LASSO) + SMOTE, SMOTE + embedded feature selection using RF, and SMOTE + RFE, respectively.
- Research Article
3
- 10.1186/s12859-021-04478-w
- Dec 1, 2021
- BMC Bioinformatics
Background: Supervised classification methods have been used for many years for feature selection in metabolomics and other omics studies. We developed a novel primal-dual based classification method (PD-CR) that can perform classification with rejection and feature selection on high dimensional datasets. PD-CR projects data onto a low dimension space and performs classification by minimizing an appropriate quadratic cost. It simultaneously optimizes the selected features and the prediction accuracy with a new tailored, constrained primal-dual method. The primal-dual framework is general enough to encompass various robust losses and to allow for convergence analysis. Here, we compare PD-CR to three commonly used methods: partial least squares discriminant analysis (PLS-DA), random forests and support vector machines (SVM). We analyzed two metabolomics datasets: one urinary metabolomics dataset concerning lung cancer patients and healthy controls; and a metabolomics dataset obtained from frozen glial tumor samples with mutated isocitrate dehydrogenase (IDH) or wild-type IDH. Results: PD-CR was more accurate than PLS-DA, Random Forests and SVM for classification using the two metabolomics datasets. It also selected biologically relevant metabolites. PD-CR has the advantage of providing a confidence score for each prediction, which can be used to perform classification with rejection. This substantially reduces the False Discovery Rate. Conclusion: PD-CR is an accurate method for classification of metabolomics datasets which can outperform PLS-DA, Random Forests and SVM while selecting biologically relevant features. Furthermore the confidence score provided with PD-CR can be used to perform classification with rejection and reduce the false discovery rate.
- Research Article
47
- 10.1186/s12911-022-01821-w
- Mar 28, 2022
- BMC Medical Informatics and Decision Making
Background: Imbalance between positive and negative outcomes, a so-called class imbalance, is a problem generally found in medical data. Despite various studies, class imbalance has always been a difficult issue. The main objective of this study was to find an effective integrated approach to address the problems posed by class imbalance and to validate the method in an early screening model for a rare cardiovascular disease, aortic dissection (AD). Methods: Different data-level methods, cost-sensitive learning, and the bagging method were combined to solve the problem of low sensitivity caused by the imbalance of two classes of data. First, feature selection was applied to select the most relevant features using statistical analysis, including significance test and logistic regression. Then, we assigned two different misclassification cost values for two classes, constructed weak classifiers based on the support vector machine (SVM) model, and integrated the weak classifiers with undersampling and bagging methods to build the final strong classifier. Due to the rarity of AD, the data imbalance was particularly prominent. Therefore, we applied our method to the construction of an early screening model for AD disease. Clinical data of 523,213 patients from the Institute of Hypertension, Xiangya Hospital, Central South University were used to verify the validity of this method. In these data, the sample ratio of AD patients to non-AD patients was 1:65, and each sample contained 71 features. Results: The proposed ensemble model achieved the highest sensitivity of 82.8%, with training time and specificity reaching 56.4 s and 71.9% respectively. Additionally, it obtained a small variance of sensitivity of 19.58 × 10⁻³ in the seven-fold cross-validation experiment.
The results outperformed the common ensemble algorithms of AdaBoost, EasyEnsemble, and Random Forest (RF), as well as the single machine learning (ML) methods of logistic regression, decision tree, k-nearest neighbors (KNN), back-propagation neural network (BP), and SVM. Among the five single ML algorithms, the SVM model with cost-sensitive learning performed best, with a sensitivity of 79.5% and a specificity of 73.4%. Conclusions: In this study, we demonstrate that the integration of feature selection, undersampling, cost-sensitive learning, and bagging methods can overcome the challenge of class imbalance in a medical dataset and develop a practical screening model for AD, which could support decisions on screening for AD at an early stage.
- Conference Article
3
- 10.1063/1.5012168
- Jan 1, 2017
DNA microarrays produce gene expression data with small sample sizes and a high number of features. Furthermore, class imbalance is a common problem in microarray data; it occurs when a dataset is dominated by a class that has significantly more instances than the minority classes. A classification method is therefore needed that handles both high-dimensional and imbalanced data. Support Vector Machine (SVM) is a classification method capable of handling large or small samples, nonlinearity, high dimensionality, overfitting, and local minimum issues. SVM has been widely applied to DNA microarray classification, where it has been shown to provide the best performance among machine learning methods. However, imbalanced data remain a problem because SVM treats all samples with the same importance, so the results are biased toward the majority class. To overcome the imbalanced data, Fuzzy SVM (FSVM) is proposed. This method applies a fuzzy membership to each input point and reformulates the SVM so that different input points contribute differently to the classifier. Minority-class samples receive large fuzzy memberships, so FSVM can pay more attention to them. Given that DNA microarray data are high-dimensional with a very large number of features, feature selection is first performed using the Fast Correlation-Based Filter (FCBF). In this study, SVM, FSVM, and both methods combined with FCBF are analyzed and their classification performance is compared. Based on the overall results, FSVM on selected features has the best classification performance compared to SVM.
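FSVM proper reformulates the SVM objective, but its core idea, giving each training point a different importance, can be approximated in scikit-learn by passing per-sample weights to `SVC.fit`. The membership scheme below (weight inversely proportional to class frequency) is an assumed stand-in for the paper's fuzzy membership function, shown on synthetic data.

```python
# Hedged sketch: approximate FSVM's per-sample importance with
# sample_weight so minority points count more in the SVM objective.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=30, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Assumed "membership": inverse class frequency, so rarer classes weigh more.
freq = np.bincount(y_tr) / len(y_tr)
membership = 1.0 / freq[y_tr]

clf = SVC(kernel="rbf").fit(X_tr, y_tr, sample_weight=membership)
minority_recall = recall_score(y_te, clf.predict(X_te))
print(f"minority-class recall: {minority_recall:.3f}")
```

A true FSVM would let the membership vary within a class as well (e.g. by distance to the class center), which plain per-class weighting cannot express.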
- Research Article
24
- 10.1016/j.artmed.2014.06.003
- Jun 21, 2014
- Artificial Intelligence in Medicine
Cancer survival classification using integrated data sets and intermediate information.
- Research Article
3
- 10.52756/ijerr.2024.v43spl.004
- Sep 30, 2024
- International Journal of Experimental Research and Review
The purpose of Network Intrusion Detection Systems (NIDS) is to protect computer networks from harmful actions. A major concern in NIDS development is the class imbalance problem, i.e., normal traffic dominates the communication data plane far more than intrusion attempts. This state of affairs can undermine the effectiveness of detection algorithms, including those needed to detect less frequent but highly dangerous intrusions. This paper utilizes resampling techniques to tackle the class imbalance problem in NIDS using a Support Vector Machine (SVM) classifier, alongside features selected by Random Forest to improve the feature subset selection process. The analysis highlights the competitiveness of each sampling method, offering insights into their efficiency and practicality for real-world applications. The resampling approaches analyzed are the Synthetic Minority Over-sampling Technique (SMOTE), Random Under-sampling (RUS), Random Over-sampling (ROS), and two combinations of these, i.e., RUS + SMOTE and RUS + ROS. Feature selection was done using Random Forest, improved by Bayesian methods, to create feature subsets with rankings determined by the Cumulative Feature Importance Score (CFIS). The CIDDS-2017 dataset is used for the performance evaluation, and the metrics used include accuracy, precision, recall, F-measure, and CPU time. The algorithm that performs best overall across the CFIS feature subsets is SMOTE, with the best results obtained at the 90% level with 25 features. This subset accomplishes a relative accuracy enhancement of 0.08% over the other approaches. The RUS + ROS technique also performs well but is somewhat slower than SMOTE. On the other hand, RUS + SMOTE shows relatively poor results, although it consumes less computation time than the other methods, delivering about 50% of their performance.
This paper's novelty lies in adapting the RUS method as a standalone test for screening new and potentially contaminated datasets. The standalone RUS method is more computationally efficient; it returned its best result of 98.13% accuracy at the 85% CFIS level with 34 features and a computation time of 137.812 s. SMOTE remains the most proficient of the resampling techniques used for handling class imbalance in NIDS with the 90% CFIS feature subset. Future research directions could include applying these techniques to different datasets and to other machine learning and deep learning methods, together with ROC curve analysis, to provide useful pointers for NIDS designers on selecting the right data mining tools and strategies for their projects.
- Research Article
50
- 10.1176/ajp.2006.163.10.1697
- Oct 1, 2006
- American Journal of Psychiatry
Our friend and colleague Wayne Fenton asked to write this article for the Journal because of his desire to educate other psychiatrists about the treatment of schizophrenia, including what he recognized to be a growing problem with the metabolic syndrome. This lifelong passion, which he pursued from his psychiatry residency at Yale, through his directorship of Chestnut Lodge, to his position at NIMH as Director of the Division of Adult Translational Research and Associate Director for Clinical Affairs, ended tragically with his killing during an evaluation of a psychotic young man. Wayne had worked tirelessly to secure support for new drug discovery in the NIMH programs that he directed. The Journal will be initiating in 2007 a series of articles on the discovery of new mental illness treatments. We will dedicate this series to Wayne's memory and include with it a memorial of his life and contributions to the treatment of mental illness.
- Conference Article
6
- 10.1109/whispers.2010.5594937
- Jun 1, 2010
Support Vector Machines (SVM) for image classification proved to perform well in many applications. However, they are often not preferred in hyperspectral image analysis due to long processing times caused by a high number of support vectors and large data sets. We present two approaches that speed-up the classification process with SVM by a) simplifying the original SVM, i.e. reducing the number of support vectors, and b) reducing the number of features by selecting relevant, non-redundant features. Results for three classification problems are shown. By applying the two approaches, we observe reduction rates a) between 9.1% and 27.2% for the number of support vectors and b) from 86.8% to 93.0% of features, both without significant decreases in classification accuracy. This enables a fast mapping of complete hyperspectral scenes.
- Research Article
- 10.2196/71994
- Oct 10, 2025
- JMIR Medical Informatics
Background: Machine learning (ML) has shown great potential in recognizing complex disease patterns and supporting clinical decision-making. Diabetic foot ulcers (DFUs) represent a significant multifactorial medical problem with high incidence and severe outcomes, providing an ideal example for a comprehensive framework that encompasses all essential steps for implementing ML in a clinically relevant fashion. Objective: This paper aims to provide a framework for the proper use of ML algorithms to predict clinical outcomes of multifactorial diseases and their treatments. Methods: The comparison of ML models was performed on a DFU dataset. The selection of patient characteristics associated with wound healing was based on outcomes of statistical tests, that is, ANOVA and chi-square test, and validated on expert recommendations. Imputation and balancing of patient records were performed with MIDAS (Multiple Imputation with Denoising Autoencoders) Touch and adaptive synthetic sampling, respectively. Logistic regression, support vector machine (SVM), k-nearest neighbors, random forest (RF), extreme gradient boosting (XGBoost), Bayesian additive regression trees, and artificial neural network were trained, cross-validated, and optimized using random sampling on the patient dataset. To evaluate model calibration and clinical utility, calibration curves, Brier scores, and decision curve analysis (DCA) were performed. Results: The exploratory dataset consisted of 700 patient records with 199 variables. After dataset cleaning, the variables used for model training included age, smoking status, toe systolic pressure, blood pressure, oxygen saturation, hemoglobin, hemoglobin A1c, estimated glomerular filtration rate, wound location, diabetes type, Texas wound classification, neuropathy, and wound area measurement. The SVM obtained a stable accuracy of 0.853 (95% CI 0.810-0.896) with an area under the receiver operating characteristic curve of 0.922 (95% CI 0.889-0.955).
The RF and XGBoost acquired accuracies of 0.838 (95% CI 0.793-0.883) and 0.815 (95% CI 0.768-0.862), respectively, with areas under the receiver operating characteristic curve of 0.917 (95% CI 0.883-0.951) for RF and 0.889 (95% CI 0.849-0.929) for XGBoost. SVM, RF, and XGBoost were well calibrated, with average Brier scores around 0.127 (SD 0.013). DCA showed that the SVM provided the highest net clinical benefit across relevant risk thresholds. Conclusions: Handling missing values, feature selection, and addressing class imbalance are critical steps in developing ML applications for clinical research. Seven models, each representing a different branch of ML, were selected to compare their predictive power regarding complete wound healing. In this initial DFU dataset used as an example, the SVM achieved the best performance in predicting clinical outcomes, followed by RF and XGBoost. The models' calibration and clinical utility were assessed through calibration curves, Brier scores, and DCA, demonstrating their potential relevance in clinical decision-making.
- Conference Article
3
- 10.1145/3474963.3474972
- Jun 25, 2021
Amharic is an ancient Semitic language that serves as the official language of the Federal Republic of Ethiopia. Due to the large number of historical and literary documents written in this language, an automated OCR system is highly demanded. However, previous approaches have been based on traditional machine learning algorithms that focus on hand-crafted feature extraction, and the performance of these methods is greatly affected by the presence of a large set of structurally similar characters. Therefore, according to various studies on Amharic characters, this problem can be solved by examining robust feature extraction techniques. In this study, we proposed a hybrid method that uses the deep learning models Convolutional Neural Network (CNN) and Convolutional Auto-Encoder (CAE) for feature extraction, the Random Forest (RF) and Mutual Information (MI) feature selection methods for selecting top features, and a traditional machine learning algorithm, Support Vector Machine (SVM), for classification. First, the features extracted by the two deep models were combined to form hybrid features, and then top features were selected by applying feature selection. The common features selected by the two feature selection methods were later used for recognition by SVM. Experimental results using CNN-extracted features achieved an accuracy of 96.03%, while CAE-extracted features achieved an accuracy of 92.52%. On the other hand, the proposed method based on the intersection of features selected by the RF and MI feature selection methods achieved an accuracy of 97.06%.
- Research Article
58
- 10.3390/app10155075
- Jul 23, 2020
- Applied Sciences
Machine learning algorithms are crucial for crop identification and mapping. However, many works only focus on the identification results of these algorithms and pay less attention to their classification performance and mechanism. In this paper, based on Google Earth Engine (GEE), Sentinel-2 10 m resolution images during a specific phenological period of winter wheat were obtained. Then, support vector machine (SVM), random forest (RF), and classification and regression tree (CART) machine learning algorithms were employed to identify and map winter wheat in a large-scale area. The hyperparameters of the three machine learning algorithms were tuned by grid search and the 5-fold cross-validation method. The classification performance of the three machine learning algorithms was compared, and the results demonstrate that SVM achieves the best performance in identifying winter wheat; its overall accuracy (OA), user's accuracy (UA), producer's accuracy (PA), and kappa coefficient (Kappa) are 0.94, 0.95, 0.95, and 0.92, respectively. Moreover, 50 various combinations of training and validation sets were used to analyze the generalization ability of the algorithms, and the results show that the average OA of SVM, RF, and CART are 0.93, 0.92, and 0.88, respectively, thus indicating that SVM and RF are more robust than CART. To further explore the sensitivity of SVM, RF, and CART to variations of the algorithm parameters, namely (C and gamma), (tree and split), and (maxD and minSP), we employed the grid search method to iterate these parameters, respectively, and analyzed the effect of these parameters on the accuracy scores and classification residuals. It was found that as (C and gamma) vary over (0.01~1000), SVM's maximum variation of accuracy score is up to 0.63, and the maximum variation of residuals is 76,215 km2. We concluded that SVM is sensitive to the parameters (C and gamma) and presents a positive correlation.
When the parameters (tree and split) vary between (100~600) and (1~6), respectively, RF's maximum variation of accuracy score is 0.08, and the maximum variation of residuals is 1157 km2, indicating that RF has low sensitivity toward the parameters (tree and split). When CART's parameters (maxD and minSP) vary between (10~60), the maximum accuracy change is 0.06, and the maximum variation of residuals is 6943 km2. Therefore, compared to RF, CART is sensitive to the parameters (maxD and minSP) and has poorer robustness. In general, with tuned hyperparameters, SVM and RF exhibit optimal classification performance, while CART performs relatively worse. Meanwhile, SVM, RF, and CART have different sensitivities toward the algorithm parameters; SVM and CART are more sensitive, while RF has low sensitivity to changes in its parameters. Different parameter settings cause great changes in the accuracy scores and residuals, so it is necessary to determine the algorithm hyperparameters. Default parameters can generally be used to achieve crop classification, but we recommend an enumeration method, similar to grid search, as a practical way to improve the classification performance of the algorithm when the best classification effect is expected.
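The sensitivity analysis described above amounts to sweeping a parameter grid and measuring how much the score moves. A toy version for SVM's (C, gamma) can be sketched as follows; the grid values and synthetic data are illustrative assumptions, far smaller than the study's GEE-scale experiment.

```python
# Hedged sketch: coarse (C, gamma) sweep to measure how much
# cross-validated accuracy varies across the grid.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

scores = []
for C in (0.01, 1, 100):
    for gamma in (0.01, 1, 100):
        s = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()
        scores.append(s)

# Spread of scores across the grid = a crude sensitivity measure.
variation = max(scores) - min(scores)
print(f"accuracy variation across grid: {variation:.2f}")
```

A large spread, like the 0.63 swing the paper reports for SVM, means the parameter must be tuned rather than left at defaults; a small spread, as reported for RF, means defaults are comparatively safe.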
- Conference Article
7
- 10.1109/cibim.2011.5949221
- Apr 1, 2011
In this paper, we demonstrate the need for dimensionality reduction to mitigate model overfitting on the nontrivial problem of gender classification from digital images. In this study we explore four feature selection schemes using Genetic Algorithms, Memetic Algorithms, and Random Forest, which are fed to a nonlinear support vector machine (SVM) for final classification. The performance of the model (feature) selection approaches is evaluated against two distinct datasets of facial images: FG-NET, which contains toddlers to seniors, and UIUC-PAL, which contains faces of adults up to seniors. This work demonstrates that feature selection can, and does, significantly improve the performance of an SVM-based gender classification system.
- Research Article
15
- 10.1088/2057-1976/ac2354
- Sep 15, 2021
- Biomedical Physics & Engineering Express
Grasping objects is the most frequent activity performed by the human upper limb, and amputation of the upper limb results in the need for prosthetic devices. Myoelectric prosthetic devices use muscle signals and apply control techniques to identify different hand gestures and force levels. In this study, a multi-level force contraction experiment was performed in which electromyography (EMG) signals and fingertip force signals were acquired. Using this experimental data, a two-step feature selection process was applied to design a pattern recognition algorithm for the classification of different force levels. The two-step feature selection process consists of generalized feature ranking using ReliefF, followed by personalized feature selection using Neighborhood Component Analysis (NCA) on the features shortlisted by the first technique. The classification algorithms applied in this study were Support Vector Machines (SVM) and Random Forest (RF). Besides feature selection, the number of muscles used during classification of force levels was also optimized using the designed algorithm. Based on this algorithm, the maximum classification accuracy, using the SVM classifier and a two-muscle set, was as high as 99%. The optimal feature set consisted of features such as autoregressive coefficients, Willison amplitude, and slope sign change. The mean classification accuracy across subjects, achieved using SVM and RF, was 94.5% and 91.7% respectively.