Phenotype prediction from genome-wide association studies: application to smoking behaviors

Abstract

Background: The great success of genome-wide association studies (GWAS) has shifted attention toward the personal genome and clinical applications such as diagnosis and disease risk prediction. However, previous prediction studies using known disease-associated loci have not been successful (area under the curve 0.55–0.68 for type 2 diabetes and coronary heart disease). There are several reasons for this poor predictability, such as the small number of known disease-associated loci, simple analyses that do not account for phenotype complexity, and the limited number of features used for prediction.

Methods: In this study, we thoroughly investigated the effect of feature selection and of the prediction algorithm on prediction performance. In particular, we considered the following feature selection and prediction methods: regression analysis, regularized regression analysis, linear discriminant analysis, non-linear support vector machine, and random forest. For these methods, we studied the effects of feature selection and of the number of features on prediction. Our investigation was based on the analysis of 8,842 Korean individuals genotyped on the Affymetrix SNP Array 5.0, with smoking behaviors as the predicted phenotype.

Results: To observe the effect of feature selection methods on prediction performance, selected features were used for prediction and the area under the curve (AUC) was measured. For feature selection, support vector machine (SVM) and elastic net (EN) performed better than linear discriminant analysis (LDA), random forest (RF), and simple logistic regression (LR). For prediction, SVM showed the best performance based on the AUC. With fewer than 100 SNPs, EN was the best prediction method, while SVM was the best when more than 400 SNPs were used.

Conclusions: Across combinations of feature selection and prediction methods, SVM showed the best performance for both feature selection and prediction.
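The workflow described above (select the top-k SNPs with a feature-selection method, predict with a classifier, score by AUC) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, with a univariate F-test standing in for the paper's selectors and synthetic genotypes standing in for the Korean cohort; all sizes and parameters are illustrative.

```python
# Minimal sketch: feature selection over SNPs, then SVM prediction scored
# by AUC, varying the number of selected SNPs (hypothetical data/settings).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, p = 600, 1000                                  # individuals x SNPs
X = rng.integers(0, 3, size=(n, p)).astype(float)  # 0/1/2 genotype codes
effect = np.zeros(p)
effect[:20] = 0.5                                 # only 20 SNPs carry signal
risk = X @ effect + rng.normal(0.0, 2.0, n)
y = (risk > np.median(risk)).astype(int)          # binary "smoker" label

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

aucs = {}
for k in (50, 400):                               # few vs. many selected SNPs
    sel = SelectKBest(f_classif, k=k).fit(X_tr, y_tr)
    clf = SVC(kernel="rbf").fit(sel.transform(X_tr), y_tr)
    aucs[k] = roc_auc_score(y_te, clf.decision_function(sel.transform(X_te)))
print(aucs)
```

Fitting the selector on the training split only, as here, keeps test individuals out of the SNP-ranking step.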

Similar Papers
  • Research Article
  • Cited by 54
  • 10.1002/mp.12258
Differentiation of fat-poor angiomyolipoma from clear cell renal cell carcinoma in contrast-enhanced MDCT images using quantitative feature classification.
  • Jun 9, 2017
  • Medical Physics
  • Han Sang Lee + 4 more

To develop a computer-aided classification system to differentiate benign fat-poor angiomyolipoma (fp-AML) from malignant clear cell renal cell carcinoma (ccRCC) using quantitative feature classification of histogram and texture patterns from contrast-enhanced multidetector computed tomography (CE MDCT) images. A dataset of 50 CE MDCT images from 25 fp-AML and 25 ccRCC patients was used. From these images, the tumors were manually segmented by an expert radiologist to define the regions of interest (ROIs). A feature classification system was proposed for separating the two types of renal masses, using histogram and texture features and machine learning classifiers. First, 64 quantitative image features were extracted from each ROI, including histogram features based on basic histogram characteristics, percentages of pixels above thresholds, and percentile intensities, and texture features based on gray-level co-occurrence matrices (GLCM), gray-level run-length matrices (GLRLM), and local binary patterns (LBP). A number of feature selection methods, including stepwise feature selection (SFS), ReliefF selection, and principal component analysis (PCA) transformation, were applied to select the group of useful features. Finally, feature classifiers, including logistic regression, k-nearest neighbors (kNN), support vector machine (SVM), and random forest (RF), were trained on the selected features to differentiate benign fp-AML from malignant ccRCC. Each combination of feature selection and classification methods was tested using fivefold cross-validation and evaluated using accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the receiver operating characteristic curve (AUC). In feature selection, the features commonly selected by the different feature selection methods were assessed.
From three selection methods, three histogram features including maximum intensity, percentages of pixels above the thresholds 210 and 230, and one texture feature of GLCM sum entropy, were jointly selected as key features to distinguish two types of renal masses. In feature classification, kNN and SVM classifiers with ReliefF feature selection demonstrated the best performance among other choices of feature selection and classification methods, where ReliefF+kNN and ReliefF+SVM achieved the accuracy of 72.3±4.6% and 72.1±4.2%, respectively. We propose a computer-aided classification system for distinguishing fp-AML from ccRCC using machine learning classifiers with quantitative texture features. Our contribution is to investigate the proper combination between the quantitative features and classification systems on the CE MDCT images. In experiments, it can be demonstrated that (a) the features based on histogram characteristics on bright intensity region and texture patterns on inhomogeneity inside masses were selected as key features to classify fp-AML and ccRCC, and (b) the proper combination of feature selection and classification methods achieved high performance in differentiating benign from malignant masses. The proposed classification system can be used to assess the useful features associated with the malignancy for renal masses in CE MDCT images.
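The combination testing described above (a feature selector paired with a classifier, evaluated by fivefold cross-validation) can be sketched as below. ReliefF is not part of scikit-learn, so a mutual-information filter stands in for it, and synthetic data stand in for the 50-ROI histogram/texture features; the pairing and fold count mirror the abstract.

```python
# Fivefold cross-validation of a filter-selection + kNN pipeline
# (mutual information stands in for ReliefF; data are synthetic).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=50, n_features=64, n_informative=4,
                           random_state=1)        # 50 ROIs x 64 features
pipe = make_pipeline(
    SelectKBest(mutual_info_classif, k=4),        # pick 4 key features per fold
    KNeighborsClassifier(n_neighbors=5),
)
acc = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
print(round(acc, 3))
```

Wrapping the selector inside the pipeline keeps feature selection within each training fold, which avoids leaking test-fold information into the chosen features.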

  • Research Article
  • Cited by 68
  • 10.1016/j.procs.2013.10.003
A Comparative Study of Feature Selection and Classification Methods for Gene Expression Data of Glioma
  • Jan 1, 2013
  • Procedia Computer Science
  • Heba Abusamra

  • Research Article
  • Cited by 20
  • 10.3390/cancers14122922
Analysis of Cross-Combinations of Feature Selection and Machine-Learning Classification Methods Based on [18F]F-FDG PET/CT Radiomic Features for Metabolic Response Prediction of Metastatic Breast Cancer Lesions
  • Jun 14, 2022
  • Cancers
  • Ober Van Gómez + 6 more

Simple Summary: Breast cancer is a leading cause of morbidity and mortality worldwide. Metastatic disease is largely responsible for cancer deaths, and its treatment usually involves different therapies. In this context, predicting the response to treatment, or depicting treatment-resistant phenotypes, is essential in clinical practice, especially in the new era of precision medicine. In this study, we used several combinations of feature selection methods and machine-learning classifiers to construct models predicting the metabolic response of metastatic breast cancer lesions to treatment. These models were based on clinical variables and radiomic features extracted from 2-deoxy-2-[18F]fluoro-D-glucose positron emission tomography/computed tomography ([18F]F-FDG PET/CT) images obtained prior to treatment. Our main goal was to determine whether this prediction was feasible and to identify the combinations with the best predictive performance. We found that several combinations successfully predicted the metabolic response to treatment, of which least absolute shrinkage and selection operator (Lasso) + support vector machines (SVM) had the best mean area under the curve in both training and validation cohorts. Model performance depended largely on the selected combination.

Background: This study aimed to identify optimal combinations of feature selection methods and machine-learning classifiers for predicting the metabolic response of individual metastatic breast cancer lesions, based on clinical variables and radiomic features extracted from pretreatment [18F]F-FDG PET/CT images. Methods: A total of 48 patients with confirmed metastatic breast cancer, who received different treatments, were included. All patients had an [18F]F-FDG PET/CT scan before and after treatment. Of the 228 metastatic lesions identified, 127 were categorized as responders (complete or partial metabolic response) and 101 as non-responders (stable or progressive metabolic response), using the percentage change in SULpeak (peak standardized uptake value normalized for lean body mass). The lesion pool was divided into training (n = 182) and testing (n = 46) cohorts; for each lesion, 101 image features were extracted from both PET and CT (202 features per lesion). These features, along with clinical and pathological information, were used to construct prediction models from seven popular feature selection methods in cross-combination with seven machine-learning (ML) classifiers. The performance of the different models was investigated with receiver operating characteristic (ROC) analysis, using the area under the curve (AUC) and accuracy (ACC) as metrics. Results: The combinations of least absolute shrinkage and selection operator (Lasso) + support vector machines (SVM) and Lasso + random forest (RF) had the highest AUC in cross-validation, with 0.93 ± 0.06 and 0.92 ± 0.03, respectively, whereas Lasso + neural network (NN), Lasso + SVM, and mutual information (MI) + RF had the highest AUC/ACC in the validation cohort, with 0.90/0.72, 0.86/0.76, and 0.87/0.85, respectively. On average, models with Lasso and models with SVM had the best mean AUC and ACC in both training and validation cohorts. Conclusions: Image features obtained from a pretreatment [18F]F-FDG PET/CT, along with clinical variables, could predict the metabolic response of metastatic breast cancer lesions when incorporated into predictive models, whose performance depends on the selected combination of feature selection method and ML classifier.
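The best-performing combination reported above, Lasso-based selection followed by an SVM, can be sketched as below; an L1-penalized logistic model plays the Lasso role, and synthetic data stand in for the 202 PET/CT radiomic features, so the numbers are illustrative only.

```python
# Lasso-style (L1) feature selection feeding an SVM, scored by AUC
# under fivefold cross-validation (synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=228, n_features=202, n_informative=10,
                           random_state=0)        # 228 lesions x 202 features
l1_selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5))
pipe = make_pipeline(StandardScaler(), l1_selector, SVC(kernel="rbf"))
auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
print(round(auc, 3))
```

The L1 penalty zeroes out the coefficients of uninformative features, so `SelectFromModel` keeps only the surviving subset for the SVM.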

  • Conference Article
  • 10.1109/dsaa49011.2020.00045
Joint Bayesian Variable Selection and Graph Estimation for Non-linear SVM with Application to Genomics Data
  • Oct 1, 2020
  • Wenli Sun + 2 more

Support vector machine (SVM) is a powerful classification tool for analysis of high dimensional data such as genomics. Regularized linear and nonlinear SVM methods with feature selection have been developed. On the other hand, there is a growing body of literature showing that incorporating prior biological knowledge such as functional genomics, which are typically represented by graphs, into the analysis of genomic data can improve feature selection and prediction. In practice, however, such biological knowledge can often be inaccurate or unavailable. To attack this problem, we propose a Bayesian modeling approach which enables us to learn the graph structure among features and perform feature selection simultaneously. Our approach employs a Gaussian graphical model for inferring the graphical information and exploits the inferred graph to guide feature selection for SVM. An efficient MCMC algorithm is developed and our numerical analysis demonstrates that the proposed method has advantages over existing methods in feature selection and prediction via simulations and an application to the analysis of glioblastoma patient data.

  • Book Chapter
  • Cited by 12
  • 10.1007/978-3-030-22871-2_66
Performance Analysis of Feature Selection Methods for Classification of Healthcare Datasets
  • Jan 1, 2019
  • Omesaad Rado + 4 more

Classification analysis is widely used to enhance the quality of healthcare applications by analysing data and discovering hidden patterns and relationships between features, which can be used to support medical diagnostic decisions and improve the quality of patient care. A healthcare dataset may contain irrelevant, redundant, and noisy features; applying classification algorithms to such data may produce less accurate and less understandable results. Therefore, selecting optimal features has a significant influence on the accuracy of classification systems. Feature selection is an effective data pre-processing technique in data mining that can be used to identify a minimal set of features. It has immediate effects on speeding up classification algorithms and improving performance measures such as predictive accuracy. This paper aims to evaluate the performance of five different classification methods (C5.0, Rpart, k-nearest neighbor (KNN), Support Vector Machines (SVM), and Random Forest (RF)) with three different feature selection methods (correlation-based feature selection, variable-importance selection, and recursive feature elimination) on seven relevant numerical and mixed healthcare datasets. Ten-fold cross-validation is used to evaluate classification performance. The experiments showed that the effect of feature selection methods on the performance of classification techniques varies.
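A comparison of this kind (several classifiers, a shared feature-selection step, ten-fold cross-validation) can be sketched as follows; an F-test filter stands in for the chapter's correlation-based selector, and scikit-learn's breast-cancer benchmark stands in for the healthcare datasets.

```python
# Ten-fold cross-validation of three classifiers on filter-selected features.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
results = {}
for name, clf in [("KNN", KNeighborsClassifier()),
                  ("SVM", SVC()),
                  ("RF", RandomForestClassifier(random_state=0))]:
    pipe = make_pipeline(StandardScaler(),
                         SelectKBest(f_classif, k=10),  # keep 10 of 30 features
                         clf)
    results[name] = cross_val_score(pipe, X, y, cv=10).mean()
print(results)
```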

  • Research Article
  • Cited by 1
  • 10.1186/s12911-022-02051-w
Identification of clinical factors related to prediction of alcohol use disorder from electronic health records using feature selection methods
  • Nov 23, 2022
  • BMC Medical Informatics and Decision Making
  • Ali Ebrahimi + 5 more

Background: High dimensionality in electronic health records (EHRs) causes a significant computational problem for any systematic search for predictive, diagnostic, or prognostic patterns. Feature selection (FS) methods have been shown to be effective both for feature reduction and for identifying risk factors related to the prediction of clinical disorders. This paper examines the prediction of patients with alcohol use disorder (AUD) using machine learning (ML) and attempts to identify risk factors related to the diagnosis of AUD. Methods: We built an FS framework consisting of two operational levels: base selectors and ensemble selectors. The first level consists of five FS methods: three filter methods, one wrapper method, and one embedded method. Base selector outputs are aggregated to develop four ensemble FS methods. The outputs of each FS method were then fed into three ML algorithms, support vector machine (SVM), K-nearest neighbor (KNN), and random forest (RF), to compare them and identify the best feature subset for predicting AUD from EHRs. Results: In terms of feature reduction, the embedded FS method significantly reduced the number of features, from 361 to 131. In terms of classification performance, RF based on the 272 features selected by our proposed ensemble method (Union FS) achieved the highest accuracy in predicting patients with AUD, 96%, and outperformed all other models in AUROC, AUPRC, precision, recall, and F1-score. Considering the limitations of embedded and wrapper methods, the best overall performance was achieved by our proposed Union Filter FS, which reduced the number of features to 223 and improved precision, recall, and F1-score in RF from 0.77, 0.65, and 0.71 to 0.87, 0.81, and 0.84, respectively. Our findings indicate that, besides gender, age, and length of hospital stay, diagnoses related to the digestive organs, bones, muscles and connective tissue, and the nervous system are important clinical factors for predicting patients with AUD. Conclusion: Our proposed FS method improved classification performance significantly. It could identify clinical factors related to the prediction of AUD from EHRs, thereby helping clinical staff identify and treat AUD patients and improving medical knowledge of the condition. Moreover, the diversity of features among female and male patients, as well as gender disparity, were investigated using FS methods and ML techniques.
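The ensemble idea above (aggregate the subsets chosen by several base selectors, then classify with RF) can be sketched as a union over selector outputs; the two base selectors and all sizes here are illustrative stand-ins, not the paper's exact five methods.

```python
# "Union FS" sketch: union of feature subsets from two base selectors,
# then a random forest trained on the combined subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=361, n_informative=15,
                           random_state=0)             # 361 EHR-style features
selected = set()
for score_fn in (f_classif, mutual_info_classif):      # base selectors
    sel = SelectKBest(score_fn, k=50).fit(X, y)
    selected |= set(np.flatnonzero(sel.get_support())) # union of subsets
idx = sorted(selected)
acc = cross_val_score(RandomForestClassifier(random_state=0),
                      X[:, idx], y, cv=5).mean()
print(len(idx), round(acc, 3))
```

Note: selecting on the full dataset before cross-validation, as here, is only for brevity; a leak-free version would nest the union step inside each fold.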

  • Research Article
  • Cited by 46
  • 10.1186/s12911-017-0434-4
Automatic migraine classification via feature selection committee and machine learning techniques over imaging and questionnaire data
  • Apr 13, 2017
  • BMC medical informatics and decision making
  • Yolanda Garcia-Chimeno + 4 more

Background: Feature selection methods are commonly used to identify subsets of relevant features to facilitate the construction of classification models, yet little is known about how feature selection methods perform on diffusion tensor images (DTIs). In this study, feature selection and machine learning classification methods were tested for the purpose of automating the diagnosis of migraines using both DTIs and questionnaire answers related to emotion and cognition, factors that influence pain perception. Methods: We selected 52 adult subjects for the study, divided into three groups: a control group (15), subjects with sporadic migraine (19), and subjects with chronic migraine and medication overuse (18). These subjects underwent magnetic resonance imaging with diffusion tensor imaging to assess the white-matter pathway integrity of the regions of interest involved in pain and emotion. The questionnaires also gathered data about pathology. The DTI images and test results were then fed into feature selection algorithms (Gradient Tree Boosting, L1-based, Random Forest, and Univariate) to reduce the features of the first dataset, and into classification algorithms (Support Vector Machine (SVM), Boosting (AdaBoost), and Naive Bayes) to classify the migraine group. Moreover, we implemented a committee method, based on the feature selection algorithms, to improve classification accuracy. Results: When classifying the migraine group, the greatest improvements in accuracy were made using the proposed committee-based feature selection method. Using this approach, the accuracy of classification into the three types improved from 67% to 93% with the Naive Bayes classifier, from 90% to 95% with the support vector machine classifier, and from 93% to 94% with boosting. The features determined to be most useful for classification were related to pain, analgesics, and the left uncinate fasciculus (connected with pain and emotion). Conclusions: The proposed feature selection committee method improved the performance of migraine diagnosis classifiers compared to individual feature selection methods, producing a robust system that achieved over 90% accuracy with all classifiers. The results suggest that the proposed methods can be used to support specialists in the classification of migraines in patients undergoing magnetic resonance imaging.

  • Research Article
  • Cited by 55
  • 10.3390/sym12071147
Impact of Feature Selection Methods on the Predictive Performance of Software Defect Prediction Models: An Extensive Empirical Study
  • Jul 9, 2020
  • Symmetry
  • Abdullateef O Balogun + 9 more

Feature selection (FS) is a feasible solution for mitigating the high-dimensionality problem, and many FS methods have been proposed in the context of software defect prediction (SDP). However, empirical studies on the impact and effectiveness of FS methods on SDP models often lead to contradictory experimental results and inconsistent findings. These contradictions can be attributed to study limitations such as small datasets, limited FS search methods, and unsuitable prediction models within the respective scopes of the studies. It is hence critical to conduct an extensive empirical study that addresses these contradictions, to guide researchers and buttress the scientific tenacity of experimental conclusions. In this study, we investigated the impact of 46 FS methods using Naïve Bayes and Decision Tree classifiers over 25 software defect datasets from 4 software repositories (NASA, PROMISE, ReLink, and AEEEM). The ensuing prediction models were evaluated based on accuracy and AUC values. The Scott–KnottESD and the novel Double Scott–KnottESD rank statistical methods were used for statistical ranking of the studied FS methods. The experimental results showed that there is no single best FS method, as their respective performances depend on the choice of classifier, performance evaluation metric, and dataset. However, we recommend the use of statistical-based, probability-based, and classifier-based filter feature ranking (FFR) methods, respectively, in SDP. For filter subset selection (FSS) methods, correlation-based feature selection (CFS) with metaheuristic search methods is recommended. For wrapper feature selection (WFS) methods, the IWSS-based WFS method is recommended, as it outperforms the conventional SFS and LHS-based WFS methods.

  • Research Article
  • 10.2174/0113892002268739231211063718
Drug-Protein Interactions Prediction Models Using Feature Selection and Classification Techniques.
  • Dec 1, 2023
  • Current drug metabolism
  • T Idhaya + 2 more

Drug-Protein Interaction (DPI) identification is crucial in drug discovery. The high dimensionality of drug and protein features poses challenges for accurate interaction prediction, necessitating the use of computational techniques. Docking-based methods rely on 3D structures, while ligand-based methods have limitations such as reliance on known ligands and neglect of protein structure. Therefore, the preferred approach is the chemogenomics-based approach using machine learning, which considers both drug and protein characteristics for DPI prediction. In machine learning, feature selection plays a vital role in improving model performance, reducing overfitting, enhancing interpretability, and making the learning process more efficient. It helps extract meaningful patterns from drug and protein data while eliminating irrelevant or redundant information, resulting in more effective machine-learning models. Classification, in turn, is of great importance as it enables pattern recognition, decision-making, predictive modeling, anomaly detection, data exploration, and automation. It empowers machines to make accurate predictions and facilitates efficient decision-making in DPI prediction. For this research, protein data was sourced from the KEGG database, while drug data was obtained from the DrugBank database. To address the issue of imbalanced Drug-Protein Pairs (DPP), different balancing techniques such as Random Over-Sampling (ROS), the Synthetic Minority Over-sampling Technique (SMOTE), and Adaptive SMOTE were employed. Given the large number of features associated with drugs and proteins, feature selection becomes necessary. Various feature selection methods were evaluated: Correlation, Information Gain (IG), Chi-Square (CS), and Relief. Multiple classification methods, including Support Vector Machines (SVM), Random Forest (RF), AdaBoost, and Logistic Regression (LR), were used to predict DPI.
Finally, this research identifies the best balancing, feature selection, and classification methods for accurate DPI prediction. This comprehensive approach aims to overcome the limitations of existing methods and provide more reliable and efficient predictions in drug-protein interaction studies.

  • Conference Article
  • 10.3990/2.378
Value of feature reduction for crop differentiation using multi-temporal imagery, machine learning, and object-based image analysis
  • Jan 1, 2016
  • J.K Gilbertson + 1 more

This study examined the value of automated and manual feature selection, when applied to machine learning and object-based image analysis (OBIA), for the differentiation of crops in a Mediterranean climate. Five Landsat 8 images covering the phenological stages of seven major crop types in the study area (Cape Winelands, South Africa) were acquired and processed. A statistical image fusion technique was used to enhance the spatial resolution of the imagery. The pan-sharpened imagery was used to produce a range of spectral features, textural measures, indices, and colour transformations, after which it was segmented using the multi-resolution segmentation (MRS) algorithm. The entire set of 205 features (41 per image capture date) was then subjected to different feature selection and reduction methods. These included manual feature removal (i.e. grouping into semantic themes), filter methods (such as classification and regression trees (CART) and random forest (RF)), and statistical principal components analysis (PCA). The experiments were carried out in two scenarios: 1) on all input images in combination; and 2) on each individual image date. The feature subsets were used as input to decision tree (DT), k-nearest neighbour (k-NN), support vector machine (SVM), and random forest (RF) machine learning classifiers. To assess the value of each feature reduction method, overall accuracy, the kappa coefficient, and McNemar's test were employed to evaluate classification accuracy and compare the results. The results show that feature selection was able to improve the overall crop identification accuracy for the DT, k-NN, and RF classifiers, but was unable to do so for SVM. SVM scored the highest overall accuracy and kappa coefficient, even without feature reduction or selection.
Based on these results it was concluded that, although feature selection can aid the crop differentiation process, it is not a necessity.
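PCA-based feature reduction ahead of an SVM, one of the reduction strategies compared above, can be sketched like this; synthetic data stand in for the 205 Landsat-derived object features and the seven crop classes.

```python
# PCA feature reduction followed by SVM classification of seven classes.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=350, n_features=205, n_informative=12,
                           n_classes=7, n_clusters_per_class=1, random_state=0)
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=20),   # 205 features -> 20 components
                     SVC())
acc = cross_val_score(pipe, X, y, cv=5).mean()
print(round(acc, 3))
```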

  • Research Article
  • Cited by 14
  • 10.1007/s11053-018-9422-3
Geochemical Prospectivity Mapping Through a Feature Extraction–Selection Classification Scheme
  • Oct 26, 2018
  • Natural Resources Research
  • Hamid Zekri + 3 more

Machine learning (ML) schemes can enhance success in geochemical prospectivity mapping. This study has examined the effectiveness of several feature extraction or selection approaches, using a variety of ML algorithms applied to multielement soil and lithogeochemical data, to identify new prospective Pb–Zn mineralisation in the Irankuh area. Singular value decomposition (SVD) was used as a dimensionality reduction technique to remove noise in the geochemical data. This was followed by application of feature selection techniques including filter-based methods such as principal component analysis (PCA), Pearson’s correlation coefficient (PCC), correlation-based feature selection (CFS), information gain ratio (IGR) and wrapper models, in combination with support vector machines, logistic regression and random forests analysis. The performance of the ML algorithms, assisted by feature extraction and selection methods, was subsequently assessed using a 10-fold cross-validation of separate training and testing data subsets. SVD boosted the performance of support vector machines, logistic regression and random forests. The ML algorithms are particularly effective when using two transformed principal components that are linked to a suite of elements associated with the sulphide mineralisation and variations in the host lithologies. PCA and PCC techniques generally suit support vector machines as the most effective feature selection methods. Logistic regression provided a better classification with PCA, IGR and a wrapper model. However, random forests delivered more accurate outcomes using PCA and PCC techniques. A geochemical prospectivity map of the study area has been derived from support vector machines, trained with two principal components as the best performing ML scheme, and has generated three new and distinct targets for more detailed exploration.

  • Research Article
  • Cited by 3
  • 10.17485/ijst/v16i10.2102
Feature Selection Techniques in Learning Algorithms to Predict Truthful Data
  • Mar 12, 2023
  • Indian Journal Of Science And Technology
  • P Usha + 1 more

Objectives: This review covers various feature selection processes, strategies, and methods, such as filter, wrapper, and embedded algorithms, and presents their advantages and disadvantages. Methods: Algorithms such as Mutual Information Gain (MIG), Chi-Square (CS), and Recursive Feature Elimination (RFE) are used to select features. In this review, two benchmark datasets are used: breast cancer and diabetes. Findings: To improve efficiency, selecting appropriate feature selection methods and algorithms is most important. To measure the performance of the selected features, a Random Forest model was used as the classifier and compared with Support Vector Machine and Decision Tree models. The filter methods select up to 15 of 17 features for the diabetes dataset with 89% to 98% accuracy, and up to 28 of 31 features for the breast cancer dataset with 98.5% accuracy. The wrapper method RFE selects 14 of 17 features for diabetes and 10 of 31 features for breast cancer, achieving up to 98.25% accuracy for diabetes and 99.20% for breast cancer. Novelty: Feature selection techniques help improve performance and efficiency, decrease storage and processing time, and build better models for further prediction. Proper feature selection helps diagnose diseases at an earlier stage and improves patient survival. Keywords: Mutual Information Gain; Chi-Square; Recursive Feature Elimination; Support Vector Machine; Random Forest; Decision Tree
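The RFE step reported above (10 of 31 features for breast cancer) can be sketched on scikit-learn's 30-feature breast-cancer benchmark, a close stand-in for the dataset used in the review; the ranking estimator and the downstream classifier are illustrative choices.

```python
# Recursive feature elimination to 10 features, then a random forest
# scored on the reduced set by ten-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)
acc = cross_val_score(RandomForestClassifier(random_state=0),
                      X[:, rfe.support_], y, cv=10).mean()
print(int(rfe.support_.sum()), round(acc, 3))
```

RFE repeatedly refits the estimator and drops the weakest-ranked feature until only the requested number remains.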

  • Research Article
  • Cited by 9
  • 10.17576/jsm-2021-5003-17
Predicting 30-Day Mortality after an Acute Coronary Syndrome (ACS) using Machine Learning Methods for Feature Selection, Classification and Visualisation
  • Mar 31, 2021
  • Sains Malaysiana
  • Nanyonga Aziida + 4 more

Hybrid combinations of feature selection, classification, and visualisation using machine learning (ML) methods have the potential to enhance understanding and 30-day mortality prediction for patients with cardiovascular disease using population-specific data. Identifying a feature selection method and classifier algorithm that together produce high performance in mortality studies is essential and has not been reported before. Feature selection methods such as Boruta, Random Forest (RF), Elastic Net (EN), Recursive Feature Elimination (RFE), learning vector quantization (LVQ), Genetic Algorithm (GA), Cluster Dendrogram (CD), Support Vector Machine (SVM), and Logistic Regression (LR) were combined with RF, SVM, LR, and EN classifiers for 30-day mortality prediction. ML models were constructed using 302 patients and 54 input variables from the Malaysian National Cardiovascular Disease Database. Validation of the best ML model was performed against Thrombolysis in Myocardial Infarction (TIMI) using an additional dataset of 102 patients. The Self-Organising Feature Map (SOM) was used to visualise mortality-related factors post-ACS. The performance of the ML models, measured by the area under the curve (AUC), ranged from 0.48 to 0.80. The best-performing model (AUC = 0.80) was a hybrid combination of the RF variable importance method, sequential backward selection, and the RF classifier using five predictors (age, triglyceride, creatinine, troponin, and total cholesterol). Comparison with TIMI using the additional dataset resulted in the best ML model outperforming the TIMI score (AUC = 0.75 vs. AUC = 0.60). The findings of this study will provide a basis for developing an online ML-based population-specific risk scoring calculator.
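The winning hybrid above (RF variable importance feeding an RF classifier on five predictors) can be sketched as follows; synthetic data stand in for the 54 registry variables, and the plain top-5 cut is a simplification of the paper's sequential backward selection.

```python
# Rank variables by random-forest importance, keep the top five,
# then score an RF classifier on those predictors by AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=302, n_features=54, n_informative=5,
                           random_state=0)         # 302 patients x 54 variables
rf = RandomForestClassifier(random_state=0).fit(X, y)
top5 = np.argsort(rf.feature_importances_)[-5:]    # indices of 5 best predictors
auc = cross_val_score(RandomForestClassifier(random_state=0),
                      X[:, top5], y, cv=5, scoring="roc_auc").mean()
print(sorted(top5.tolist()), round(auc, 3))
```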

  • Research Article
  • 10.1158/1538-7445.am2025-1107
Abstract 1107: Feature selection and machine learning strategies optimize an affordable molecular assay for cholangiocarcinoma subtype
  • Apr 21, 2025
  • Cancer Research
  • Ellen Larson + 7 more

Background: Cholangiocarcinoma is a molecularly heterogeneous cancer arising from the biliary epithelium. Individual omic techniques have identified clinically relevant, targetable subtypes, but most tumors have no targetable alterations based on exome or transcriptomic evaluation. Integrated multiomics approaches incorporating transcriptomic, proteomic, and phosphoproteomic characterization provide deeper understanding and identify non-mutated, activated pathways that can be targeted. To optimize the potential clinical utility of these approaches, minimal classifier features must be defined. Methods: Approximately 60,000 RNA, protein, and phosphoprotein features were extracted from 35 cholangiocarcinomas treated at the primary US study site. 206 patients treated and profiled at an institution in China were also included. Using the Multiomics Factor Analysis clustering method, patients were sorted into 3 distinct molecular profiles. Important molecular features were selected using competing feature selection (FS) methods: Boruta, RReliefF, variable selection using random forests (RF), recursive feature elimination (RFE), median decrease accuracy in RF, and Concrete Autoencoders. Machine learning classification models, including Extreme Gradient Boosting, gradient-boosted trees, multinomial logit, and support vector machine (SVM), were tested on the output of each FS method. Each combination of FS method and predictive model was tested with 10 replicates of 10-fold cross-validation on a randomly selected, balanced test set and an oversampled, balanced training subset. Accuracy, F1 score, AUC, precision, and recall were assessed for each combination. A custom function of physical assay cost and classification accuracy was optimized to find the best assay. KEGG pathway overrepresentation analysis was used to analyze feature subsets.
Results: A consistent set of 50 important RNA and protein features was identified, including previously reported prognostic factors, such as SpryD4 protein expression, as well as previously unreported pathway surrogates, such as Slc2A2 or Pipox protein. Pathway overrepresentation analysis found that insulin signaling/resistance and amino acid biosynthesis/metabolism pathways were overrepresented in the discriminative feature set. The gradient-boosted trees algorithm combined with the RFE FS method had the best performance, with a multiclass macro-average (±sd) cross-validation accuracy of 89% (±6%), F1 score of 89% (±8%), AUC of 0.98 (±0.02), precision of 90% (±10%), and recall of 89% (±6%).

Conclusions: Feature selection methods and predictive machine learning models work in concert to develop accurate molecular subtype assays based on just 50 protein and RNA levels. This strategy has the potential to make molecular subtyping for cholangiocarcinoma and other molecularly complex cancers fast and widely available. Citation Format: Ellen Larson, Erik Jessen, Dong-Gi Mun, Jennifer Tomlinson, Amro Abdelrahman, Danielle Carlson, Hojjat Salehinejad, Rory Smoot. Feature selection and machine learning strategies optimize an affordable molecular assay for cholangiocarcinoma subtype [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2025; Part 1 (Regular Abstracts); 2025 Apr 25-30; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2025;85(8_Suppl_1):Abstract nr 1107.
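The FS × classifier grid with repeated cross-validation described above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the study's actual pipeline: the feature selectors, classifiers, feature counts, and dataset sizes here are all stand-in assumptions.

```python
# Hedged sketch: score each (feature-selection, classifier) combination by
# repeated stratified k-fold cross-validation, loosely mirroring the FS x
# model grid in the abstract. Synthetic data stands in for the multiomics
# features; selector/model choices and counts are illustrative only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# 3-class toy problem echoing the 3 molecular profiles
X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                           n_classes=3, random_state=0)

selectors = {
    "rfe": RFE(LogisticRegression(max_iter=1000), n_features_to_select=10),
    "kbest": SelectKBest(f_classif, k=10),
}
models = {
    "svm": SVC(kernel="rbf"),
    "logit": LogisticRegression(max_iter=1000),  # multinomial logit analogue
}

# fewer repeats than the study's 10 replicates, for brevity
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
scores = {}
for fs_name, fs in selectors.items():
    for m_name, model in models.items():
        pipe = Pipeline([("select", fs), ("clf", model)])
        acc = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
        scores[(fs_name, m_name)] = acc.mean()

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Running selection inside the cross-validation pipeline (rather than once on the full data) is what keeps the reported accuracy honest, since the selector never sees the held-out fold.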

  • Research Article
  • Cited by 45
  • 10.1007/s00330-020-06768-y
Classification of pulmonary lesion based on multiparametric MRI: utility of radiomics and comparison of machine learning methods.
  • Mar 28, 2020
  • European Radiology
  • Xinhui Wang + 4 more

We develop and validate a radiomics model based on multiparametric magnetic resonance imaging (MRI) for the classification of pulmonary lesions and identify optimal machine learning methods. This retrospective analysis included 201 patients (143 malignancies, 58 benign lesions). Radiomics features were extracted from multiparametric MRI, including T2-weighted imaging (T2WI), T1-weighted imaging (T1WI), and the apparent diffusion coefficient (ADC) map. Three feature selection methods, recursive feature elimination (RFE), the t test, and least absolute shrinkage and selection operator (LASSO), and three classification methods, linear discriminant analysis (LDA), support vector machine (SVM), and random forest (RF), were used to distinguish benign from malignant pulmonary lesions. Performance was compared by AUC, sensitivity, accuracy, precision, and specificity. Analysis of performance differences across three randomly drawn cross-validation sets verified the stability of the results. For most single MR sequences or combinations of multiple MR sequences, the RFE feature selection method with the SVM classifier had the best performance, followed by RFE with RF. The radiomics model based on multiple sequences showed higher diagnostic accuracy than any single sequence for every machine learning method. Using RFE with SVM, the joint model of T1WI, T2WI, and ADC showed the highest performance, with AUC = 0.88 ± 0.02 (sensitivity 83%; accuracy 82%; precision 91%; specificity 79%) in the test set. Quantitative radiomics features based on multiparametric MRI perform well in differentiating lung malignancies from benign lesions. The machine learning method of RFE with SVM is superior to the other feature selection and classifier combinations.
  • Radiomics approaches have the potential to distinguish between benign and malignant pulmonary lesions.
  • A radiomics model based on multiparametric MRI performs better than single-sequence models.
  • The machine learning combination of RFE with SVM performs best in the current cohort.
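The winning RFE-with-SVM pairing can be sketched as a single scikit-learn pipeline evaluated by AUC. This is a toy illustration under stated assumptions: synthetic features replace the real multiparametric-MRI radiomics features, and the linear kernel, selected-feature count, and fold count are choices made for the sketch, not taken from the paper.

```python
# Hedged sketch of RFE + SVM for benign-vs-malignant classification:
# a linear SVM's weights drive recursive feature elimination, and the
# surviving features feed the final SVM classifier. Cohort size and the
# ~143:58 class balance echo the abstract; everything else is synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# 201 "patients", imbalanced roughly like 58 benign vs 143 malignant
X, y = make_classification(n_samples=201, n_features=100, n_informative=10,
                           weights=[0.29, 0.71], random_state=42)

pipe = Pipeline([
    ("rfe", RFE(SVC(kernel="linear"), n_features_to_select=15)),
    ("svm", SVC(kernel="linear")),  # decision_function supplies AUC scores
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
auc = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(round(auc.mean(), 3))
```

Stratified folds matter here: with a 143:58 class split, unstratified folds could leave a split with very few benign cases and an unstable AUC estimate.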
