Data-driven logistic regression ensembles with applications in genomics


Similar Papers
  • Research Article
  • 10.47065/josyc.v5i3.5168
Comparative Analysis of Machine Learning Models for Classifying Human DNA Sequences: Performance Metrics and Strategic Recommendations
  • May 31, 2024
  • Journal of Computer System and Informatics (JoSYC)
  • Gregorius Airlangga

This study presents a comprehensive evaluation of seven machine learning models applied to the classification of human DNA sequences, highlighting their performance and potential applications in genomics. We explored Logistic Regression, Support Vector Machines (SVM), Random Forest, Decision Trees, Gradient Boosting, Naive Bayes, and XGBoost, using a 5-fold StratifiedKFold cross-validation method to ensure robustness and reliability in our findings. Naive Bayes demonstrated exceptional performance with near-perfect accuracy, precision, recall, and F1 scores, suggesting its suitability for rapid and efficient genomic classification. Logistic Regression also showed high efficacy, proving effective even in multi-class classifications of complex genetic data. Conversely, Decision Trees and SVM struggled with overfitting and computational efficiency, respectively, indicating the need for careful parameter tuning and optimization in practical applications. The study addresses these challenges and proposes strategies for enhancing model robustness and computational efficiency, such as advanced regularization techniques and hybrid modeling approaches. These insights not only aid in selecting appropriate models for specific genomic tasks but also pave the way for future research into integrating machine learning with genomic science to advance personalized medicine and genetic research. The findings encourage ongoing refinement of these models to unlock further potential in genomic applications.
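As a sketch of the evaluation protocol this abstract describes (5-fold StratifiedKFold comparison of classifiers), here is a minimal scikit-learn version on synthetic data standing in for the DNA-sequence features; it is not the study's own code, and only two of the seven models are shown.

```python
# Sketch: 5-fold StratifiedKFold comparison of two of the models named above,
# on synthetic multi-class data standing in for DNA-sequence features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {"LogisticRegression": LogisticRegression(max_iter=1000),
          "NaiveBayes": GaussianNB()}
mean_acc = {name: cross_val_score(m, X, y, cv=cv).mean()
            for name, m in models.items()}
```

StratifiedKFold preserves the class proportions in each fold, which is what makes the comparison reliable on imbalanced or multi-class genomic labels.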

  • Research Article
  • Cited by 50
  • 10.3402/gha.v9.31026
Building local capacity for genomics research in Africa: recommendations from analysis of publications in Sub-Saharan Africa from 2004 to 2013
  • May 12, 2016
  • Global Health Action
  • Babatunde O Adedokun + 2 more

Background: The poor genomics research capacity of Sub-Saharan Africa (SSA) could prevent maximal benefits from the applications of genomics in the practice of medicine and research. The objective of this study is to examine the author affiliations of genomic epidemiology publications in order to make recommendations for building local genomics research capacity in SSA. Design: SSA genomic epidemiology articles published between 2004 and 2013 were extracted from the Human Genome Epidemiology (HuGE) database. Data on authorship details, country of population studied, and phenotype or disease were extracted. Factors associated with having a first author with an SSA institution affiliation (AIAFA) were determined using a Chi-square test and multiple logistic regression analysis. Results: The most commonly studied population was South Africa, accounting for 31.1%, followed by Ghana (10.6%) and Kenya (7.5%). About one-tenth of the papers were related to non-communicable diseases (NCDs) such as cancer (6.1%) and cardiovascular diseases (CVDs) (4.3%). Fewer than half of the first authors (46.9%) were affiliated with an African institution. Among the 238 articles with an African first author, over three-quarters (79.8%) belonged to a university or medical school, 16.8% were affiliated with a research institute, and 3.4% had affiliations with other institutions. Conclusions: Significant disparities currently exist among SSA countries in genomics research capacity. South Africa has the highest genomics research output, which is reflected in the investments made in its genomics and biotechnology sector. These findings underscore the need to focus on developing local capacity, especially among those affiliated with SSA universities, where there are more opportunities for teaching and research.

  • Research Article
  • Cited by 12
  • 10.1177/0272989x14565820
Value of Genetic Testing for Hereditary Colorectal Cancer in a Probability-Based US Online Sample.
  • Jan 14, 2015
  • Medical Decision Making
  • Sara J Knight + 5 more

While choices about genetic testing are increasingly common for patients and families, and public opinion surveys suggest public interest in genomics, it is not known how adults from the general population value genetic testing for heritable conditions. We sought to understand, in a US sample, the relative value of the characteristics of genetic tests to identify risk of hereditary colorectal cancer, among the first genomic applications with evidence to support its translation to clinical settings. A Web-enabled choice-format conjoint survey was conducted with adults aged 50 years and older from a probability-based US panel. Participants were asked to make a series of choices between 2 hypothetical blood tests that differed in risk of a false-negative test, privacy, and cost. Random parameters logit models were used to estimate preferences, the dollar value of genetic information, and intent to have genetic testing. A total of 355 individuals completed choice-format questions. Cost and privacy were more highly valued than reducing the chance of a false-negative result. Most (97%; 95% confidence interval [CI], 95%-99%) would have genetic testing to reduce the risk of dying of colorectal cancer in the best scenario (no false negatives, results disclosed to primary care physician). Only 41% (95% CI, 25%-57%) would have genetic testing in the worst case (20% false negatives, results disclosed to insurance company). Given the characteristics and levels included in the choice, if false-negative test results are unlikely and results are shared with a primary care physician, the majority would have genetic testing. As genomic services become widely available, primary care professionals will need to be increasingly knowledgeable about genetic testing decisions.

  • Research Article
  • Cited by 2
  • 10.1093/bioinformatics/btac086
Cox regression is robust to inaccurate EHR-extracted event time: an application to EHR-based GWAS.
  • Feb 14, 2022
  • Bioinformatics
  • Rebecca Irlmeier + 4 more

Logistic regression models are commonly used in genomic studies to analyze genetic data linked to electronic health records (EHRs), but they do not make full use of the time-to-event information available in EHRs. Previous work has shown that Cox regression, which can account for left truncation and right censoring in EHRs, increased the power to detect genotype-phenotype associations compared to logistic regression. We extend this work to evaluate the relative performance of Cox regression and various logistic regression models in the presence of positive errors in event time (delayed event time), which relate to the accuracy of recorded event times. One Cox model and three logistic regression models were considered under different scenarios of delayed event time. Extensive simulations and a genomic study application were used to evaluate the impact of delayed event time. While logistic regression does not model the time-to-event directly, the various logistic regression models used in the literature were more sensitive to delayed event time than Cox regression. The results highlight the importance of identifying and excluding patients diagnosed before entry time. Cox regression had similar or modestly improved statistical power over the various logistic regression models at controlled type I error. This was supported by the empirical data, where the Cox models consistently had the highest sensitivity to detect known genotype-phenotype associations under all scenarios of delayed event time. Access to individual-level EHR and genotype data is restricted by the IRB. Simulation code and the R script for data processing are at: https://github.com/QingxiaCindyChen/CoxRobustEHR.git. Supplementary data are available at Bioinformatics online.
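The point about excluding patients diagnosed before entry time can be illustrated with a small NumPy simulation (an assumption-laden sketch, not the authors' simulation code): event times receive a positive recording delay, and prevalent cases, whose recorded event precedes cohort entry, are filtered out before any model is fit.

```python
# Sketch: delayed event times and exclusion of prevalent cases (events
# recorded before cohort entry), as the simulation scenarios above discuss.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
entry_time = rng.uniform(0, 2, n)        # time of entry into the EHR cohort
true_event = rng.exponential(5, n)       # true event time
delay = rng.exponential(0.5, n)          # positive recording error
recorded_event = true_event + delay      # delayed event time seen in the EHR

# Keep only incident cases: event recorded after cohort entry.
incident = recorded_event > entry_time
kept_events = recorded_event[incident]
```

All distributions and parameters here are illustrative; the study's actual simulation scenarios are described in the paper and its GitHub repository.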

  • Research Article
  • Cited by 7
  • 10.1186/1752-0509-6-s2-s11
Phenotype prediction from genome-wide association studies: application to smoking behaviors
  • Dec 1, 2012
  • BMC Systems Biology
  • Dankyu Yoon + 2 more

Background: The great success of genome-wide association studies has drawn attention to the personal genome and to clinical applications such as diagnosis and disease risk prediction. However, previous prediction studies using known disease-associated loci have not been successful (area under the curve 0.55 ~ 0.68 for type 2 diabetes and coronary heart disease). There are several reasons for poor predictability, such as the small number of known disease-associated loci, simple analyses that do not consider complexity in phenotype, and the limited number of features used for prediction. Methods: In this research, we thoroughly investigated the effect of feature selection and prediction algorithm on the performance of prediction methods. In particular, we considered the following feature selection and prediction methods: regression analysis, regularized regression analysis, linear discriminant analysis, non-linear support vector machine, and random forest. For these methods, we studied the effects of feature selection and the number of features on prediction. Our investigation was based on the analysis of 8,842 Korean individuals genotyped by the Affymetrix SNP array 5.0, for predicting smoking behaviors. Results: To observe the effect of feature selection methods on prediction performance, selected features were used for prediction and the area under the curve score was measured. For feature selection, support vector machine (SVM) and elastic-net (EN) performed better than linear discriminant analysis (LDA), random forest (RF), and simple logistic regression (LR). For prediction, SVM showed the best performance based on the area under the curve score. With fewer than 100 SNPs, EN was the best prediction method, while SVM was the best if over 400 SNPs were used for the prediction. Conclusions: Across combinations of feature selection and prediction methods, SVM showed the best performance in both feature selection and prediction.
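A hedged sketch of one combination the abstract reports, elastic-net-based feature selection followed by an SVM, using scikit-learn on synthetic data (the original analysis used SNP genotypes and its own tuning; every parameter below is an illustrative assumption):

```python
# Sketch: elastic-net-penalised logistic regression selects features; an SVM
# is then fit on the selected features (synthetic stand-in for SNP data).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=300, n_informative=20,
                           random_state=0)
en = LogisticRegression(penalty="elasticnet", solver="saga",
                        l1_ratio=0.5, C=0.5, max_iter=3000)
selector = SelectFromModel(en).fit(X, y)
X_sel = selector.transform(X)            # keep features with large |coef|
svm = SVC().fit(X_sel, y)
```

The elastic-net penalty mixes L1 sparsity with L2 shrinkage, which is why it is a common choice for correlated SNP features.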

  • Research Article
  • 10.1002/cpz1.70046
GEDI: An R Package for Integration of Transcriptomic Data from Multiple Platforms for Bioinformatics Applications.
  • Oct 1, 2024
  • Current protocols
  • Mathias N Stokholm + 2 more

Transcriptomic data is often expensive and difficult to generate in large cohorts relative to genomic data; therefore, it is often important to integrate multiple transcriptomic datasets from both microarray- and next-generation sequencing (NGS)-based platforms across similar experiments or clinical trials to improve analytical power and the discovery of novel transcripts and genes. However, transcriptomic data integration presents a few challenges, including reannotation and batch effect removal. We developed the Gene Expression Data Integration (GEDI) R package to enable transcriptomic data integration by combining existing R packages. With just four functions, the GEDI R package makes constructing a transcriptomic data integration pipeline straightforward. Together, the functions overcome the complications in transcriptomic data integration by automatically reannotating the data and removing the batch effect. The removal of the batch effect is verified with principal component analysis, and the data integration is verified using a logistic regression model with forward stepwise feature selection. To demonstrate the functionalities of the GEDI package, we integrated five bovine endometrial transcriptomic datasets from the NCBI Gene Expression Omnibus. These transcriptomic datasets were from multiple high-throughput platforms, namely the array-based Affymetrix and Agilent platforms and the NGS-based Illumina paired-end RNA-seq platform. Furthermore, we compared the GEDI package to existing tools and found that GEDI is the only tool that provides a full transcriptomic data integration pipeline, including verification of both batch effect removal and data integration, for downstream genomic and bioinformatics applications. © 2024 The Author(s). Current Protocols published by Wiley Periodicals LLC.
Basic Protocol 1: ReadGE, a function to import gene expression datasets
Basic Protocol 2: GEDI, a function to reannotate and merge gene expression datasets
Basic Protocol 3: BatchCorrection, a function to remove batch effects from gene expression data
Basic Protocol 4: VerifyGEDI, a function to confirm successful integration of gene expression data

  • Research Article
  • 10.1093/ofid/ofab466.1440
1248. A Machine-Learning Approach to Predict the Cefazolin Inoculum Effect in Methicillin-Susceptible Staphylococcus aureus
  • Dec 4, 2021
  • Open Forum Infectious Diseases
  • Rafael Rios + 14 more

Background: The cefazolin (Cz) inoculum effect (CzIE), defined as an increase in the Cz MIC to ≥16 µg/mL at high inoculum (10⁷ CFU/mL), has been associated with poor outcomes in MSSA bacteremia and osteomyelitis. The CzIE is associated with the BlaZ β-lactamase, encoded by blaZ and regulated by BlaR (antibiotic sensor) and BlaI (transcriptional repressor). Here, we aimed to obtain a machine-learning (ML) model to predict the presence of the CzIE based on the nucleotide sequence of the entire bla operon and its regulatory components. Methods: Using whole genome sequencing, we analyzed the nucleotide sequences of the entire bla operon in 436 MSSA isolates recovered from blood, soft-tissue infections, or pneumonia in adults (training-testing cohort; prevalence of the CzIE: 46%). Also, 32 MSSA isolates recovered from pediatric patients with osteomyelitis with the CzIE were included as a validation cohort. The CzIE was determined by broth microdilution at high inoculum. K-mer counts were obtained from the bla operon sequences of the isolates from the training-testing cohort and then used in an ML pipeline which i) discards uninformative k-mers, ii) identifies optimal hyper-parameters, and iii) trains the model using 70% of the sequences as a training set and 30% as a testing set. The pipeline tested 11 different k-mer sizes and 2 models: Logistic Regression (LR) and Support Vector Machine (SVM). Finally, the model with the best predictive ability was applied to the sequences of the MSSA osteomyelitis isolates (validation cohort). Results: The ML approach had high specificity (>90%), accuracy (>80%), and ROC-AUC values (>0.7) for detecting the CzIE in the testing set of isolates (Figure 1), independently of the type of model or the k-mer size used. The best predictive ability was with LR using k-mers of 17 nucleotides, with an accuracy of 84%, specificity of 96%, and sensitivity of 70% in the testing set (Figure 2).
In the validation cohort, the model was able to correctly identify all the strains exhibiting the CzIE (100% sensitivity). Figure 1. Prediction metrics of the ML pipeline for the detection of the CzIE in MSSA isolates from the training-testing cohort. Predictions are shown according to the model and k-mer sizes tested. Figure 2. ROC of the best predictive model (Logistic Regression, k-mer size 17) for the detection of the CzIE in MSSA isolates. Conclusion: The ML approach is a promising genomic application to detect the CzIE in MSSA isolates from a variety of sources, bypassing phenotypic testing. Further validation is needed to evaluate its possible utility in clinical settings. Disclosures: Jonathon C. McNeil, MD, Agency for Healthcare Research and Quality (Research Grant or Support), Allergan (Grant/Research Support), Nabriva (Grant/Research Support, Other Financial or Material Support, Site PI for a multicenter trial); Anthony R. Flores, MD, MPH, PhD, Nothing to disclose; Sheldon L. Kaplan, MD, Pfizer (Research Grant or Support); Cesar A. Arias, M.D., MSc, Ph.D., FIDSA, Entasis Therapeutics (Grant/Research Support), MeMed Diagnostics (Grant/Research Support), Merck (Grant/Research Support); Lorena Diaz, PhD, Nothing to disclose
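The k-mer counting plus logistic regression idea can be sketched with scikit-learn's character n-grams; the toy sequences, the small k, and all parameters below are illustrative assumptions, not the study's pipeline (which used k up to 17 on the bla operon):

```python
# Sketch: count fixed-length k-mers in DNA sequences and fit a logistic
# regression, mirroring the k-mer + LR pipeline described above (toy data).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
K = 5  # the study's best model used k-mers of 17 nucleotides
seqs = ["".join(rng.choice(list("ACGT"), 60)) for _ in range(40)]
labels = np.array([0, 1] * 20)           # toy phenotype labels

vec = CountVectorizer(analyzer="char", ngram_range=(K, K), lowercase=False)
X = vec.fit_transform(seqs)              # sparse k-mer count matrix
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```

`analyzer="char"` with equal n-gram bounds yields exactly the fixed-length k-mer counts a pipeline like this needs.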

  • Book Chapter
  • 10.1007/978-3-319-28121-6_2
An Empirical Study of a Large Scale Online Recommendation System
  • Jan 1, 2015
  • Huazheng Fu + 2 more

The online recommendation service is widely used across the applications of telecommunication companies. For such applications, the user base is usually tremendous, with a variety of user characteristics and habits. It is therefore a challenge to achieve a high click-through rate (CTR) for online recommendations. In this paper, we propose an approach that combines ensemble trees and logistic regression (LR). The ensemble trees are effective in capturing the joint information of different features, which is then used by the LR scheme. In addition, to deal with scalability issues, we implemented our system with both Apache Storm (for real-time prediction and classification) and Apache Spark (for fast off-line model training). A group of experiments was carried out with real-world data sets, and the results show the efficiency and effectiveness of our proposed approach.
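The ensemble-trees-plus-LR pattern described above is commonly realized by feeding tree leaf indices, one-hot encoded, into a logistic regression. A minimal scikit-learn sketch under that assumption (not the paper's Storm/Spark implementation):

```python
# Sketch: gradient-boosted trees produce leaf indices, which are one-hot
# encoded and fed to a logistic regression (one common trees-plus-LR hybrid).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
gbt = GradientBoostingClassifier(n_estimators=30, random_state=0).fit(X, y)
leaves = gbt.apply(X)[:, :, 0]           # leaf index per sample per tree
enc = OneHotEncoder(handle_unknown="ignore").fit(leaves)
lr = LogisticRegression(max_iter=1000).fit(enc.transform(leaves), y)
```

The trees learn non-linear feature crossings; the LR then weights the resulting leaf indicators linearly, which is what makes the combination fast to score online.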

  • Research Article
  • Cited by 126
  • 10.1007/s00439-021-02411-y
Embeddings from protein language models predict conservation and variant effects
  • Dec 30, 2021
  • Human Genetics
  • Céline Marquet + 7 more

The emergence of SARS-CoV-2 variants stressed the demand for tools to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient (MCC) of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, either consistently or statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the computing/energy costs.
Our method predicted SAV effects for the entire human proteome (~ 20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.

  • Research Article
  • Cited by 53
  • 10.1002/ijc.23138
Genetic predictors of long‐term toxicities after radiation therapy for breast cancer
  • Nov 20, 2007
  • International Journal of Cancer
  • Nataliya Kuptsova + 10 more

Telangiectasia and subcutaneous fibrosis are the most common late dermatologic side effects observed in response to radiation treatment. Radiotherapy acts on cancer cells largely due to the generation of reactive oxygen species (ROS). ROS also induce normal tissue toxicities. Therefore, we investigated if genetic variation in oxidative stress-related enzymes confers increased susceptibility to late skin complications. Women who received radiotherapy following lumpectomy for breast cancer were followed prospectively for late tissue side effects after initial treatment. Final analysis included 390 patients. Polymorphisms in genes involved in oxidative stress-related mechanisms (GSTA1, GSTM1, GSTT1, GSTP1, MPO, MnSOD, eNOS, CAT) were determined from blood samples by MALDI-TOF. The associations between telangiectasia and genotypes were evaluated by multivariate unconditional logistic regression models. Patients with variant GSTA1 genotypes were at significantly increased risk of telangiectasia (OR 1.86, 95% CI 1.11-3.11). Reduced odds ratios of telangiectasia were noted for women with lower-activity eNOS genotype (OR 0.58, 95% CI 0.36-0.93). Genotype effects were modified by follow-up time, with the highest risk observed after 4 years of radiotherapy for gene polymorphisms in ROS-neutralizing enzymes. Decreased risk with eNOS polymorphisms was significant only among women with less than 4 years of follow-up. All other risk estimates were nonsignificant. Late effects of radiation therapy on skin appear to be modified by variants in genes related to protection from oxidative stress. The application of genomics to outcomes following radiation therapy holds the promise of radiation dose adjustment to improve both cosmetic outcomes and quality of life for breast cancer patients.

  • Research Article
  • Cited by 14
  • 10.1021/acsearthspacechem.1c00344
Novel Application of Machine Learning Techniques for Rapid Source Apportionment of Aerosol Mass Spectrometer Datasets
  • Apr 4, 2022
  • ACS Earth and Space Chemistry
  • Paritosh Pande + 13 more

In this work, we apply a machine learning approach, sparse multinomial logistic regression, to classify aerosol mass spectrometer (AMS) unit mass resolution (UMR) data, followed by an ensemble regression technique for source apportionment of organic aerosols (OA). The classifier was trained on 60 well-characterized laboratory and positive matrix factorization (PMF)-deconvolved reference spectra to identify eight OA types. These include four laboratory-derived secondary organic aerosol (SOA) spectra: isoprene photooxidation SOA, isoprene epoxydiols (IEPOX) SOA, a monoterpene SOA type that includes α-pinene and β-pinene SOA, and aromatic SOA from oxidation of naphthalene and m-xylene precursors. They also include PMF-deconvolved spectra for three primary organic aerosol (POA) types, namely hydrocarbon-like organic aerosol (HOA), biomass burning organic aerosol (BBOA), and cooking OA (COA), and a more oxidized oxygenated OA type (MO-OOA). A 5-fold cross-validation strategy, repeated 10 times, was used to assess the classifier's performance. The classifier had high classification accuracy for COA, aromatic SOA, and isoprene SOA spectra but incorrectly classified ~9% by number of MO-OOA spectra as BBOA, 12% of BBOA spectra as HOA (and vice versa), and 18% of IEPOX-SOA spectra as aromatic SOA. Next, an ensemble regression model was trained on an artificially generated dataset consisting of mixtures of different OA types to assess its ability to predict fractional mass abundances from the classification probabilities of the various OA species obtained from the multinomial logistic regression classifier trained on the reference spectra. Ultimately, the proposed approach was applied for source apportionment of aircraft-based AMS measurements of OA UMR spectra during the HI-SCALE field campaign.
On two representative days (May 6th and 18th, 2016), the algorithm determined that ~50-60% of OA by mass was MO-OOA, which represented a highly aged organic aerosol mixture from different sources. On both days, BBOA was determined to contribute less than 10% to OA by mass. However, on May 18th, the aromatic SOA fraction was higher compared to that on May 6th. The proposed approach is capable of rapidly analyzing AMS data in real time, making it suitable for applications where rapid source apportionment of AMS OA spectra is desirable.
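The first stage of this approach, a multinomial logistic regression that turns each spectrum into class probabilities, can be sketched as follows; the synthetic features and dimensions are stand-ins for UMR spectra, not the trained classifier from the paper:

```python
# Sketch: a multinomial logistic regression mapping spectrum-like feature
# vectors to probabilities over eight OA types (synthetic stand-in data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=240, n_features=120, n_informative=30,
                           n_classes=8, n_clusters_per_class=1, random_state=0)
clf = LogisticRegression(max_iter=2000).fit(X, y)
proba = clf.predict_proba(X)             # per-spectrum class probabilities
```

The downstream ensemble regression described above would then consume these class probabilities to estimate fractional mass abundances.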

  • Conference Article
  • Cited by 2
  • 10.1109/iscmi56532.2022.10068483
Loan Repayment Prediction Using Logistic Regression Ensemble Learning With Machine Learning Algorithms
  • Nov 26, 2022
  • Thuan Nguyen Dinh + 1 more

Lending activities are an important part of the credit activities of financial institutions and banks. This is an area that brings great potential for development as well as a sustainable source of profit for financial institutions and banks. However, lending to customers also brings high risks. Therefore, predicting the ability to repay on time and understanding the factors affecting the repayment ability of customers is extremely important, helping financial institutions and banks assess customers' ability to repay debts on time, contributing to minimizing bad debts and enhancing credit risk management. In this study, we propose a method that combines Logistic Regression with Random Forest, K-Nearest Neighbor, Support Vector Machine, Artificial Neural Network, Long Short-Term Memory, and Decision Tree models to predict customers' ability to repay on time, and we compare and evaluate the performance of these Machine Learning models. As a result, the Logistic Regression with Random Forest ensemble is found to be the optimal predictive model, and FICO score and annual income are found to significantly influence the forecast.
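One way to realize an LR-plus-Random-Forest combination of the kind compared above is stacking, with a logistic regression as the meta-learner; this scikit-learn sketch on synthetic data is an assumption about the combination scheme, since the abstract does not specify it:

```python
# Sketch: a stacking ensemble where a random forest's out-of-fold predictions
# feed a logistic-regression meta-learner (one way to pair LR with RF).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=15, random_state=0)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
).fit(X, y)
```

StackingClassifier fits the meta-learner on cross-validated base-model predictions, which avoids leaking training labels into the second stage.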

  • Research Article
  • Cited by 1
  • 10.37394/23203.2021.16.64
Creditor Classification Logistic Regression Ensemble Boosting And Logistic Regression In Creditor Classification With Binary Response
  • Dec 21, 2021
  • WSEAS TRANSACTIONS ON SYSTEMS AND CONTROL
  • Abela Chairunissa + 2 more

Credit risk is the risk with the greatest chance of occurring in banking. The number of bad loans also affects bank performance. The banking sector needs to know whether a prospective creditor is risky or not. The purpose of this study is to classify creditors, compare the classification results of logistic regression with the maximum likelihood model against the Boosting algorithm, especially AdaBoost, and select a model. Credit scoring aims to classify prospective creditors into two classes, namely good prospective creditors (Performing Loan) and bad prospective creditors (Non-Performing Loan), based on certain characteristics. The method often used for classifying creditors is logistic regression, but this method is less robust and less accurate than data mining methods. Thus, there is a need for methods that provide greater accuracy. Among the methods that have been proposed is Boosting, which operates sequentially by applying a classification algorithm to reweighted versions of the training data set. This study uses 5 datasets. The first dataset is secondary data on non-subsidized homeownership creditors of Bank X, Malang City. The other datasets are simulated data with sample sizes of 10, 500, and 1000. The results of this study indicate that ensemble boosting logistic regression is more suitable for describing binary response problems, especially creditor classification, because it provides more accurate information. For high-dimensional data, represented here by a sample size of 10, ensemble logistic regression is able to produce fairly accurate predictions with an accuracy rate of up to 80%, whereas logistic regression analysis returns NA for the model because the number of samples is smaller than the number of independent variables. The use of boosting is preferred because it focuses on misclassified cases and tends to increase accuracy.

  • Front Matter
  • Cited by 1
  • 10.1053/j.ajkd.2011.11.011
Genetic Risk Prediction for CKD: A Journey of a Thousand Miles
  • Dec 14, 2011
  • American Journal of Kidney Diseases
  • Jeffrey B Kopp + 1 more


  • Research Article
  • 10.14710/medstat.17.1.13-24
ENSEMBLE-BASED LOGISTIC REGRESSION ON HIGH-DIMENSIONAL DATA: A SIMULATION STUDY
  • Oct 14, 2024
  • MEDIA STATISTIKA
  • Tintrim Dwi Ary Widhianingsih + 2 more

The dramatic growth of computation has ushered in the big data era, escalating data sizes in various fields. Beyond huge sample sizes, cases arise with high-dimensional data having more features than samples. High computing power encourages the use of modern approaches for this type of dataset, yet in practice the common logistic regression method is still applied because of its simplicity and explainability. Applying logistic regression to high-dimensional data raises multicollinearity, overfitting, and computational complexity issues. Logistic Regression Ensemble (Lorens) and Ensemble Logistic Regression (ELR) are logistic-regression-based alternatives proposed to solve these problems. Lorens adopts the ensemble concept with mutually exclusive feature partitions to form several subsets of the data, while ELR incorporates feature selection by drawing a subset of features based on probability ranking values. This paper examines the effectiveness of Lorens and ELR for high-dimensional data classification through a simulation study under three scenarios: varying feature sizes, imbalanced high-dimensional data, and multicollinearity. Our simulation study reveals that ELR outperforms Lorens and obtains more stable performance over different feature sizes and imbalanced data settings. On the other hand, Lorens achieves more reliable performance than ELR in the simulation with a multicollinearity issue.
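The Lorens idea as the abstract describes it, mutually exclusive feature partitions each fitted with its own logistic regression, can be sketched like this (synthetic data; the partition count and the aggregation by probability averaging are illustrative assumptions, not the paper's exact algorithm):

```python
# Sketch of the Lorens idea: split features into mutually exclusive
# partitions, fit one logistic regression per partition, and average the
# predicted probabilities (synthetic high-dimensional-style data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=100, n_informative=10,
                           random_state=0)
rng = np.random.default_rng(0)
parts = np.array_split(rng.permutation(X.shape[1]), 5)   # disjoint partitions
models = [LogisticRegression(max_iter=1000).fit(X[:, p], y) for p in parts]
avg_proba = np.mean([m.predict_proba(X[:, p])[:, 1]
                     for m, p in zip(models, parts)], axis=0)
pred = (avg_proba >= 0.5).astype(int)
```

Because each base model sees only a disjoint slice of the features, no single fit faces the full p-greater-than-n problem, which is the motivation for the partitioned ensemble.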
