Variable selection via thresholding

Abstract

Variable selection is an important step in many modern statistical inference procedures. In the regression setting, when estimators cannot shrink irrelevant signals to zero, covariates without relationships to the response often manifest small but nonzero regression coefficients. The ad hoc procedure of discarding variables whose coefficients are smaller than some threshold is often employed in practice. We formally analyze a version of such thresholding procedures and develop a simple thresholding method that consistently estimates the set of relevant variables under mild regularity assumptions. Using this thresholding procedure, we propose a sparse, √n-consistent and asymptotically normal estimator whose nonzero elements do not exhibit shrinkage. The performance and applicability of our approach are examined via numerical studies of simulated and real data.
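
The ad hoc procedure described in the abstract can be illustrated with a minimal sketch: fit ordinary least squares, discard variables whose coefficients fall below a threshold in absolute value, then refit on the retained variables so that the nonzero estimates are not shrunk. This is not the paper's estimator. The fixed threshold `tau`, the helper names (`solve`, `ols`, `threshold_and_refit`) and the simulated data are all assumptions made for illustration; the abstract does not say how the threshold is chosen.

```python
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(X, y, cols):
    """Least-squares coefficients of y on the listed columns of X (normal equations)."""
    xtx = [[sum(row[j] * row[k] for row in X) for k in cols] for j in cols]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in cols]
    return solve(xtx, xty)


def threshold_and_refit(X, y, tau):
    """Keep variables whose full-model OLS coefficient exceeds tau in absolute value, then refit."""
    p = len(X[0])
    full = ols(X, y, list(range(p)))
    selected = [j for j in range(p) if abs(full[j]) > tau]
    beta = [0.0] * p  # discarded variables get an exact zero
    if selected:
        for j, b in zip(selected, ols(X, y, selected)):
            beta[j] = b
    return selected, beta


# Hypothetical simulated data: only variables 0 and 3 are relevant.
random.seed(0)
n, p = 500, 5
true_beta = [2.0, 0.0, 0.0, -3.0, 0.0]
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(b * x for b, x in zip(true_beta, row)) + random.gauss(0, 1) for row in X]

selected, beta_hat = threshold_and_refit(X, y, tau=0.5)
print(selected)
print([round(b, 2) for b in beta_hat])
```

In this sketch the refit step is what keeps the retained coefficients free of shrinkage; a fixed `tau` is used only for simplicity, and a threshold that adapts to the sample size would be needed for the consistency property the abstract describes.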

Similar Papers
  • Research Article
  • Cited by 9
  • 10.1016/j.csda.2019.106881
A novel Bayesian approach for variable selection in linear regression models
  • Nov 5, 2019
  • Computational Statistics & Data Analysis
  • Konstantin Posch + 2 more

  • Research Article
  • Cited by 8
  • 10.1080/00949655.2020.1788560
Variable selection of partially linear varying coefficient spatial autoregressive model
  • Jul 3, 2020
  • Journal of Statistical Computation and Simulation
  • Tizheng Li + 2 more

The partially linear varying coefficient spatial autoregressive model is a recently proposed semi-parametric spatial autoregressive model, in which some of the explanatory variables have varying coefficients while the remaining explanatory variables possess constant ones. Although some estimation methods have been proposed for the partially linear varying coefficient spatial autoregressive model, the problem of selecting important explanatory variables in the parametric component of such a model has not been addressed to date. In this paper, we propose a penalized profile least squares method to address this problem. Unlike the existing estimation methods, the proposed method can simultaneously select the significant explanatory variables in the parametric component and estimate the corresponding nonzero regression coefficients. Furthermore, we provide a computationally feasible algorithm to obtain the penalized profile least squares estimator. The finite sample performance of the proposed variable selection method is evaluated through simulation studies and illustrated by a real data example.

  • Research Article
  • Cited by 1
  • 10.1080/03610926.2023.2189059
Reducing bias and mitigating the influence of excess of zeros in regression covariates with multi-outcome adaptive LAD-lasso
  • Mar 17, 2023
  • Communications in Statistics - Theory and Methods
  • Jyrki Möttönen + 3 more

Zero-inflated explanatory variables, as opposed to outcome variables, are common, for example, in environmental sciences. In this article, we address the problem of having an excess of zero values in some continuous explanatory variables, which are subject to multi-outcome lasso-regularized variable selection. In short, the problem results from the failure of lasso-type shrinkage methods to distinguish a zero value occurring in the regression coefficient from one occurring in the corresponding value of the explanatory variable. This kind of confounding will increase the number of false positives: not all non-zero regression coefficients necessarily represent true outcome effects. We present here the adaptive LAD-lasso for multiple outcomes, which extends the earlier work on multi-outcome LAD-lasso with adaptive penalization. In addition to the well-known property of having less biased regression coefficients, we show that the adaptivity also improves the method's ability to recover from the influence of excess zero values measured in continuous covariates.

  • Research Article
  • Cited by 13
  • 10.1080/10618600.2020.1840997
Bayesian Variable Selection for Gaussian Copula Regression Models
  • Dec 10, 2020
  • Journal of Computational and Graphical Statistics
  • Angelos Alexopoulos + 1 more

We develop a novel Bayesian method to select important predictors in regression models with multiple responses of diverse types. A sparse Gaussian copula regression model is used to account for the multivariate dependencies between any combination of discrete and/or continuous responses and their association with a set of predictors. We use the parameter expansion for data augmentation strategy to construct a Markov chain Monte Carlo algorithm for the estimation of the parameters and the latent variables of the model. Based on a centered parameterization of the Gaussian latent variables, we design a fixed-dimensional proposal distribution to update jointly the latent binary vectors of important predictors and the corresponding nonzero regression coefficients. For Gaussian responses and for outcomes that can be modeled as a dependent version of a Gaussian response, this proposal leads to a Metropolis-Hastings step that allows an efficient exploration of the predictors’ model space. The proposed strategy is tested on simulated data and applied to real datasets in which the responses consist of low-intensity counts, binary, ordinal and continuous variables.

  • Research Article
  • 10.1007/s10044-025-01444-7
Structured regularization with object size selection using mathematical morphology
  • Mar 27, 2025
  • Pattern Analysis and Applications
  • Disi Lin + 4 more

We propose a novel way to incorporate morphology operators through structured regularization of machine learning models. Specifically, we introduce a feature map in the models that performs structured variable selection. The feature map is automatically processed by approximate morphology operators and is learned together with the model coefficients. Experiments were conducted with linear regression on both synthetic data, demonstrating that the proposed methods are effective in selecting groups of parameters with much less noise than baseline models, and on three-dimensional T1-weighted brain magnetic resonance images (MRI) for age prediction, demonstrating that the proposed methods enforce sparsity and select homogeneous regions of non-zero and relevant regression coefficients. The proposed methods improve interpretability in pattern analysis. The minimum size of features in the structured variable selection can be controlled by adjusting the structuring element in the approximate morphology operator, tailored to the specific study of interest. With these added benefits, the proposed methods still perform on par with commonly used variable selection and structured variable selection methods in terms of the coefficient of determination and the Pearson correlation coefficient.

  • Conference Article
  • Cited by 7
  • 10.1109/allerton.2012.6483259
Group model selection using marginal correlations: The good, the bad and the Ugly
  • Oct 1, 2012
  • Waheed U Bajwa + 1 more

Group model selection is the problem of determining a small subset of groups of predictors (e.g., the expression data of genes) that are responsible for the majority of the variation in a response variable (e.g., the malignancy of a tumor). This paper focuses on group model selection in high-dimensional linear models, in which the number of predictors far exceeds the number of samples of the response variable. Existing works on high-dimensional group model selection either require the number of samples of the response variable to be significantly larger than the total number of predictors contributing to the response or impose restrictive statistical priors on the predictors and/or nonzero regression coefficients. This paper provides a comprehensive understanding of a low-complexity approach to group model selection that avoids some of these limitations. The proposed approach, termed Group Thresholding (GroTh), is based on thresholding of marginal correlations of groups of predictors with the response variable and is reminiscent of existing thresholding-based approaches in the literature. The most important contribution of the paper in this regard is relating the performance of GroTh to a polynomial-time verifiable property of the predictors for the general case of arbitrary (random or deterministic) predictors and arbitrary nonzero regression coefficients.
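
The idea of thresholding marginal correlations of groups of predictors with the response can be sketched as follows. This is only an illustration of the general idea, not the GroTh algorithm as specified in the paper: scoring each group by the Euclidean norm of its marginal correlations, the fixed cutoff `tau`, the function names and the toy data are all assumptions.

```python
import math


def group_scores(X, y, groups):
    """Score each group by the Euclidean norm of its marginal correlations X_g^T y."""
    scores = []
    for cols in groups:
        corr = [sum(row[j] * yi for row, yi in zip(X, y)) for j in cols]
        scores.append(math.sqrt(sum(c * c for c in corr)))
    return scores


def select_groups(X, y, groups, tau):
    """Return the indices of groups whose score exceeds the cutoff tau."""
    return [g for g, s in enumerate(group_scores(X, y, groups)) if s > tau]


# Toy data (hypothetical): four orthonormal predictors split into two groups.
X = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
y = [3.0, 4.0, 0.1, 0.0]
groups = [[0, 1], [2, 3]]

scores = group_scores(X, y, groups)
chosen = select_groups(X, y, groups, tau=1.0)
print(scores, chosen)
```

Because each score needs only one pass over the data per group, the cost is linear in the number of predictors, which is consistent with the "low-complexity" description in the abstract.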

  • Research Article
  • Cited by 51
  • 10.1002/gepi.20353
Bayesian variable and model selection methods for genetic association studies
  • Jul 10, 2008
  • Genetic Epidemiology
  • Brooke L Fridley

Variable selection is growing in importance with the advent of high throughput genotyping methods requiring analysis of hundreds to thousands of single nucleotide polymorphisms (SNPs) and the increased interest in using these genetic studies to better understand common, complex diseases. Up to now, the standard approach has been to analyze the genotypes for each SNP individually to look for an association with a disease. Alternatively, combinations of SNPs or haplotypes are analyzed for association. Another added complication in studying complex diseases or phenotypes is that genetic risk for the disease is often due to multiple SNPs in various locations on the chromosome with small individual effects that may have a collectively large effect on the phenotype. Hence, multi-locus SNP models, as opposed to single SNP models, may better capture the true underlying genotypic-phenotypic relationship. Thus, innovative methods for determining which SNPs to include in the model are needed. The goal of this article is to describe several methods currently available for variable and model selection using Bayesian approaches and to illustrate their application for genetic association studies using both real and simulated candidate gene data for a complex disease. In particular, Bayesian model averaging (BMA), stochastic search variable selection (SSVS), and Bayesian variable selection (BVS) using a reversible jump Markov chain Monte Carlo (MCMC) for candidate gene association studies are illustrated using a study of age-related macular degeneration (AMD) and simulated data.

  • Research Article
  • Cited by 16
  • 10.1038/s41598-021-03278-9
Diagnostic value of baseline 18FDG PET/CT skeletal textural features in follicular lymphoma
  • Dec 1, 2021
  • Scientific Reports
  • Julie Faudemer + 5 more

At present, 18F-fluorodeoxyglucose (18FDG) positron emission tomography (PET)/computed tomography (CT) cannot be used to omit a bone marrow biopsy (BMB) among initial staging procedures in follicular lymphoma (FL). The additional diagnostic value of skeletal textural features on baseline 18FDG-PET/CT in diffuse large B-cell lymphoma (DLBCL) patients has given promising results. The aim of this study is to evaluate the value of 18FDG-PET/CT radiomics for the diagnosis of bone marrow involvement (BMI) in FL patients. This retrospective bicentric study enrolled newly diagnosed FL patients referred for baseline 18FDG-PET/CT. For visual assessment, examinations were considered positive in cases of obvious focal bone uptake. For textural analysis, the skeleton volumes of interest (VOIs) were automatically extracted from segmented CT images and analysed using LifeX software. BMB and visual assessment were taken as the gold standard: BMB−/PET− patients were considered bone-NEGATIVE patients, whereas BMB+/PET−, BMB−/PET+ and BMB+/PET+ patients were considered bone-POSITIVE patients. A LASSO regression algorithm was used to select features of interest and to build a prediction model. Sixty-six consecutive patients were included: 36 bone-NEGATIVE (54.5%) and 30 bone-POSITIVE (45.5%). The LASSO regression found variance_GLCM, correlation_GLCM, joint entropy_GLCM and busyness_NGLDM to have nonzero regression coefficients. Based on ROC analysis, a cut-off equal to −0.190 was found to be optimal for the diagnosis of BMI using PET pred.score. The corresponding sensitivity, specificity, PPV and NPV values were equal to 70.0%, 83.3%, 77.8% and 76.9%, respectively. When comparing the ROC AUCs obtained using BMB alone, visual PET assessment or PET pred.score, a significant difference was found between BMB and visual PET assessments (p = 0.010) but not between BMB and PET pred.score assessments (p = 0.097). Skeleton texture analysis is worth exploring to improve the performance of 18FDG-PET/CT for the diagnosis of BMI at baseline in FL patients.

  • Research Article
  • Cited by 23
  • 10.3389/fonc.2021.657002
Prognostic Value of Eight-Gene Signature in Head and Neck Squamous Carcinoma
  • Jun 18, 2021
  • Frontiers in Oncology
  • Baoling Liu + 6 more

Head and neck cancer (HNC) is the fifth most common cancer worldwide. In this study, we performed an integrative analysis of the discovery set and established an eight-gene signature for the prediction of prognosis in patients with head and neck squamous cell carcinoma (HNSCC). Univariate Cox analysis was used to identify prognosis-related genes (with P < 0.05) in the GSE41613, GSE65858, and TCGA-HNSC RNA-Seq datasets after data collection. We performed LASSO Cox regression analysis and identified eight genes (CBX3, GNA12, P4HA1, PLAU, PPL, RAB25, EPHX3, and HLF) with non-zero regression coefficients in TCGA-HNSC datasets. Survival analysis revealed that the overall survival (OS) of GSE41613 and GSE65858 datasets and the progression-free survival (DFS) of GSE27020 and GSE42743 datasets in the low-risk group exhibited better survival outcomes compared with the high-risk group. To verify that the eight-mRNA prognostic model was independent of other clinical features, KM survival analysis of the specific subtypes with different clinical characteristics was performed. Univariate and multivariate Cox regression analyses were used to identify three independent prognostic factors to construct a prognostic nomogram. Finally, the GSVA algorithm identified six pathways that were activated in the intersection of the TCGA-HNSC, GSE65858, and GSE41613 datasets, including early estrogen response, cholesterol homeostasis, oxidative phosphorylation, fatty acid metabolism, bile acid metabolism, and Kras signaling. However, the epithelial–mesenchymal transition pathway was inhibited at the intersection of the three datasets. In conclusion, the eight-gene prognostic signature proved to be a useful tool in prognostic evaluation and may facilitate personalized treatment of HNSCC patients.

  • Research Article
  • Cited by 2
  • 10.4251/wjgo.v14.i9.1823
Construction and analysis of an ulcer risk prediction model after endoscopic submucosal dissection for early gastric cancer
  • Sep 15, 2022
  • World Journal of Gastrointestinal Oncology
  • San-Dong Gong + 3 more

BACKGROUND: Endoscopic submucosal dissection (ESD) has been widely used in the treatment of early gastric cancer (EGC). A personalized and effective prediction method for ESD with EGC is urgently needed. AIM: To construct a risk prediction model for ulcers after ESD for EGC based on LASSO regression. METHODS: A total of 196 patients with EGC who received ESD treatment were prospectively selected as the research subjects and followed up for one month. They were divided into an ulcer group and a non-ulcer group according to whether ulcers occurred. The general data, pathology, and endoscopic characteristics of the groups were compared, and the best risk predictor subsets were screened by LASSO regression and tenfold cross-validation. Multivariate logistic regression was applied to analyze the risk factors for ulcers after ESD in patients with EGC. A receiver operating characteristic (ROC) curve was used to estimate the predictive model performance. RESULTS: One month after the operation, no patient was lost to follow-up. The incidence of ulcers was 20.41% (40/196) (ulcer group), and the incidence of no ulcers was 79.59% (156/196) (non-ulcer group). There were statistically significant differences in the course of disease, Helicobacter pylori infection history, smoking history, tumor number, clopidogrel medication history, lesion diameter, infiltration depth, convergent folds, and mucosal discoloration between the groups. Clopidogrel medication history, lesion diameter, convergent folds, and mucosal discoloration, the 4 variables with nonzero regression coefficients, were screened by LASSO regression analysis. Further multivariate logistic analysis showed that lesion diameter [odds ratio (OR) = 30.490, 95%CI: 8.584-108.294], convergent folds (OR = 3.860, 95%CI: 1.060-14.055), mucosal discoloration (OR = 3.191, 95%CI: 1.016-10.021), and history of clopidogrel (OR = 3.554, 95%CI: 1.009-12.515) were independent risk factors for ulcers after ESD in patients with EGC (P < 0.05). The ROC curve showed that the area under the curve of the risk prediction model for ulcers after ESD in patients with EGC was 0.944 (95%CI: 0.902-0.972). CONCLUSION: Clopidogrel medication history, lesion diameter, convergent folds, and mucosal discoloration can predict the occurrence of ulcers after ESD in patients with EGC.

  • Conference Article
  • Cited by 35
  • 10.1109/icdm.2008.51
Exploiting Local and Global Invariants for the Management of Large Scale Information Systems
  • Dec 1, 2008
  • Haifeng Chen + 3 more

This paper presents a data-oriented approach to modeling complex computing systems, in which an ensemble of correlation models is discovered to represent the system status. If the discovered correlations continue to hold under different user scenarios and workloads, they are regarded as invariants of the information system. In our previous work, we developed an algorithm to automatically search for invariants between any pair of system attributes, which we call local invariants. However, that method is unable to deal with high-order dependency models due to the combinatorial explosion of the search space. In this paper we use a Bayesian regression technique to discover those high-order correlation models, called global invariants. We treat each attribute as a response variable in turn and express its dependency on the other attributes in a regression model. By placing a Laplacian prior on the regression coefficients, we can find the solution in which only the attributes correlated with the response have nonzero regression coefficients. After that, we further consider the temporal dependencies of the extracted attributes by incorporating their past observations. We also provide a confidence metric and a validation procedure to measure the reliability of learned models. If the model does not break down in the validation, it is regarded as a true invariant of the system. Experimental results on a real wireless networking system show that the discovered invariants can be used to effectively detect system failures as well as provide valuable information about the failure source.

  • Conference Article
  • 10.1164/ajrccm-conference.2021.203.1_meetingabstracts.a3773
Diagnosis of COVID-19 by Exhaled Breath Analysis Using Gas Chromatography Mass Spectrometry
  • May 1, 2021
  • W Ibrahim + 9 more

Background: A novel human coronavirus, also known as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), emerged in China in late 2019 and has since claimed more than one million lives. COVID-19 infection is perceived to be seasonally recurrent, and a rapid non-invasive biomarker to accurately diagnose patients early in their disease course will be necessary to meet the operational demands for COVID-19 control in the coming years. Objective: To evaluate the role of volatile exhaled-breath biomarkers in identifying patients with suspected or confirmed COVID-19 infection, based on their underlying reverse transcriptase polymerase chain reaction (RT-PCR) status. Methods: We conducted an observational study at Glenfield Hospital, Leicester, United Kingdom, recruiting adult patients with suspected or confirmed COVID-19 pneumonia. Breath samples were collected using a standard breath collection bag, modified with appropriate filters to comply with local infection control recommendations, and samples were analysed using gas chromatography mass spectrometry (GC-MS). Findings: 81 patients were recruited, of whom 52/81 (64%) subsequently tested positive for COVID-19. A LASSO regression analysis was run with PCR status as the dependent variable. A set of seven features was extracted that had non-zero regression coefficients in at least 70 out of 100 runs of 10-fold cross validation. Compound identities were confirmed using the Metabolomics Standards Initiative (MSI). These were benzaldehyde, 1-propanol (MSI level 1), 3,6-methylundecane (MSI level 2), camphene and beta-cubebene (MSI level 1 and 2 respectively). Iodobenzene was also extracted, likely of exogenous origin, and an unidentified compound. A logistic regression model was fitted with the dependent variable as PCR status and independent variables as the seven features selected by the LASSO model. Partial Least Squares Discriminant Analysis (PLSDA) and Principal Component Analysis (PCA) were applied to the seven features, with the dependent variable as PCR status. The AUC for the first discriminant function score was 0.836 (95% CI: 0.745-0.928), sensitivity was 0.68 (95% CI 0.551-0.809), specificity was 0.857 (95% CI 0.728-0.987), positive predictive value (PPV) was 0.895 (95% CI 0.797-0.992) and negative predictive value (NPV) was 0.6 (95% CI 0.448-0.752). The AUC for the first PCA was 0.799 (95% CI: 0.698-0.900), sensitivity was 0.7 (95% CI 0.573-0.827), specificity was 0.786 (95% CI 0.634-0.938), PPV was 0.854 (95% CI 0.745-0.962) and NPV was 0.595 (95% CI 0.436-0.753). Conclusions: Breath analysis has promising combined sensitivity and specificity in detecting COVID-19, raising the possibility of mass rapid testing, pending external validation of the identified biomarkers.

  • Research Article
  • Cited by 3
  • 10.1007/s12561-020-09284-1
Variable Selection in Threshold Regression Model with Applications to HIV Drug Adherence Data
  • Jun 17, 2020
  • Statistics in Biosciences
  • Takumi Saegusa + 4 more

The threshold regression model is an effective alternative to the Cox proportional hazards regression model when the proportional hazards assumption is not met. This paper considers variable selection for threshold regression. This model has separate regression functions for the initial health status and the speed of degradation in health. This flexibility is an important advantage when considering relevant risk factors for a complex time-to-event model where one needs to decide which variables should be included in the regression function for the initial health status, in the function for the speed of degradation in health, or in both functions. In this paper, we extend the broken adaptive ridge (BAR) method, originally designed for variable selection for one regression function, to simultaneous variable selection for both regression functions needed in the threshold regression model. We establish variable selection consistency of the proposed method and asymptotic normality of the estimator of non-zero regression coefficients. Simulation results show that our method outperformed threshold regression without variable selection and variable selection based on the Akaike information criterion. We apply the proposed method to data from an HIV drug adherence study in which electronic monitoring of drug intake is used to identify risk factors for non-adherence.

  • Research Article
  • Cited by 15
  • 10.1016/j.stamet.2012.11.003
Modified SEE variable selection for varying coefficient instrumental variable models
  • Nov 27, 2012
  • Statistical Methodology
  • Peixin Zhao + 1 more

  • Research Article
  • Cited by 1
  • 10.4236/ojs.2014.41005
Automatic Variable Selection for High-Dimensional Linear Models with Longitudinal Data
  • Jan 1, 2014
  • Open Journal of Statistics
  • Ruiqin Tian + 1 more

High-dimensional longitudinal data arise frequently in biomedical and genomic research. It is important to select relevant covariates when the dimension of the parameters diverges as the sample size increases. We consider the problem of variable selection in high-dimensional linear models with longitudinal data. A new variable selection procedure is proposed using the smooth-threshold generalized estimating equation and quadratic inference functions (SGEE-QIF) to incorporate correlation information. The proposed procedure automatically eliminates inactive predictors by setting the corresponding parameters to zero, and simultaneously estimates the nonzero regression coefficients by solving the SGEE-QIF. The proposed procedure avoids the convex optimization problem and is flexible and easy to implement. We establish the asymptotic properties in a high-dimensional framework where the number of covariates increases as the number of clusters increases. Extensive Monte Carlo simulation studies are conducted to examine the finite sample performance of the proposed variable selection procedure.
