Data-driven discovery of functional materials: LARS–LASSO logistic regression for QSAR/QSPR design of compounds with anti-COVID-19 and other activities

Abstract

The possibility of using L1 regularization to obtain logistic classification equations for quantitative/qualitative structure-activity/property relationships (QSAR/QSPR) has been investigated. The least absolute shrinkage and selection operator (LASSO) variant of least angle regression (LARS) has been implemented for logistic regression. The method was used to build simple classification functions for three tasks: evaluating the basicity of different organic compounds towards the Li+ cation, studying the binding affinity of various organic molecules to the estrogen receptor, and predicting activity against the COVID-19 main protease. The obtained classification functions have satisfactory predictive properties. The results provide a foundation for investigating the electronic and spatial structures of potential ligands exhibiting the desired activity. A comparative analysis of chemoinformatics approaches facilitates the optimization of lead identification methodologies.
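
As an illustration of the idea of L1-regularized logistic classification described above, the following is a minimal sketch, not the authors' LARS–LASSO implementation. It swaps in plain proximal gradient descent (soft-thresholding) to minimize the mean log-loss plus lam * sum(|w_j|). The function names, the penalty value, and the two synthetic "descriptors" are all hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    # Proximal operator of t * |v|: shrink toward zero, clip small values to 0.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.5, n_iter=2000):
    """Minimize mean log-loss + lam * sum(|w_j|) by proximal gradient descent.

    The intercept b is left unpenalized. This is a stand-in solver, not LARS.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Residuals p_i - y_i of the current model.
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
                 for row, yi in zip(X, y)]
        grad_w = [sum(r * row[j] for r, row in zip(resid, X)) / n
                  for j in range(p)]
        grad_b = sum(resid) / n
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b


def predict(X, w, b):
    # Class 1 when the linear score is non-negative.
    return [1 if b + sum(wj * xj for wj, xj in zip(w, row)) >= 0 else 0
            for row in X]


# Synthetic data (hypothetical descriptors, not from the paper):
# column 0 separates the classes, column 1 is uninformative.
X = [[-2.0, 0.4], [-1.5, -0.2], [-1.0, 0.1], [-0.5, -0.3],
     [0.5, -0.3], [1.0, 0.1], [1.5, -0.2], [2.0, 0.4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_l1_logistic(X, y, lam=0.1)
labels = predict(X, w, b)
```

On this toy data the penalty should leave the first coefficient positive and hold the uninformative one at exactly zero, which is the descriptor-selection behaviour that makes the resulting classification function simple. A path algorithm such as LARS would instead trace the solution over a range of penalty values rather than solving for a single one.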

Similar Papers
  • Research Article
  • Citations: 3
  • 10.4236/ojs.2020.101009
Variable Selection via Biased Estimators in the Linear Regression Model
  • Jan 1, 2020
  • Open Journal of Statistics
  • Manickavasagar Kayanan + 1 more

The Least Absolute Shrinkage and Selection Operator (LASSO) is used for variable selection and, simultaneously, for handling the multicollinearity problem in the linear regression model. LASSO produces estimates with high variance if the number of predictors is higher than the number of observations and if high multicollinearity exists among the predictor variables. To handle this problem, the Elastic Net (ENet) estimator was introduced by combining LASSO and the Ridge estimator (RE). The solutions of LASSO and ENet have been obtained using the Least Angle Regression (LARS) and LARS-EN algorithms, respectively. In this article, we propose an alternative algorithm to overcome the issues in LASSO by combining LASSO with other existing biased estimators, namely the Almost Unbiased Ridge Estimator (AURE), Liu Estimator (LE), Almost Unbiased Liu Estimator (AULE), Principal Component Regression Estimator (PCRE), r-k class estimator and r-d class estimator. Further, we examine the performance of the proposed algorithm using a Monte-Carlo simulation study and real-world examples. The results showed that the LARS-rk and LARS-rd algorithms, which combine LASSO with the r-k class estimator and the r-d class estimator, outperformed the other algorithms under moderate and severe multicollinearity.

  • Research Article
  • Citations: 8
  • 10.1007/s00521-012-1189-6
Kernelized LARS–LASSO for constructing radial basis function neural networks
  • Sep 28, 2012
  • Neural Computing and Applications
  • Quan Zhou + 3 more

Model structure selection is of crucial importance in radial basis function (RBF) neural networks. Existing model structure selection algorithms are essentially forward selection or backward elimination methods that may lead to sub-optimal models. This paper proposes an alternative selection procedure based on the kernelized least angle regression (LARS)–least absolute shrinkage and selection operator (LASSO) method. By formulating the RBF neural network as a linear-in-the-parameters model, we derive an ℓ1-constrained objective function for training the network. The proposed algorithm makes it possible to dynamically drop a previously selected regressor term that is insignificant. Furthermore, inspired by the idea of LARS, the computation of output weights in our algorithm is greatly simplified. Since our proposed algorithm can simultaneously conduct model structure selection and parameter optimization, a network with better generalization performance is built. Computational experiments with artificial and real-world data confirm the efficacy of the proposed algorithm.

  • Research Article
  • Citations: 2
  • 10.21009/jsa.06104
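Penerapan Regresi Least Absolute Shrinkage And Selection Operator (LASSO) Untuk Mengidentifikasi Variabel Yang Berpengaruh Terhadap Kejadian Stunting di Indonesia [Application of Least Absolute Shrinkage and Selection Operator (LASSO) Regression to Identify Variables Affecting the Incidence of Stunting in Indonesia]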
  • Jun 30, 2022
  • Jurnal Statistika dan Aplikasinya
  • Tesa Trilonika Pardede

Linear regression analysis is an analytical method that can be used to analyze data and draw meaningful conclusions about the dependence of one variable on another. Linear regression analysis rests on several assumptions that must be met, such as normally distributed errors and no correlation between errors. Several obstacles can cause these assumptions to fail, for example correlation between independent variables (multicollinearity). The analysis in this study uses the Least Absolute Shrinkage And Selection Operator (LASSO) regression method with the Least Angle Regression (LAR) algorithm because the stunting data in Indonesia has multicollinearity problems among the independent variables used. LASSO can address multicollinearity in regression and, at the same time, shrink the regression coefficients of highly correlated independent variables to exactly zero. The LASSO coefficients are obtained by quadratic programming, so the LAR algorithm, which is more efficient for computing LASSO, is used. Based on the analysis that has been carried out, it is concluded that the variables of exclusive breastfeeding (X1), protein consumption (X2), DPT-HB exercise (X5), maternal height (X8) and diarrhea (X9) had an effect on stunting in Indonesia in 2018.

  • Research Article
  • Citations: 4
  • 10.47352/jmans.2774-3047.251
Performance of Ridge Regression, Least Absolute Shrinkage and Selection Operator, and Elastic Net in Overcoming Multicollinearity
  • Feb 23, 2025
  • Journal of Multidisciplinary Applied Natural Science
  • Dewi Retno Sari Saputro + 2 more

Multicollinearity is a violation of the assumptions of multiple linear regression analysis that can occur if there is a high correlation between the independent variables. The same holds for variants of multiple linear regression models such as the Geographically Weighted Regression (GWR) model. Multicollinearity makes parameter estimation using the Quadratic Method (QM) unstable and produces a large variance. What is expected of the parameter estimates, on the other hand, is minimum variance, even if the estimate is biased. Thus, one way to overcome multicollinearity is to use biased estimators, such as Ridge Regression (RR), the Least Absolute Shrinkage and Selection Operator (LASSO), and the Elastic Net (EN). In RR, the Least Square Method (LSM) coefficients are shrunk toward zero, but RR cannot select independent variables. The model parameters obtained from Ridge Regression are biased, while the variance of the resulting regression coefficients is relatively small. In addition, RR becomes increasingly difficult to interpret when a very large number of independent variables is used. Meanwhile, LASSO is a computational method that uses quadratic programming; it follows the RR principles and also performs variable selection. The LASSO method became widely known after the discovery of the Least-Angle Regression (LARS) algorithm. The LASSO method can reduce LSM coefficients to zero and thereby perform variable selection. LASSO also has weaknesses, so EN is used. In this article, the performance of the three methods is compared from the mathematical aspect. The performance of each is summarized as follows: RR is helpful for clustering effects, where collinear features can be selected together; LASSO is suitable for feature selection when the dataset has features with poor predictive power; and EN combines LASSO and RR, which has the potential to lead to simple and predictive models.

  • Research Article
  • Citations: 1
  • 10.1007/s41237-024-00237-2
Least angle regression in tangent space and LASSO for generalized linear models
  • Aug 8, 2024
  • Behaviormetrika
  • Yoshihiro Hirose

This study proposes sparse estimation methods for generalized linear models, which run either least angle regression (LARS) or the least absolute shrinkage and selection operator (LASSO) in the tangent space of the manifold of the statistical model. This study approximates the statistical model and subsequently uses exact calculations. LARS was proposed as an efficient algorithm for parameter estimation and variable selection for the normal linear model. The LARS algorithm is described in terms of Euclidean geometry, regarding the correlation as the metric of the parameter space. Since the LARS algorithm only works in Euclidean space, we transform a manifold of the statistical model into the tangent space at the origin. In generalized linear regression, this transformation allows us to run the original LARS algorithm for generalized linear models. The proposed methods are efficient and perform well. Real-data analysis indicates that the proposed methods output results similar to those of the l1-regularized maximum likelihood estimation for the aforementioned models. Numerical experiments reveal that our methods work well and that they may be better than l1-regularization in generalization, parameter estimation, and model selection.

  • Research Article
  • Citations: 1
  • 10.1108/gs-03-2021-0039
Data-driven structure selection for the grey NGMC(1,N) model
  • Sep 20, 2021
  • Grey Systems: Theory and Application
  • Dang Luo + 1 more

Purpose: With the prosperity of grey extension models, the form and structure of grey forecasting models tend to be complicated. How to select the appropriate model structure according to the data characteristics has become an important topic. The purpose of this paper is to design a structure selection method for the grey multivariate model. Design/methodology/approach: The linear correction term is introduced into the grey model, then the nonhomogeneous grey multivariable model with convolution integral [NGMC(1,N)] is proposed. Then, by incorporating the least absolute shrinkage and selection operator (LASSO), the model parameters are compressed and estimated based on the least angle regression (LARS) algorithm. Findings: By adjusting the values of the parameters, the NGMC(1,N) model can derive various structures of grey models, which shows the structural adaptability of the NGMC(1,N) model. Based on the geometric interpretation of the LASSO method, the structure selection of the grey model can be transformed into sparse parameter estimation, and the structure selection can be realized by LASSO estimation. Practical implications: This paper not only provides an effective method to identify the key factors of agricultural drought vulnerability, but also presents a practical model to predict agricultural drought vulnerability. Originality/value: Based on the LASSO method, a structure selection algorithm for the NGMC(1,N) model is designed, and the structure selection method is applied to the vulnerability prediction of agricultural drought in Puyang City, Henan Province.

  • Research Article
  • Citations: 1
  • 10.4081/ijas.2009.s2.168
Using LASSO to estimate marker effects for Genomic Selection
  • Jan 1, 2009
  • Italian Journal of Animal Science
  • Mario Graziano Usai + 2 more

Here we suggest a least absolute shrinkage and selection operator (LASSO) approach to estimate the marker effects for genomic selection using the least angle regression (LARS) algorithm, modified to include a cross-validation step to define the best subset of markers to include in the model. The LASSO-LARS was tested on simulated data which consisted of 5,865 individuals and 6,000 SNPs. The last generations of this dataset were the selection candidates. Using only animals from generations prior to the candidates, three approaches to splitting the population into training and validation sets for cross-validation were evaluated. Furthermore, different sizes of the validation sample were tested. Moreover, BLUP and Bayesian methods were carried out for comparison. The most reliable cross-validation method was the random splitting of the overall population with a validation sample size of 50% of the reference population. The accuracy of the GEBVs (correlation with true breeding values) in the candidate population obtained by LASSO-LARS was 0.89 with 156 explanatory SNPs. This value was higher than those obtained by using BLUP and Bayesian methods, which were 0.75 and 0.84, respectively. It was concluded that the LASSO-LARS approach is a good alternative way to estimate marker effects for genomic selection.

  • Research Article
  • Citations: 186
  • 10.1017/s0016672309990334
LASSO with cross-validation for genomic selection
  • Dec 1, 2009
  • Genetics Research
  • M Graziano Usai + 2 more

We used a least absolute shrinkage and selection operator (LASSO) approach to estimate marker effects for genomic selection. The least angle regression (LARS) algorithm and cross-validation were used to define the best subset of markers to include in the model. The LASSO-LARS approach was tested on two data sets: a simulated data set with 5865 individuals and 6000 Single Nucleotide Polymorphisms (SNPs); and a mouse data set with 1885 individuals genotyped for 10 656 SNPs and phenotyped for a number of quantitative traits. In the simulated data, three approaches were used to split the reference population into training and validation subsets for cross-validation: random splitting across the whole population; random sampling of validation set from the last generation only, either within or across families. The highest accuracy was obtained by random splitting across the whole population. The accuracy of genomic estimated breeding values (GEBVs) in the candidate population obtained by LASSO-LARS was 0.89 with 156 explanatory SNPs. This value was higher than those obtained by Best Linear Unbiased Prediction (BLUP) and a Bayesian method (BayesA), which were 0.75 and 0.84, respectively. In the mouse data, 1600 individuals were randomly allocated to the reference population. The GEBVs for the remaining 285 individuals estimated by LASSO-LARS were more accurate than those obtained by BLUP and BayesA for weight at six weeks and slightly lower for growth rate and body length. It was concluded that LASSO-LARS approach is a good alternative method to estimate marker effects for genomic selection, particularly when the cost of genotyping can be reduced by using a limited subset of markers.

  • Book Chapter
  • Citations: 38
  • 10.1007/11564096_18
Kernel Basis Pursuit
  • Jan 1, 2005
  • Vincent Guigue + 2 more

Estimating a non-uniformly sampled function from a set of learning points is a classical regression problem. Kernel methods have been widely used in this context, but every problem leads to two major tasks: optimizing the kernel and setting the fitness-regularization compromise. This article presents a new method to estimate a function from noisy learning points in the context of RKHS (Reproducing Kernel Hilbert Space). We introduce the Kernel Basis Pursuit algorithm, which enables us to build an ℓ1-regularized multiple-kernel estimator. The general idea is to decompose the function to learn on a sparse-optimal set of spanning functions. Our implementation relies on the Least Absolute Shrinkage and Selection Operator (LASSO) formulation and on the Least Angle Regression (LARS) solver. The computation of the full regularization path, through LARS, will enable us to propose new adaptive criteria to find an optimal fitness-regularization compromise. Finally, we aim at proposing a fast parameter-free method to estimate non-uniformly sampled functions.

  • Research Article
  • Citations: 10
  • 10.1109/lsp.2012.2221712
On LARS/Homotopy Equivalence Conditions for Over-Determined LASSO
  • Dec 1, 2012
  • IEEE Signal Processing Letters
  • Junbo Duan + 4 more

We revisit the positive cone condition given by Efron et al. [1] for the over-determined least absolute shrinkage and selection operator (LASSO). It is a sufficient condition ensuring that the number of nonzero entries in the solution vector keeps increasing when the penalty parameter decreases, based on which the least angle regression (LARS) [1] and homotopy [2] algorithms yield the same iterates. We show that the positive cone condition is equivalent to the diagonal dominance of the Gram matrix inverse, leading to a simpler way to check the positive cone condition in practice. Moreover, we elaborate on a connection between the positive cone condition and the mutual coherence condition given by Donoho and Tsaig [3], ensuring the exact recovery of any $k$-sparse representation using both LARS and homotopy.

  • Conference Article
  • Citations: 8
  • 10.1109/cmvit.2017.19
A New Sparse LSSVM Method Based the Revised LARS
  • Feb 1, 2017
  • Shuisheng Zhou + 1 more

The least squares support vector machine (LSSVM) has performance comparable to the support vector machine (SVM), and it has been widely used for classification and regression problems. The solutions of LSSVM are obtained by solving linear equations, but they lack sparseness, which makes LSSVM unable to handle large-scale data sets. The state-of-the-art method least angle regression (LARS) can obtain a sparse solution by solving the Least Absolute Shrinkage and Selection Operator (LASSO) problem. We therefore use the idea of LARS to obtain a sparse solution of the LSSVM and propose RLARS-LSSVM, which is an efficient method. The feature of the method is to iteratively select the most important samples as support vectors and simultaneously remove the samples that are similar to the selected support vectors. Experimental results show that the proposed method can obtain much higher test accuracy than other sparse LSSVM methods at the same number of support vectors.

  • Research Article
  • Citations: 15
  • 10.4137/cin.s3805
Development and Validation of Predictive Indices for a Continuous Outcome Using Gene Expression Profiles
  • Jan 1, 2010
  • Cancer Informatics
  • Yingdong Zhao + 1 more

There have been relatively few publications using linear regression models to predict a continuous response based on microarray expression profiles. Standard linear regression methods are problematic when the number of predictor variables exceeds the number of cases. We have evaluated three linear regression algorithms that can be used for the prediction of a continuous response based on high dimensional gene expression data. The three algorithms are the least angle regression (LAR), the least absolute shrinkage and selection operator (LASSO), and the averaged linear regression method (ALM). All methods are tested using simulations based on a real gene expression dataset and analyses of two sets of real gene expression data and using an unbiased complete cross validation approach. Our results show that the LASSO algorithm often provides a model with somewhat lower prediction error than the LAR method, but both of them perform more efficiently than the ALM predictor. We have developed a plug-in for BRB-ArrayTools that implements the LAR and the LASSO algorithms with complete cross-validation.

  • Research Article
  • Citations: 2
  • 10.6339/jds.2013.11(2).1073
Variable Selection in the Chlamydia Pneumoniae Lung Infection Study
  • Jul 30, 2021
  • Journal of Data Science
  • Yuan Kang + 1 more

In this study, data based on nucleic acid amplification techniques (polymerase chain reaction), consisting of 23 different transcript variables involved in the genetic mechanisms regulating chlamydial infection disease, were analyzed against two different outcomes of murine C. pneumoniae lung infection (disease expressed as lung weight increase, and C. pneumoniae load in the lung). A model with a reduced set of transcript variables of interest at the early infection stage was obtained using some traditional variable selection methods (stepwise regression, partial least squares regression (PLS)) and some modern ones (least absolute shrinkage and selection operator (LASSO), forward stagewise regression and least angle regression (LARS)). Through these variable selection methods, the variables of interest are selected to investigate the genetic mechanisms that determine the outcomes of chlamydial lung infection. The transcript variables Tim3, GATA3, Lacf and Arg2 (X4, X5, X8 and X13) were detected as the main variables of interest for studying the C. pneumoniae disease (lung weight increase) or C. pneumoniae lung load outcomes. Models including these key variables may provide possible answers to the problem of the molecular mechanisms of chlamydial pathogenesis.

  • Research Article
  • 10.1088/1757-899x/242/1/012110
A weighted method based on Lars algorithm
  • Sep 1, 2017
  • IOP Conference Series: Materials Science and Engineering
  • Lin Chen + 3 more

LASSO (Least Absolute Shrinkage and Selection Operator) is mainly used for variable selection, and the algorithm and its improved variants have attracted wide attention in many fields. To improve the accuracy of the regression coefficients computed for the LASSO problem, this paper proposes a new algorithm based on the LARS (Least Angle Regression) algorithm that changes its approximation direction. It uses one of two weighting methods (the coefficient of variation method or the entropy weight method) to calculate the weight of the linear relationship between the independent and dependent variables, from which a set of regression coefficients for the linear regression model is calculated. Compared with the LARS algorithm, the improved algorithm proposed in this paper is shown to predict more precisely.

  • Dissertation
  • 10.31390/gradschool_dissertations.3847
Association Genetics for Agronomic Traits in Rice and Cloning of ALS Herbicide Resistant Genes from Coreopsis Tinctoria Nutt
  • Jun 19, 2006
  • Nengyi Zhang

We have evaluated the potential of discriminant analysis (DA) to detect candidate markers associated with twelve economically important traits in a large population of unrelated U.S. and Asian inbred lines of rice. Associated marker alleles detected by DA mapped within the same genetic intervals when compared with previous traditional QTL mapping experiments that evaluated progeny derived from various controlled crosses. New markers identified by DA suggest that the procedure can also uncover relevant genetic regions not possible by standard genetic tests. With the same dataset, we also compared different modern regression approaches for selecting molecular markers associated with the twelve agronomic traits. These methods included stepwise forward regression (SFR), least angle regression (LAR) and least absolute shrinkage and selection operator (LASSO) selection. The epistatic model based on stepwise forward regression did successfully identify several interacting loci that explained a relatively high proportion of the observed variation for all the twelve agronomically important traits. Moreover, the loci identified by the epistatic model mapped within previously known QTL regions that underscores the genetic basis of the selected markers. It was concluded that stepwise forward regression with consideration for population structure, epistatic interactions, and missing data (multiple imputation) was a robust method, compared to the general linear model, to identify markers associated with complex agronomic traits. Acetolactate synthase (ALS), also known as acetohydroxy acid synthase (AHAS), which catalyzes the first step in the biosynthesis of the branched-chain amino acids valine, leucine and isoleucine in plants, is a target of five herbicide groups, including sulfonylurea and imidazolinone. A recently discovered group of Coreopsis tinctoria Nutt. mutants from the field showed high levels of resistance to both sulfonylurea and imidazolinone herbicides. 
In this study the mutants were compared by chemical, genetic, and molecular analyses with “normal” or wild-type Coreopsis. A phylogenetic analysis revealed that the ALS gene can serve as a useful molecular tool for evaluating evolutionary relationships among plant species. Due to pending patent applications by the Louisiana State University Agricultural Center and restrictions of patent applications, specific results from this research cannot be presented in this dissertation.
