Sparse Regression Problem Research Articles

This dissertation deals with three closely related topics of the lasso in addition to supplying a comprehensive overview of the rapidly growing literature in this field. The first part aims at improving the lasso to attain smaller predictor error while simultaneously keeping the model sparsity. We propose the data-augmented weighted lasso (DAWL) to make a natural combination of the lasso and other estimators like ridge regression. We investigate the data-augmentation starting from the ridge's nature in solving the singularity problem which successfully explains the reasonability of the elastic net, and from a non-asymptotic study of the lasso's variable selection which describes the roles of different parts of the Gram matrix played in lasso estimation and selection. A robust data-dependent scaling and a 'ranged lasso' are proposed to augment both the regression matrix (nondiagonally) and the response vector. In the discussions of weights, we prove a sharp oracle inequality for the weighted lasso in the orthogonal case, and propose z-value based weights with good asymptotics. Simulations show the advantages of DAWL in test error and sparsity. The second topic is the study of a generic sparse regression problem with a customizable sparsity pattern matrix, motivated by, but not limited to, a supervised gene clustering problem in microarray data analysis. The 'clustered lasso' method is proposed with l1-type penalties on both the coefficients and their pairwise differences. Somewhat surprisingly, it shows a quite different behavior than the lasso or the fused lasso; the granted power of the l1-penalty to approximate the l0-penalty seems specious in this situation. This leads us to a theoretical study of the power and limitations, of the l1-penalty in the genera framework of sparse regression. We then discuss how to combine data-augmentation and weights to improve the naive l1-penalty. To attack the challenging computation problem in high-dimensional space, we successfully generalize an iterative algorithm for solving the lasso and develop a novel accelerated 'annealing' algorithm with theoretical justifications. It applies to any sparse regression like the fused/clustered lasso, and can handle a large design matrix as well as a large sparsity pattern matrix with apparent ease. In the third part, we discuss a class of thresholding-based iterative selection procedures (TISP) for model selection and shrinkage. People have long before noticed the weakness of the convex l1-constraint (or the soft-thresholding) in wavelets and have designed many different forms of nonconvex penalties to increase model sparsity and accuracy. But for a nonorthogonal regression matrix, there is great difficulty in both investigating the performance in theory and solving the problem in computation. TISP provides a simple and efficient way to tackle this so that we successfully borrow the rich results in the orthogonal design to solve the (nonconvexly) penalized regression for a general design matrix. Our starting point is, however, the thresholding rules rather than the penalty functions. Indeed, there is a universal connection between them. But a drawback of the latter is its non-unique form, and our way greatly facilitates the computation and the analysis. In fact, we are able to build the convergence theorem and explore the theoretical properties of the selection and the estimation via TISP nonasymptotically. More importantly, a novel Hybrid-TISP is proposed based on hard-thresholding and ridge-thresholding. It provides a fusion between the l0-penalty and the l2-penalty, and adaptively achieves the right balance between shrinkage and selection in statistical modeling. In practice, Hybrid-TISP shows superiorperformance in test-error and is parsimonious.

BackgroundA genetic network can be represented as a directed graph in which a node corresponds to a gene and a directed edge specifies the direction of influence of one gene on another. The reconstruction of such networks from transcript profiling data remains an important yet challenging endeavor. A transcript profile specifies the abundances of many genes in a biological sample of interest. Prevailing strategies for learning the structure of a genetic network from high-dimensional transcript profiling data assume sparsity and linearity. Many methods consider relatively small directed graphs, inferring graphs with up to a few hundred nodes. This work examines large undirected graphs representations of genetic networks, graphs with many thousands of nodes where an undirected edge between two nodes does not indicate the direction of influence, and the problem of estimating the structure of such a sparse linear genetic network (SLGN) from transcript profiling data.ResultsThe structure learning task is cast as a sparse linear regression problem which is then posed as a LASSO (l1-constrained fitting) problem and solved finally by formulating a Linear Program (LP). A bound on the Generalization Error of this approach is given in terms of the Leave-One-Out Error. The accuracy and utility of LP-SLGNs is assessed quantitatively and qualitatively using simulated and real data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative provides gold standard data sets and evaluation metrics that enable and facilitate the comparison of algorithms for deducing the structure of networks. The structures of LP-SLGNs estimated from the INSILICO1, INSILICO2 and INSILICO3 simulated DREAM2 data sets are comparable to those proposed by the first and/or second ranked teams in the DREAM2 competition. The structures of LP-SLGNs estimated from two published Saccharomyces cerevisae cell cycle transcript profiling data sets capture known regulatory associations. In each S. cerevisiae LP-SLGN, the number of nodes with a particular degree follows an approximate power law suggesting that its degree distributions is similar to that observed in real-world networks. Inspection of these LP-SLGNs suggests biological hypotheses amenable to experimental verification.ConclusionA statistically robust and computationally efficient LP-based method for estimating the topology of a large sparse undirected graph from high-dimensional data yields representations of genetic networks that are biologically plausible and useful abstractions of the structures of real genetic networks. Analysis of the statistical and topological properties of learned LP-SLGNs may have practical value; for example, genes with high random walk betweenness, a measure of the centrality of a node in a graph, are good candidates for intervention studies and hence integrated computational – experimental investigations designed to infer more realistic and sophisticated probabilistic directed graphical model representations of genetic networks. The LP-based solutions of the sparse linear regression problem described here may provide a method for learning the structure of transcription factor networks from transcript profiling and transcription factor binding motif data.

Sparse Regression Problem Research Articles

Related Topics

Articles published on Sparse Regression Problem

Stochastic expansions using continuous dictionaries: Lévy adaptive regression kernels

Sound Field Reproduction using the Lasso

Sparsity-Aware Estimation of CDMA System Parameters

Sparse regression with exact clustering

Sparse regression using mixed norms

A linear programming approach for estimating the structure of a sparse linear genetic network from transcript profiling data

Data-driven facial animation based on manifold Bayesian regression

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Sparse Regression Problem Research Articles

Related Topics

Articles published on Sparse Regression Problem

Stochastic expansions using continuous dictionaries: Lévy adaptive regression kernels

Sound Field Reproduction using the Lasso

Sparsity-Aware Estimation of CDMA System Parameters

Sparse regression with exact clustering

Sparse regression using mixed norms

A linear programming approach for estimating the structure of a sparse linear genetic network from transcript profiling data

Data-driven facial animation based on manifold Bayesian regression