Sparse constrained and unconstrained non-symmetric correspondence analysis
Abstract In this paper, we propose to regularize non-symmetric correspondence analysis (NSCA) and its canonical variant by employing LASSO and group LASSO penalties. NSCA visualizes the asymmetric association structure of a categorical predictor variable and a categorical response variable through a biplot with points for the predictor categories and vectors for the response categories. In canonical NSCA, external information is available about the categories of the predictor variable, and this information is used to linearly constrain the coordinates of the points. When the number of predictor categories or the number of external variables is large, this leads to problems of interpretation and/or estimation. To avoid these problems, we propose to use a LASSO or group LASSO penalty on the parameters. Such penalties shrink the parameters towards zero, yielding a sparse solution. To this end, we first cast (constrained) NSCA as a least squares estimation problem and then add the penalty to the least squares loss function. We derive a Majorization-Minimization algorithm to minimize this loss function. A bootstrap procedure is proposed for model selection, that is, for determining the optimal dimensionality and the optimal value of the penalty parameter. The procedures are illustrated using two empirical data sets, one for constrained (i.e., canonical) NSCA and one for unconstrained NSCA. We discuss in detail the model selection procedure and the interpretation of the selected model.
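Although this abstract does not reproduce the algorithm, the general recipe it names (majorization-minimization applied to a LASSO-penalized least squares loss) can be sketched for a generic problem. The function below is an illustration under that reading, not the authors' (C)NSCA implementation; all names are ours. For the group LASSO, the elementwise operator would be replaced by a blockwise one.

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding: the proximal map of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def mm_lasso(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by majorization-minimization.

    At each iterate the quadratic term is majorized by replacing X'X with
    L * I, where L is the largest eigenvalue of X'X; minimizing the
    majorizer then has the closed-form soft-thresholding update below.
    """
    L = np.linalg.eigvalsh(X.T @ X).max()   # majorization constant
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)            # gradient of the smooth part
        b = soft_threshold(b - grad / L, lam / L)
    return b
```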
- Research Article
- 10.1007/s11634-009-0054-7
- Nov 14, 2009
- Advances in Data Analysis and Classification
Nonsymmetric correspondence analysis (NSCA) aims to examine predictive relationships between rows and columns of a contingency table. The predictor categories of such tables are often accompanied by some auxiliary information. Constrained NSCA (CNSCA) incorporates such information as linear constraints on the predictor categories. However, imposing constraints also means that part of the predictive relationship is left unaccounted for by the constraints. A method of NSCA is proposed for analyzing the residual part along with the part accounted for by the constraints. The CATANOVA test may be invoked to test the significance of each part. The two tests parallel the distinction between tests of ignoring and eliminating, and help gain some insight into what is known as Simpson’s paradox in the analysis of contingency tables. Two examples are given to illustrate the distinction.
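A worked equation may make the split concrete. In notation of our own (not the paper's), let $G$ hold the external information on the predictor categories and $D$ a diagonal matrix of row weights; then any matrix $Y$ of predictive deviations decomposes along the $D$-orthogonal projector onto the column space of $G$:

$$ Y = P_G\,Y + (I - P_G)\,Y, \qquad P_G = G\,(G^\top D\,G)^{-1} G^\top D . $$

The first term is the part accounted for by the constraints and the second is the residual part; because the projector is $D$-orthogonal, the total inertia splits additively over the two parts, which is what permits testing each part separately with CATANOVA-type statistics.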
- Research Article
- 10.1016/j.csda.2008.09.004
- Sep 7, 2008
- Computational Statistics & Data Analysis
Regularized nonsymmetric correspondence analysis
- Research Article
- 10.1109/tase.2019.2941167
- Oct 18, 2019
- IEEE Transactions on Automation Science and Engineering
Learning the relationship between a response variable (e.g., a quality characteristic) and a set of predictors (e.g., process variables) is of special importance in process modeling, prediction, and optimization. In many applications, not only is the number of these variables large, but the variables are also high-dimensional (HD) (e.g., they are represented by waveform signals). This high dimensionality requires a systematic approach both to modeling the relationship between the variables and to removing the noninformative input variables. This article proposes a functional regression method in which an HD response is estimated and predicted through a set of informative and noninformative HD covariates. For this purpose, the functional regression coefficients are expanded through a set of low-dimensional smooth basis functions. To estimate the low-dimensional set of parameters, a penalized loss function with both smoothing and group lasso penalties is defined. The block coordinate descent (BCD) method is employed to develop a computationally tractable algorithm for minimizing the loss function. Through simulations and case studies, the performance of the proposed method is evaluated and compared with benchmarks; the results illustrate the advantage of the proposed method over the benchmarks. Note to Practitioners: This article proposes a method for efficient and interpretable modeling of processes with high-dimensional (HD) data, such as waveform signals. Specifically, the proposed method generates a regression model that predicts a function (e.g., a sensor's readings over time) using several functional inputs. Existing functional regression techniques are mostly limited to a single functional input and are focused on profile data. In many applications, however, a large number of process variables are available for estimating an HD output, such as an image. This article addresses these problems by employing basis functions to reduce the dimension of the functions and by introducing specific penalties that remove noninformative inputs and improve computational efficiency. A model generated by the proposed approach can be used for process monitoring and optimization. Using simulation and case studies, the performance of the developed method is evaluated and compared with other methods under various scenarios, which can provide practitioners with useful guidelines for selecting an appropriate method for process modeling.
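The group lasso penalty enters BCD updates through the blockwise ("group") soft-thresholding operator, which either shrinks an entire coefficient block or zeroes it out as a unit; this is the mechanism that removes noninformative inputs. A minimal generic sketch (not the paper's exact update):

```python
import numpy as np

def group_soft_threshold(v, t):
    """Proximal map of t * ||v||_2: shrink the whole block v toward zero,
    and set it exactly to zero when its norm falls below the threshold t."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)    # the entire group is dropped
    return (1.0 - t / norm) * v    # otherwise shrink it radially
```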
- Research Article
- 10.11648/j.ijtam.20190502.11
- Jan 1, 2019
- International Journal of Theoretical and Applied Mathematics
The linear model (LM) represents a major advance in regression analysis and has been considered one of the most important statistical developments of the last fifty years, followed by the general linear model (GLM), principal component analysis (PCA), and constrained principal component analysis (CPCA) over the last thirty years. This paper introduces a series of papers prepared within the framework of an international workshop. First, the LM and GLM are discussed. Next, an overview of PCA is presented, followed by constrained principal component analysis. Some of its special cases are noted, such as PCA, canonical correlation analysis (CANO), redundancy analysis (RA), correspondence analysis (CA), growth curve models (GCM), extended growth curve models (ExGCM), canonical discriminant analysis (CDA), constrained correspondence analysis, non-symmetric correspondence analysis, multiple-set CANO, multiple correspondence analysis, vector preference models, seemingly unrelated regression (SUR), weighted low rank approximations, two-way canonical decomposition with linear constraints, and multilevel RA. Related methods, and the ordinary least squares (OLS) estimator as a special case of CPCA, are also introduced. Finally, an example is given to indicate the importance of CPCA and the difference between PCA and CPCA. CPCA is a method for structural analysis of multivariate data that combines features of regression analysis and principal component analysis: the original data are first decomposed into several components according to external information, and the components are then subjected to principal component analysis to explore structures within them.
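The closing description of CPCA (decompose the data according to external information, then apply PCA within each component) can be made concrete with a toy sketch. This is our own minimal reading, with hypothetical names; full CPCA also handles row and column weights and metrics:

```python
import numpy as np

def cpca_sketch(X, G):
    """Split X into the part explained by external information G and the
    residual, then analyze each part by PCA (here via the SVD).
    Assumes G has full column rank and the columns of X are centered."""
    H = G @ np.linalg.solve(G.T @ G, G.T)   # projector onto col space of G
    X_ext = H @ X                           # structure tied to external info
    X_res = X - X_ext                       # structure left unexplained
    pca_ext = np.linalg.svd(X_ext, full_matrices=False)
    pca_res = np.linalg.svd(X_res, full_matrices=False)
    return pca_ext, pca_res
```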
- Peer Review Report
- 10.7554/elife.75600.sa2
- Feb 23, 2022
Genotype imputation is a foundational tool for population genetics. Standard statistical imputation approaches rely on the co-location of large whole-genome sequencing-based reference panels, powerful computing environments, and potentially sensitive genetic study data. This results in computational resource and privacy-risk barriers to access to cutting-edge imputation techniques. Moreover, the accuracy of current statistical approaches is known to degrade in regions of low and complex linkage disequilibrium. Artificial neural network-based imputation approaches may overcome these limitations by encoding complex genotype relationships in easily portable inference models. Here, we demonstrate an autoencoder-based approach for genotype imputation, using a large, commonly used reference panel, and spanning the entirety of human chromosome 22. Our autoencoder-based genotype imputation strategy achieved superior imputation accuracy across the allele-frequency spectrum and across genomes of diverse ancestry, while delivering at least fourfold faster inference run time relative to standard imputation tools.
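The architecture itself is not specified in the abstract; purely to illustrate the idea of learning genotype relationships for imputation, here is a toy single-hidden-layer denoising autoencoder in plain NumPy (all names and design choices are ours, not the authors'): random entries are masked during training and the network learns to reconstruct them.

```python
import numpy as np

def train_imputation_ae(G, hidden=32, epochs=200, lr=0.1, mask_rate=0.2, seed=0):
    """Toy denoising autoencoder for genotype imputation.

    G: (n_samples, n_variants) genotypes coded 0/1/2 and scaled to [0, 1].
    Random entries are zeroed during training; the network is trained to
    reconstruct the full matrix, so masked genotypes can later be imputed
    from the reconstruction.
    """
    rng = np.random.default_rng(seed)
    n, p = G.shape
    W1 = rng.normal(0, 0.1, (p, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, p)); b2 = np.zeros(p)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        mask = rng.random(G.shape) > mask_rate      # entries kept as input
        Xin = G * mask                              # corrupted input
        H = sigmoid(Xin @ W1 + b1)                  # encode
        out = sigmoid(H @ W2 + b2)                  # decode / reconstruct
        # backpropagation of the mean squared reconstruction error
        d_out = (out - G) * out * (1 - out) / n
        d_H = (d_out @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ d_out; b2 -= lr * d_out.sum(0)
        W1 -= lr * Xin.T @ d_H; b1 -= lr * d_H.sum(0)
    return W1, b1, W2, b2
```

Imputation then amounts to feeding a genotype vector with its missing entries zeroed through the trained network and reading the reconstructed values at the missing positions.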
- Research Article
- 10.1007/s11634-023-00573-3
- Dec 15, 2023
- Advances in Data Analysis and Classification
In modern data analysis, sparse model selection becomes inevitable once the number of predictor variables is very high. It is well known that model selection procedures like the Lasso or Boosting tend to overfit on real data. The celebrated Stability Selection overcomes these weaknesses by aggregating models based on subsamples of the training data, followed by choosing a stable predictor set, which is usually much sparser than the predictor sets from the raw models. The standard Stability Selection is based on a global criterion, namely the per-family error rate, while additionally requiring expert knowledge to suitably configure the hyperparameters. Model selection depends on the loss function, i.e., predictor sets selected w.r.t. one loss function differ from those selected w.r.t. another. Therefore, we propose a Stability Selection variant that respects the chosen loss function via an additional validation step based on out-of-sample validation data, optionally enhanced with an exhaustive search strategy. Our Stability Selection variants are widely applicable and user-friendly. Moreover, they can avoid the issue of severe underfitting that affects the original Stability Selection for noisy high-dimensional data, so our priority is not to avoid false positives at all costs but to produce a sparse, stable model with which one can make predictions. Experiments in which we consider both regression and binary classification, with Boosting as the model selection algorithm, reveal a significant precision improvement compared to raw Boosting models, while not suffering from any of the mentioned issues of the original Stability Selection.
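For orientation, here is a minimal sketch of the selection-frequency core that all Stability Selection variants share. It is the plain version with hypothetical names, using scikit-learn's Lasso as the base selector; the variant above additionally picks the final stable set by minimizing the chosen loss on out-of-sample validation data rather than by controlling the per-family error rate.

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequencies(X, y, alpha, n_subsamples=100, frac=0.5, seed=0):
    """Fit the Lasso on random subsamples and record how often each
    predictor enters the model; the stable set consists of predictors
    whose frequency exceeds a user-chosen threshold."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
        counts += coef != 0          # which predictors were selected
    return counts / n_subsamples
```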
- Research Article
- 10.1360/n012013-00101
- Sep 1, 2015
- SCIENTIA SINICA Mathematica
Computing the regularization paths of generalized linear models (GLM) with the group LASSO penalty can be decomposed into two problems: selecting the path parameter λ, and computing the group LASSO solution $\hat{\beta}(\lambda)$ for a given λ. In practice, a grid method is usually used to solve the first, and a coordinate descent algorithm based on the first-order Taylor expansion of the GLM loss function is then used to solve the second. This paper proposes algorithms that solve these two problems more efficiently. First, we give a path-following algorithm that attempts to find the values of λ at which the active set changes. Second, we take advantage of the properties of GLMs and use a second-order, instead of first-order, Taylor approximation of the GLM loss function in the coordinate descent method to achieve better precision in less time. Simulated and real data sets show that our algorithm is capable of efficiently pinpointing the critical values of λ that pair with changes of the active set, and that our proposed coordinate descent algorithm based on the second-order approximation is competitive in speed compared with that based on the first-order approximation.
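The second-order step can be written out explicitly. Expanding the GLM negative log-likelihood at the current iterate $\beta^{(t)}$ to second order gives, in standard IRLS form (our notation; the paper's details may differ), the penalized weighted least squares subproblem

$$ \min_{\beta}\; \tfrac12\,(z - X\beta)^\top W\,(z - X\beta) + \lambda \sum_{g} \|\beta_g\|_2, \qquad z = X\beta^{(t)} + W^{-1}\,(y - \mu^{(t)}), $$

where $W$ holds the working weights and $\mu^{(t)}$ the current fitted means; coordinate descent is then applied groupwise to this quadratic approximation rather than to the cruder first-order one.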
- Research Article
- 10.1080/00224065.2020.1805380
- Aug 8, 2020
- Journal of Quality Technology
Directed graphical models aim to represent the probabilistic relationships between variables in a system. Learning a directed graphical model from data includes parameter learning and structure learning. Several methods have been developed for directed graphical models with scalar variables; however, the case in which the variables are infinite-dimensional has not been studied thoroughly. Nowadays, in many applications, the variables are infinite-dimensional signals that need to be treated as functional random variables. This article proposes a novel method to learn directed graphical models in the functional setting. When the structure of the graph is known, function-to-function linear regression is used to estimate the parameters of the graph. When the goal is to learn the structure, a penalized least squares loss function is defined with a group LASSO penalty, for variable selection, and an $L_2$ penalty, to handle group selection of nodes. A cyclic coordinate accelerated proximal gradient descent algorithm is employed to minimize the loss function and learn the structure of the directed graph. Through simulations and a case study, the advantage of the proposed method is demonstrated.
- Research Article
- 10.1007/bf01085923
- Dec 1, 1981
- Journal of Soviet Mathematics
A definition is given for stability of the problem of statistical estimation. It is shown that if a difference loss function is used and if the symmetric statistics form a complete class of estimates, then the problem of statistical estimation is unstable. Examples are given of nondifference loss functions for which the problem of estimation is stable.
- Research Article
- 10.1080/10485252.2012.661054
- Apr 30, 2012
- Journal of Nonparametric Statistics
Semiparametric regression models with multiple covariates are commonly encountered. When there are covariates that are not associated with the response variable, variable selection may lead to sparser models, more lucid interpretations, and more accurate estimation. In this study, we adopt a sieve approach for the estimation of nonparametric covariate effects in semiparametric regression models, together with a two-step iterated penalisation approach for variable selection. In the first step, a mixture of Lasso and group Lasso penalties is employed to conduct the first-round variable selection and obtain the initial estimate. In the second step, a mixture of weighted Lasso and weighted group Lasso penalties, with weights constructed using the initial estimate, is employed for variable selection. We show that the proposed iterated approach has the variable selection consistency property, even when the number of unknown parameters diverges with the sample size. Numerical studies, including simulation and the analysis of a diabetes data set, show satisfactory performance of the proposed approach.
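The weighting in the second step follows the adaptive lasso idea. One common construction (a sketch in our notation, with $\hat\beta^{(0)}$ the first-step estimate and $\epsilon > 0$ a small constant; the paper's exact weights may differ) is

$$ w_j = \big(|\hat\beta_j^{(0)}| + \epsilon\big)^{-1}, \qquad w_g = \big(\|\hat\beta_g^{(0)}\|_2 + \epsilon\big)^{-1}, $$

so that individual coefficients and groups that looked unimportant in the first round are penalized more heavily in the second, which is the usual mechanism behind selection consistency in adaptive penalization.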
- Research Article
- 10.3150/11-bej364
- Aug 1, 2012
- Bernoulli
We define the group-lasso estimator for the natural parameters of the exponential families of distributions representing hierarchical log-linear models under a multinomial sampling scheme. This estimator arises as the solution of a convex penalized likelihood optimization problem based on the group-lasso penalty. We illustrate how it is possible to construct an estimator of the underlying log-linear model using the blocks of nonzero coefficients recovered by the group-lasso procedure. We investigate the asymptotic properties of the group-lasso estimator as a model selection method in a double-asymptotic framework, in which both the sample size and the model complexity grow simultaneously. We provide conditions guaranteeing that the group-lasso estimator is model selection consistent, in the sense that, with overwhelming probability as the sample size increases, it correctly identifies all the sets of nonzero interactions among the variables. Provided the sequence of true underlying models is sparse enough, recovery is possible even if the number of cells grows larger than the sample size. Finally, we derive some central-limit-type results for the log-linear group-lasso estimator.
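In symbols (our notation), with $\ell_n$ the multinomial log-likelihood of the log-linear model and $\mathcal{G}$ the blocks of natural parameters corresponding to the interaction terms, the estimator solves

$$ \hat\theta = \arg\min_{\theta}\; -\ell_n(\theta) + \lambda \sum_{g \in \mathcal{G}} \|\theta_g\|_2 , $$

and an estimate of the underlying hierarchical model is read off from the blocks with $\hat\theta_g \neq 0$.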
- Research Article
- 10.1016/j.csda.2021.107412
- Dec 15, 2021
- Computational Statistics & Data Analysis
A likelihood-based boosting algorithm for factor analysis models with binary data
- Research Article
- 10.1002/sta4.123
- Jan 1, 2016
- Stat
This paper studies the asymptotic properties of the penalized least squares estimator using an adaptive group Lasso penalty for reduced rank regression. The group Lasso penalty is defined in such a way that the regression coefficients corresponding to each predictor are treated as one group. It is shown that, under certain regularity conditions, the estimator can achieve the minimax optimal rate of convergence. Moreover, variable selection consistency can also be achieved, that is, the relevant predictors can be identified with probability approaching one. In the asymptotic theory, the number of response variables, the number of predictors, and the rank are allowed to grow to infinity with the sample size.
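Concretely (in our notation), with coefficient matrix $B \in \mathbb{R}^{p \times q}$ and each predictor's row $b_{j\cdot}$ treated as one group, the estimator takes the form

$$ \hat B = \arg\min_{\operatorname{rank}(B) \le r}\; \tfrac12\,\|Y - XB\|_F^2 + \lambda \sum_{j=1}^{p} w_j\,\|b_{j\cdot}\|_2 , $$

so a predictor leaves the model exactly when its entire row of coefficients is shrunk to zero; the adaptive weights $w_j$ are built from an initial estimate, as in the adaptive lasso.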
- Research Article
- 10.1016/j.ympev.2021.107086
- Feb 18, 2021
- Molecular Phylogenetics and Evolution
Assessing topological congruence among concatenation-based phylogenomic approaches in empirical datasets
- Research Article
- 10.1007/s11336-002-0998-4
- Sep 1, 2005
- Psychometrika
For the exploratory analysis of a matrix of proximities or (dis)similarities between objects, one often uses cluster analysis (CA) or multidimensional scaling (MDS). Solutions resulting from such analyses are sometimes interpreted using external information on the objects. Usually the procedures of CA, MDS and using external information are carried out independently and sequentially, although combinations of two of the three procedures (CA and MDS, or MDS and using external information) have been proposed in the literature. The present paper offers a procedure that combines all three procedures in one analysis, using a model that describes a partition of objects with cluster centroids represented in a low-dimensional space, which in turn is related to the information in the external variables. A simulation study is carried out to demonstrate that the method works satisfactorily for data with a known underlying structure. Also, to illustrate the method, it is applied to two empirical data sets.