Sparse constrained and unconstrained non-symmetric correspondence analysis

  • Abstract
  • References
  • Similar Papers
Abstract

In this paper, we propose to regularize non-symmetric correspondence analysis (NSCA) and its canonical variant by employing LASSO and group LASSO penalties. NSCA visualizes the asymmetric association structure of a categorical predictor variable and a categorical response variable through a biplot with points for the predictor categories and vectors for the response categories. In canonical NSCA, external information is available about the categories of the predictor variable, and this information is used to linearly constrain the coordinates of the points. When the number of predictor categories or the number of external variables is large, this leads to problems of interpretation and/or estimation. To avoid these problems, we propose to use a LASSO or group LASSO penalty on the parameters. Such penalties shrink the parameters to zero, yielding a sparse solution. To this end, we first cast (constrained) NSCA as a least squares estimation problem and then add the penalty to the least squares loss function. We derive a Majorization-Minimization algorithm to minimize this loss function. A bootstrap procedure is proposed for model selection, that is, for determining the optimal dimensionality and the optimal value of the penalty parameter. The procedures are illustrated using two empirical data sets, one for constrained (i.e., canonical) NSCA and one for unconstrained NSCA. We discuss in detail the model selection procedure and the interpretation of the selected model.
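
The penalized-least-squares-plus-MM strategy described in the abstract can be illustrated on a generic lasso problem. The sketch below shows only the core MM idea — majorizing each absolute-value term by a quadratic at the current iterate, so that every MM step reduces to a weighted ridge regression — and is not the paper's NSCA algorithm; the function name and settings are our own.

```python
import numpy as np

def mm_lasso(X, y, lam, n_iter=200, eps=1e-8):
    """Majorization-Minimization for 0.5*||y - X b||^2 + lam*||b||_1.
    At iterate b_k, each |b_j| is majorized by b_j^2/(2|b_j^k|) + |b_j^k|/2,
    so the M-step is a weighted ridge regression (a linear solve)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]   # start from the OLS solution
    for _ in range(n_iter):
        d = lam / np.maximum(np.abs(b), eps)   # majorizer weights lam/|b_j^k|
        b = np.linalg.solve(X.T @ X + np.diag(d), X.T @ y)
    b[np.abs(b) < 1e-6] = 0.0                  # clip numerically-zero coefficients
    return b
```

Each step solves a ridge system whose per-coefficient penalty grows as the coefficient shrinks, which is what drives the iterates toward an exactly sparse lasso solution.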

References (showing 10 of 31 papers)
  • Cited by 19
  • 10.1002/widm.1198
An introduction to Majorization-Minimization algorithms for machine learning and statistical estimation
  • Feb 9, 2017
  • WIREs Data Mining and Knowledge Discovery
  • Hien D Nguyen

  • Cited by 264
  • 10.1080/01621459.1971.10482297
An Analysis of Variance for Categorical Data
  • Sep 1, 1971
  • Journal of the American Statistical Association
  • Richard J Light + 1 more

  • Cited by 1841
  • 10.1201/b18401
Statistical Learning with Sparsity
  • May 7, 2015
  • Trevor Hastie + 2 more

  • Cited by 717
  • 10.1017/cbo9781316576533
Computer Age Statistical Inference
  • Jul 5, 2016
  • Bradley Efron + 1 more

  • Open Access
  • Cited by 50
  • 10.1186/1471-2105-12-448
A flexible framework for sparse simultaneous component based data integration
  • Nov 15, 2011
  • BMC Bioinformatics
  • Katrijn Van Deun + 4 more

  • Cited by 144
  • 10.1002/9781118762875
Correspondence Analysis
  • Aug 29, 2014
  • Eric J Beh

  • Cited by 9
  • 10.1007/s00357-011-9070-3
Correspondence Analysis with Linear Constraints of Ordinal Cross-Classifications
  • Jan 12, 2011
  • Journal of Classification
  • Antonello D’Ambra + 1 more

  • Open Access
  • Cited by 17703
  • 10.1111/j.1467-9868.2005.00503.x
Regularization and Variable Selection Via the Elastic Net
  • Mar 9, 2005
  • Journal of the Royal Statistical Society Series B: Statistical Methodology
  • Hui Zou + 1 more

  • Cited by 10
  • 10.1111/j.1467-842x.2009.00564.x
Non-Symmetrical Correspondence Analysis with Concatenation and Linear Constraints
  • Feb 17, 2010
  • Australian & New Zealand Journal of Statistics
  • Eric J Beh + 1 more

  • Cited by 1741
  • 10.1198/0003130042836
A Tutorial on MM Algorithms
  • Feb 1, 2004
  • The American Statistician
  • David R Hunter + 1 more

Similar Papers
  • Research Article
  • Cited by 12
  • 10.1007/s11634-009-0054-7
Tests of ignoring and eliminating in nonsymmetric correspondence analysis
  • Nov 14, 2009
  • Advances in Data Analysis and Classification
  • Yoshio Takane + 1 more

Nonsymmetric correspondence analysis (NSCA) aims to examine predictive relationships between rows and columns of a contingency table. The predictor categories of such tables are often accompanied by some auxiliary information. Constrained NSCA (CNSCA) incorporates such information as linear constraints on the predictor categories. However, imposing constraints also means that part of the predictive relationship is left unaccounted for by the constraints. A method of NSCA is proposed for analyzing the residual part along with the part accounted for by the constraints. The CATANOVA test may be invoked to test the significance of each part. The two tests parallel the distinction between tests of ignoring and eliminating, and help gain some insight into what is known as Simpson’s paradox in the analysis of contingency tables. Two examples are given to illustrate the distinction.

  • Research Article
  • Cited by 21
  • 10.1016/j.csda.2008.09.004
Regularized nonsymmetric correspondence analysis
  • Sep 7, 2008
  • Computational Statistics & Data Analysis
  • Yoshio Takane + 1 more


  • Research Article
  • Cited by 14
  • 10.1109/tase.2019.2941167
Process Modeling and Prediction With Large Number of High-Dimensional Variables Using Functional Regression
  • Oct 18, 2019
  • IEEE Transactions on Automation Science and Engineering
  • Mostafa Reisi Gahrooei + 3 more

Learning the relationship between a response variable (e.g., a quality characteristic) and a set of predictors (e.g., process variables) is of special importance in process modeling, prediction, and optimization. In many applications, not only is the number of these variables large, but the variables are also high-dimensional (HD) (e.g., they are represented by waveform signals). This high dimensionality requires a systematic approach to both modeling the relationship between the variables and removing the noninformative input variables. This article proposes a functional regression method in which an HD response is estimated and predicted through a set of informative and noninformative HD covariates. For this purpose, the functional regression coefficients are expanded through a set of low-dimensional smooth basis functions. In order to estimate the low-dimensional set of parameters, a penalized loss function with both smoothing and group lasso penalties is defined. The block coordinate descent (BCD) method is employed to develop a computationally tractable algorithm for minimizing the loss function. Through simulations and case studies, the performance of the proposed method is evaluated and compared with benchmarks. The results illustrate the advantage of the proposed method over the benchmarks. Note to Practitioners: This article proposes a method for efficient and interpretable modeling of processes with high-dimensional (HD) data, such as waveform signals. In particular, the proposed method generates a regression model that predicts a function (e.g., a sensor's readings over time) using several functional inputs. Existing functional regression techniques are mostly limited to a single functional input and are focused on profile data. In many applications, however, a large number of process variables are available for estimating an HD output, such as an image.
This article addresses these problems by employing basis functions to reduce the dimension of the functions and by introducing specific penalties that remove noninformative inputs and improve computational efficiency. A model generated by the proposed approach can be used for process monitoring and optimization. Using simulation and case studies, the performance of the developed method is evaluated and compared with other methods under various scenarios. This can provide practitioners with useful guidelines for selecting an appropriate method for process modeling.

  • Research Article
  • 10.11648/j.ijtam.20190502.11
A Review of Constrained Principal Component Analysis (CPCA) with Application on Bootstrap
  • Jan 1, 2019
  • International Journal of Theoretical and Applied Mathematics
  • Alaa Ahmed Abd Elmegaly

The linear model (LM) represents a major advance in regression analysis and has been considered one of the most important statistical developments of the last fifty years, followed by the general linear model (GLM), principal component analysis (PCA), and constrained principal component analysis (CPCA) in the last thirty years. This paper introduces a series of papers prepared within the framework of an international workshop. First, the LM and GLM are discussed. Next, an overview of PCA is presented, followed by CPCA. Several of its special cases are noted, including PCA, canonical correlation analysis (CANO), redundancy analysis (RA), correspondence analysis (CA), growth curve models (GCM), extended growth curve models (ExGCM), canonical discriminant analysis (CDA), constrained correspondence analysis, non-symmetric correspondence analysis, multiple-set CANO, multiple correspondence analysis, vector preference models, seemingly unrelated regression (SUR), weighted low-rank approximations, two-way canonical decomposition with linear constraints, and multilevel RA. Related methods, and the ordinary least squares (OLS) estimator as a special case of CPCA, are also introduced. Finally, an example illustrates the importance of CPCA and the difference between PCA and CPCA. CPCA is a method for the structural analysis of multivariate data that combines features of regression analysis and principal component analysis: the original data are first decomposed into several components according to external information, and the components are then subjected to principal component analysis to explore structures within them.

  • Peer Review Report
  • Cited by 3
  • 10.7554/elife.75600.sa2
Author response: Rapid, Reference-Free human genotype imputation with denoising autoencoders
  • Feb 23, 2022
  • Raquel Dias + 6 more

Genotype imputation is a foundational tool for population genetics. Standard statistical imputation approaches rely on the co-location of large whole-genome sequencing-based reference panels, powerful computing environments, and potentially sensitive genetic study data. This results in computational resource and privacy-risk barriers to access to cutting-edge imputation techniques. Moreover, the accuracy of current statistical approaches is known to degrade in regions of low and complex linkage disequilibrium. Artificial neural network-based imputation approaches may overcome these limitations by encoding complex genotype relationships in easily portable inference models. Here, we demonstrate an autoencoder-based approach for genotype imputation, using a large, commonly used reference panel, and spanning the entirety of human chromosome 22. Our autoencoder-based genotype imputation strategy achieved superior imputation accuracy across the allele-frequency spectrum and across genomes of diverse ancestry, while delivering at least fourfold faster inference run time relative to standard imputation tools.

  • Research Article
  • Cited by 1
  • 10.1007/s11634-023-00573-3
Loss-guided stability selection
  • Dec 15, 2023
  • Advances in Data Analysis and Classification
  • Tino Werner

In modern data analysis, sparse model selection becomes inevitable once the number of predictor variables is very high. It is well-known that model selection procedures like the Lasso or Boosting tend to overfit on real data. The celebrated Stability Selection overcomes these weaknesses by aggregating models, based on subsamples of the training data, followed by choosing a stable predictor set which is usually much sparser than the predictor sets from the raw models. The standard Stability Selection is based on a global criterion, namely the per-family error rate, while additionally requiring expert knowledge to suitably configure the hyperparameters. Model selection depends on the loss function, i.e., predictor sets selected w.r.t. some particular loss function differ from those selected w.r.t. some other loss function. Therefore, we propose a Stability Selection variant which respects the chosen loss function via an additional validation step based on out-of-sample validation data, optionally enhanced with an exhaustive search strategy. Our Stability Selection variants are widely applicable and user-friendly. Moreover, our Stability Selection variants can avoid the issue of severe underfitting, which affects the original Stability Selection for noisy high-dimensional data, so our priority is not to avoid false positives at all costs but to result in a sparse stable model with which one can make predictions. Experiments where we consider both regression and binary classification with Boosting as model selection algorithm reveal a significant precision improvement compared to raw Boosting models while not suffering from any of the mentioned issues of the original Stability Selection.
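
The subsample-and-aggregate idea behind Stability Selection can be sketched in a few lines: fit a sparse model on many half-size subsamples, record how often each predictor is selected, and keep the stable ones. The sketch below is a minimal numpy-only illustration of the basic selection-frequency scheme (without the loss-guided validation step this paper adds); the lasso solver is a plain ISTA iteration, and all names and defaults are our own.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """Plain proximal-gradient (ISTA) for 0.5*||y - X b||^2 + lam*||b||_1."""
    L = np.linalg.norm(X, 2) ** 2              # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - X.T @ (X @ b - y) / L          # gradient step
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return b

def stability_selection(X, y, lam, n_subsamples=50, threshold=0.6, seed=0):
    """Selection frequency of each predictor across half-size subsamples;
    predictors whose frequency reaches `threshold` form the stable set."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        freq += np.abs(ista_lasso(X[idx], y[idx], lam)) > 1e-6
    freq /= n_subsamples
    return freq, np.flatnonzero(freq >= threshold)
```

The aggregated frequencies are typically much sparser and more reproducible than any single lasso fit, which is the point the abstract makes about stability.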

  • Research Article
  • 10.1360/n012013-00101
An algorithm for the estimation of regularization paths of generalized linear models with group LASSO penalty
  • Sep 1, 2015
  • SCIENTIA SINICA Mathematica
  • Xinlian Zhang + 3 more

Computing the regularization paths of generalized linear models (GLM) with a group LASSO penalty can be decomposed into two problems: selecting the path parameter λ, and computing the group LASSO solution for a given λ. In practice, a grid method is usually used to solve the first, and a coordinate descent algorithm based on a first-order Taylor expansion of the GLM loss function is used to solve the second. This paper proposes algorithms that solve both problems more efficiently. First, we give a path-following algorithm that attempts to find the values of λ at which the active set changes. Second, we exploit the properties of GLMs and use a second-order, instead of first-order, Taylor approximation of the loss function in the coordinate descent method, achieving better precision in less time. Simulated and real data sets show that our algorithm efficiently pinpoints the critical values of λ that correspond to changes of the active set, and that our proposed coordinate descent algorithm based on the second-order approximation is competitive in speed with that based on the first-order approximation.
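
The first-order versus second-order distinction can be illustrated on plain logistic regression: a per-coordinate Newton step divides the gradient by the coordinate-wise curvature, which is the kind of second-order information exploited here. The sketch below omits the group penalty and path machinery entirely; the function name and settings are our own.

```python
import numpy as np

def cd_logistic_newton(X, y, n_sweeps=50):
    """Cyclic coordinate descent for (unpenalized) logistic loss.
    Each coordinate takes a Newton step, gradient / curvature,
    i.e. second-order rather than first-order information."""
    b = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            mu = 1.0 / (1.0 + np.exp(-(X @ b)))            # current probabilities
            g = X[:, j] @ (mu - y)                          # d loss / d b_j
            h = (X[:, j] ** 2) @ (mu * (1.0 - mu)) + 1e-8   # d2 loss / d b_j^2
            b[j] -= g / h
    return b
```

A first-order variant would replace `g / h` by a fixed-step gradient move; the second-order step adapts to the local curvature and typically needs fewer sweeps.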

  • Research Article
  • Cited by 11
  • 10.1080/00224065.2020.1805380
Functional directed graphical models and applications in root-cause analysis and diagnosis
  • Aug 8, 2020
  • Journal of Quality Technology
  • Ana María Estrada Gómez + 2 more

Directed graphical models aim to represent the probabilistic relationships between variables in a system. Learning a directed graphical model from data includes parameter learning and structure learning. Several methods have been developed for directed graphical models with scalar variables. However, the case in which the variables are infinite-dimensional has not been studied thoroughly. Nowadays, in many applications, the variables are infinite-dimensional signals that need to be treated as functional random variables. This article proposes a novel method to learn directed graphical models in the functional setting. When the structure of the graph is known, function-to-function linear regression is used to estimate the parameters of the graph. When the goal is to learn the structure, a penalized least squares loss function is defined with a group LASSO penalty, for variable selection, and an L2 penalty, to handle group selection of nodes. A cyclic coordinate accelerated proximal gradient descent algorithm is employed to minimize the loss function and learn the structure of the directed graph. Through simulations and a case study, the advantage of the proposed method is demonstrated.

  • Research Article
  • Cited by 1
  • 10.1007/bf01085923
Stability in the problem of statistical estimation and a choice of the loss function
  • Dec 1, 1981
  • Journal of Soviet Mathematics
  • L B Klebanov

A definition is given for stability of the problem of statistical estimation. It is shown that if a difference loss function is used and if the symmetric statistics form a complete class of estimates, then the problem of statistical estimation is unstable. Examples are given of nondifference loss functions for which the problem of estimation is stable.

  • Research Article
  • Cited by 2
  • 10.1080/10485252.2012.661054
Variable selection for semiparametric regression models with iterated penalisation
  • Apr 30, 2012
  • Journal of Nonparametric Statistics
  • Ying Dai + 1 more

Semiparametric regression models with multiple covariates are commonly encountered. When there are covariates that are not associated with a response variable, variable selection may lead to sparser models, more lucid interpretations and more accurate estimation. In this study, we adopt a sieve approach for the estimation of nonparametric covariate effects in semiparametric regression models. We adopt a two-step iterated penalisation approach for variable selection. In the first step, a mixture of Lasso and group Lasso penalties are employed to conduct the first-round variable selection and obtain the initial estimate. In the second step, a mixture of weighted Lasso and weighted group Lasso penalties, with weights constructed using the initial estimate, are employed for variable selection. We show that the proposed iterated approach has the variable selection consistency property, even when the number of unknown parameters diverges with sample size. Numerical studies, including simulation and analysis of a diabetes data set, show satisfactory performance of the proposed approach.

  • Research Article
  • Cited by 23
  • 10.3150/11-bej364
The log-linear group-lasso estimator and its asymptotic properties
  • Aug 1, 2012
  • Bernoulli
  • Yuval Nardi + 1 more

We define the group-lasso estimator for the natural parameters of the exponential families of distributions representing hierarchical log-linear models under a multinomial sampling scheme. Such an estimator arises as the solution of a convex penalized likelihood optimization problem based on the group-lasso penalty. We illustrate how it is possible to construct an estimator of the underlying log-linear model using the blocks of nonzero coefficients recovered by the group-lasso procedure. We investigate the asymptotic properties of the group-lasso estimator as a model selection method in a double-asymptotic framework, in which both the sample size and the model complexity grow simultaneously. We provide conditions guaranteeing that the group-lasso estimator is model selection consistent, in the sense that, with overwhelming probability as the sample size increases, it correctly identifies all the sets of nonzero interactions among the variables. Provided the sequence of true underlying models is sparse enough, recovery is possible even if the number of cells grows larger than the sample size. Finally, we derive some central limit type results for the log-linear group-lasso estimator.

  • Research Article
  • Cited by 3
  • 10.1016/j.csda.2021.107412
A likelihood-based boosting algorithm for factor analysis models with binary data
  • Dec 15, 2021
  • Computational Statistics & Data Analysis
  • Michela Battauz + 1 more


  • Research Article
  • Cited by 2
  • 10.1002/sta4.123
Asymptotic properties of adaptive group Lasso for sparse reduced rank regression
  • Jan 1, 2016
  • Stat
  • Kejun He + 1 more

This paper studies the asymptotic properties of the penalized least squares estimator using an adaptive group Lasso penalty for the reduced rank regression. The group Lasso penalty is defined in the way that the regression coefficients corresponding to each predictor are treated as one group. It is shown that under certain regularity conditions, the estimator can achieve the minimax optimal rate of convergence. Moreover, the variable selection consistency can also be achieved, that is, the relevant predictors can be identified with probability approaching one. In the asymptotic theory, the number of response variables, the number of predictors and the rank number are allowed to grow to infinity with the sample size.
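
The group Lasso penalty described here acts, in proximal algorithms, through a block soft-thresholding operator that either shrinks a whole group of coefficients or zeroes it out entirely — which is how entire predictors get dropped at once. A minimal sketch (function name is our own):

```python
import numpy as np

def group_soft_threshold(b, groups, t):
    """Proximal operator of t * sum_g ||b_g||_2.
    Each group's coefficient vector is shrunk toward zero by t in norm;
    groups whose norm is below t are set to zero entirely."""
    out = np.zeros_like(b, dtype=float)
    for g in groups:                       # g: index array for one group
        norm = np.linalg.norm(b[g])
        if norm > t:
            out[g] = (1.0 - t / norm) * b[g]
    return out
```

With all groups of size one, this reduces to the ordinary (elementwise) soft-thresholding of the lasso.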

  • Research Article
  • Cited by 7
  • 10.1016/j.ympev.2021.107086
Assessing topological congruence among concatenation-based phylogenomic approaches in empirical datasets
  • Feb 18, 2021
  • Molecular Phylogenetics and Evolution
  • Ambrosio Torres + 2 more


  • Research Article
  • Cited by 14
  • 10.1007/s11336-002-0998-4
Simultaneous Classification and Multidimensional Scaling with External Information
  • Sep 1, 2005
  • Psychometrika
  • Henk A L Kiers + 2 more

For the exploratory analysis of a matrix of proximities or (dis)similarities between objects, one often uses cluster analysis (CA) or multidimensional scaling (MDS). Solutions resulting from such analyses are sometimes interpreted using external information on the objects. Usually the procedures of CA, MDS and using external information are carried out independently and sequentially, although combinations of two of the three procedures (CA and MDS, or MDS and using external information) have been proposed in the literature. The present paper offers a procedure that combines all three procedures in one analysis, using a model that describes a partition of objects with cluster centroids represented in a low-dimensional space, which in turn is related to the information in the external variables. A simulation study is carried out to demonstrate that the method works satisfactorily for data with a known underlying structure. Also, to illustrate the method, it is applied to two empirical data sets.

More from: Advances in Data Analysis and Classification
  • Research Article
  • 10.1007/s11634-025-00659-0
Data-driven logistic regression ensembles with applications in genomics
  • Nov 25, 2025
  • Advances in Data Analysis and Classification
  • Anthony-Alexander Christidis + 2 more

  • Research Article
  • 10.1007/s11634-025-00660-7
Editorial for ADAC issue 4 of volume 19 (2025)
  • Nov 17, 2025
  • Advances in Data Analysis and Classification
  • Maurizio Vichi + 2 more

  • Research Article
  • 10.1007/s11634-025-00655-4
Low-bias discrimination of circular data with measurement errors
  • Oct 18, 2025
  • Advances in Data Analysis and Classification
  • Marco Di Marzio + 3 more

  • Research Article
  • 10.1007/s11634-025-00650-9
Two-stage principal component analysis on interval-valued data using patterned covariance structures
  • Jul 19, 2025
  • Advances in Data Analysis and Classification
  • Anuradha Roy

  • Addendum
  • 10.1007/s11634-025-00648-3
Correction to: Sparse correspondence analysis for large contingency tables
  • Jun 26, 2025
  • Advances in Data Analysis and Classification
  • Ruiping Liu + 3 more

  • Research Article
  • 10.1007/s11634-025-00646-5
Sparse constrained and unconstrained non-symmetric correspondence analysis
  • Jun 23, 2025
  • Advances in Data Analysis and Classification
  • Mark De Rooij + 1 more

  • Research Article
  • 10.1007/s11634-025-00651-8
Flexible multi-class cost-sensitive thresholding
  • Jun 22, 2025
  • Advances in Data Analysis and Classification
  • Jorge C-Rella + 1 more

  • Research Article
  • 10.1007/s11634-025-00639-4
Initialization strategies for clustering mixed-type data with the k-prototypes algorithm
  • Jun 12, 2025
  • Advances in Data Analysis and Classification
  • Rabea Aschenbruck + 2 more

  • Research Article
  • 10.1007/s11634-025-00643-8
Modeling time-dependent population proportions in a finite mixture model setting
  • Jun 6, 2025
  • Advances in Data Analysis and Classification
  • Igor Melnykov + 1 more

  • Research Article
  • 10.1007/s11634-025-00649-2
Increasing biases can be more efficient than increasing weights
  • Jun 1, 2025
  • Advances in Data Analysis and Classification
  • Carlo Metta + 10 more
