SOLVE: A structured orthogonal latent variable framework for disentangling confounding in matrix data
Latent factor models are valuable in bioinformatics for accounting for unmeasured variation alongside observed covariates. Yet many methods struggle to separate known effects from latent structure and to handle losses beyond standard regression. We present a unified framework that augments row and column predictors with a low-rank latent component, jointly modeling measured effects and residual variation. To remove ambiguity in estimating observed and latent effects, we impose a carefully designed set of orthogonality constraints on the coefficient and latent factor matrices, relative to the spans of the predictor matrices. These constraints ensure identifiability, yield a decomposition in which the latent term captures only variation unexplained by the covariates, and improve interpretability. An efficient algorithm handles general non-quadratic losses via surrogates with monotone descent. Each iteration updates the latent term by truncated singular value decomposition of a doubly projected residual and refines coefficients by projections. The number of latent factors is selected by applying an elbow rule to a degrees-of-freedom-adjusted information criterion. A parametric bootstrap provides valid inference on feature-outcome associations under the regularized low-rank structure. Applied to real pharmacogenomic data, the method recovers biologically coherent gene-drug associations missed by standard factor models, such as the EGFR-inhibitor link, highlights novel candidates with plausible mechanisms, and reveals gene programs aligned with compound modes of action, including a latent unfolded-protein-response module affecting drug sensitivity. These results support the framework’s utility for precision oncology, yielding stronger biomarkers for patient stratification and deeper insight into drug resistance mechanisms.
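The core update above (a truncated SVD of a doubly projected residual) can be sketched in numpy. The model form Y ≈ XB + CZᵀ + L, the coefficient matrices B and C, and the rank k are illustrative assumptions, not the paper's exact notation:

```python
import numpy as np

def latent_update(Y, X, Z, B, C, k):
    """Illustrative latent-term update: subtract the observed covariate
    effects, project the residual onto the orthogonal complements of
    col(X) (rows) and col(Z) (columns), then keep a rank-k truncated SVD."""
    n, p = Y.shape
    R = Y - X @ B - C @ Z.T                       # residual after observed effects
    Px = X @ np.linalg.pinv(X)                    # orthogonal projector onto col(X)
    Pz = Z @ np.linalg.pinv(Z)                    # orthogonal projector onto col(Z)
    R = (np.eye(n) - Px) @ R @ (np.eye(p) - Pz)   # doubly projected residual
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]            # rank-k latent component
```

By construction the returned latent term is orthogonal to the spans of both predictor matrices, which is exactly the identifiability property the abstract describes.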
- Front Matter
334
- 10.3389/fpsyg.2015.01064
- Jul 28, 2015
- Frontiers in Psychology
Multi-item surveys are frequently used to study scores on latent factors, like human values, attitudes, and behavior. Such studies often include a comparison between specific groups of individuals or residents of different countries, either at one or at multiple points in time (i.e., a cross-sectional comparison, a longitudinal comparison, or both). If latent factor means are to be meaningfully compared, the measurement structures of the latent factor and their survey items should be stable, that is, "invariant." As proposed by Mellenbergh (1989), "measurement invariance" (MI) requires that the association between the items (or test scores) and the latent factors (or latent traits) of individuals should not depend on group membership or measurement occasion (i.e., time). In other words, if item scores are (approximately) multivariate normally distributed conditional on the latent factor scores, the expected values, the covariances between items, and the unexplained variance unrelated to the latent factors should be equal across groups. Many studies examining MI of survey scales have shown that the MI assumption is very hard to meet. In particular, strict forms of MI rarely hold. By "strict" we refer to a situation in which measurement parameters are exactly the same across groups or measurement occasions, that is, an enforcement of zero tolerance with respect to deviations between groups or measurement occasions. Often, researchers simply ignore MI issues and compare latent factor means across groups or measurement occasions even though the psychometric basis for such a practice does not hold. However, when a strict form of MI is not established, one must conclude that respondents attach different meanings to the survey items, which makes valid comparisons between latent factor means impossible.
As such, the potential bias caused by measurement non-invariance obstructs the comparison of latent factor means (if strict MI does not hold) or regression coefficients (if less strict forms of MI do not hold). Traditionally, MI is tested in a multiple-group confirmatory factor analysis (MGCFA), with groups defined by unordered categorical (i.e., nominal) between-subject variables. In MGCFA, MI is tested at each constraint of the latent factor model using a series of nested (latent) factor models. This traditional way of testing for MI originated with Jöreskog (1971), who was the first scholar to thoroughly discuss the invariance of latent factor (or measurement) structures. Additionally, Sörbom (1974, 1978) pioneered the specification and estimation of latent factor means using a multi-group SEM approach in LISREL (Jöreskog and Sörbom, 1996). Following these contributions, the multi-group specification of latent factor structures has become widespread in all major SEM software programs (e.g., AMOS, Arbuckle, 2006; EQS, Bentler and Wu, 1995; lavaan, Rosseel, 2012; Mplus, Muthén and Muthén, 2013; Stata, StataCorp, 2015; and OpenMx, Boker et al., 2011). Later, Byrne et al. (1989) introduced the distinction between full and partial MI. Although their contribution was of great value, the first formal treatment of the different forms of MI and their consequences for the validity of multi-group/multi-time comparisons is attributable to Meredith (1993). Since then, a tremendous number of papers dealing with MI have been published. The literature on MI published in the 20th century is nicely summarized by Vandenberg and Lance (2000). Also noteworthy are the overview of applications in cross-cultural studies provided by Davidov et al. (2014) and a recent book by Millsap (2011) containing a general systematic treatment of the topic of MI. The traditional MGCFA approach to MI testing is described by, for example, Byrne (2004), Chen et al. (2005), Gregorich (2006), van de Schoot et al. (2012), Vandenberg (2002), and Wicherts and Dolan (2010). Researchers entering the field of MI are advised to first consult Meredith (1993) and Millsap (2011) before reading other valuable academic works. Recent developments in statistics have provided new analytical tools for assessing MI. The aim of this special issue is to provide a forum for a discussion of MI, covering several crucial themes: (1) ways to assess and deal with measurement non-invariance; (2) Bayesian and IRT methods employing the concept of approximate MI; and (3) new or adjusted approaches for testing MI that fit increasingly complex statistical models and specific characteristics of survey data.
- Conference Article
10
- 10.1109/icnsc.2018.8361355
- Mar 1, 2018
Latent factor (LF) models are highly effective in extracting useful knowledge from high-dimensional and sparse (HiDS) matrices, which are commonly seen in various industrial applications. An LF model usually adopts iterative optimizers, which may consume many iterations to reach a local optimum, resulting in considerable time cost. Hence, how to accelerate the training process of an LF model becomes a highly significant issue. To address it, this work proposes a randomized latent factor (RLF) model. It incorporates the principle of randomized learning techniques for neural networks into LF analysis on HiDS matrices to greatly alleviate the computational burden. It also extends the standard learning process for randomized neural networks to the context of LF analysis so that the resulting model represents an HiDS matrix correctly. Experimental results on three HiDS matrices from industrial applications demonstrate that, compared with state-of-the-art LF models, RLF achieves significantly higher computational efficiency and comparable prediction accuracy for missing data. More importantly, it provides a novel, effective, and efficient approach to LF analysis on HiDS matrices.
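One reading of the randomized-learning principle above, sketched in numpy: the user-side factors are drawn randomly and fixed (as in randomized neural networks), and only the item-side factors are fit by least squares on the observed entries. The function name, masking convention, and scaling are hypothetical, not the paper's exact procedure:

```python
import numpy as np

def randomized_lf(R, mask, k, seed=0):
    """Randomized LF sketch: fix random user factors U, then solve each
    item's factor vector by least squares over that item's observed ratings."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.standard_normal((m, k)) / np.sqrt(k)   # fixed random factors
    V = np.zeros((n, k))
    for j in range(n):
        obs = mask[:, j]                           # users who rated item j
        if obs.any():
            V[j], *_ = np.linalg.lstsq(U[obs], R[obs, j], rcond=None)
    return U, V
```

Because only one closed-form solve per item replaces many gradient iterations, the training cost drops sharply, which matches the efficiency claim in the abstract.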
- Conference Article
4
- 10.1109/icbk50248.2020.00075
- Aug 1, 2020
How to accurately represent a high-dimensional and sparse (HiDS) user-item rating matrix is a crucial issue in implementing a recommender system. A latent factor (LF) model is one of the most popular and successful approaches to address this issue. It is developed by minimizing the errors between the observed entries of an HiDS matrix and the estimated ones. Current studies commonly employ the L2-norm to minimize the errors because it has a smooth gradient, enabling the resulting LF model to accurately represent an HiDS matrix. As is well known, however, the L2-norm is highly sensitive to outlier data, also called unreliable ratings in the context of recommender systems. Unfortunately, unreliable ratings often exist in an HiDS matrix owing to malicious users. To address this issue, this paper proposes a Smooth L1-norm-oriented Latent Factor (SL1-LF) model. Its main idea is to employ the smooth L1-norm rather than the L2-norm to minimize the errors, giving it both high robustness and high accuracy in representing an HiDS matrix. Experimental results on four HiDS matrices generated by industrial recommender systems demonstrate that the proposed SL1-LF model is robust to outlier data and achieves significantly higher prediction accuracy than state-of-the-art models for the missing data of an HiDS matrix.
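A smooth L1 loss of the kind described can be sketched as a Huber-style function; the threshold parameter `delta` is an illustrative assumption, not the paper's exact definition:

```python
import numpy as np

def smooth_l1(e, delta=1.0):
    """Smooth L1 loss: quadratic near zero (so the gradient is smooth),
    linear in the tails (so outlier ratings contribute only linearly)."""
    a = np.abs(e)
    return np.where(a < delta, 0.5 * e ** 2 / delta, a - 0.5 * delta)

def smooth_l1_grad(e, delta=1.0):
    """The gradient is bounded in [-1, 1], so a single outlier cannot
    dominate an SGD update the way it does under the L2-norm."""
    return np.clip(e / delta, -1.0, 1.0)
```

The bounded gradient is the source of the robustness claim: under the L2-norm the gradient grows linearly with the error, so one malicious rating can swamp an update.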
- Research Article
- 10.15587/1729-4061.2012.4525
- Jan 1, 2012
- Eastern-European Journal of Enterprise Technologies
The estimation of road traffic safety is the fundamental stage in developing measures aimed at preventing and decreasing the number of accidents. The aim of the research is to define latent factors describing the traffic-condition parameters that influence the road traffic safety level. The number of latent factors is assumed to be smaller than the number of parameters in Professor V. F. Babkov's method for determining the final accident coefficient. Within the given research, the following problems are set and solved: justifying the application of factor analysis methods, defining the necessary number of factors describing the initial data, developing a simple structure of latent factors, and obtaining factor functions that explain the greater part of the accident coefficient variance. To justify the application of factor analysis methods, a linear model of the dependence of the final accident coefficient on the particular accident coefficients is developed. The Kaiser and Cattell criteria are applied to choose the number of new latent factors. In developing the new simple structure of latent factors, the application of Varimax as the factor rotation method is justified. The obtained factors explain more than 80% of the variance of the initial values of the particular accident coefficients. The presented results are of great importance for further research on improving methods for determining the road traffic safety level under given traffic conditions.
- Research Article
23
- 10.1109/tnnls.2023.3321915
- Jan 1, 2025
- IEEE transactions on neural networks and learning systems
High-dimensional and incomplete (HDI) data are frequently encountered in big data-related applications for describing restricted observed interactions among large node sets. How to perform accurate and efficient representation learning on such HDI data is a hot yet thorny issue. A latent factor (LF) model has proven to be efficient in addressing it. However, the objective function of an LF model is nonconvex. Commonly adopted first-order methods cannot approach its second-order stationary points, thereby resulting in accuracy loss. On the other hand, traditional second-order methods are impractical for LF models since they suffer from high computational costs due to the required operations on the objective's huge Hessian matrix. To address this issue, this study proposes a generalized Nesterov-accelerated second-order LF (GNSLF) model that integrates two ideas: 1) acquiring a proper second-order step efficiently by adopting a Hessian-vector algorithm and 2) embedding the second-order step into a generalized Nesterov acceleration (GNA) method to speed up its line search process. The analysis focuses on the local convergence of GNSLF's nonconvex cost function rather than its global convergence; its local convergence properties are established with theoretical proofs. Experimental results on six HDI data cases demonstrate that GNSLF outperforms state-of-the-art LF models in accuracy for missing data estimation with high efficiency, i.e., a second-order model can be accelerated by incorporating GNA without accuracy loss.
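The Hessian-vector idea above avoids ever forming the huge Hessian. A generic finite-difference sketch of a Hessian-vector product (the paper may instead use an analytic or autodiff-based product; this version only assumes access to a gradient function):

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-5):
    """H v ~= (grad(w + eps*v) - grad(w - eps*v)) / (2*eps).
    Two gradient evaluations give the curvature along direction v
    without materializing the full Hessian matrix."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)
```

For a model with millions of parameters this costs two gradient passes per product, which is what makes a second-order step tractable for LF analysis.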
- Conference Article
10
- 10.1109/icbk50248.2020.00074
- Aug 1, 2020
The valuable knowledge contained in high-dimensional and sparse (HiDS) matrices can be efficiently extracted by a latent factor (LF) model. Regularization techniques are widely incorporated into an LF model to avoid overfitting. The regularization coefficient is crucial to a model's prediction accuracy; however, its tuning process is time-consuming and tedious. This study aims at making the regularization coefficient of a regularized LF model self-adaptive. To do so, an adaptive particle swarm optimization (APSO) algorithm is introduced into a regularized LF model to automatically select the optimal regularization coefficient. Then, to enhance the global search capability of particles, we further propose an APSO- and particle swarm optimization (PSO)-incorporated (AP) algorithm, thereby achieving an AP-based LF (APLF) model. Experimental results on four HiDS matrices generated by real applications demonstrate that an APLF model achieves automatic selection of the regularization coefficient and is superior to a regularized LF model in terms of prediction accuracy and computational efficiency.
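A minimal plain-PSO sketch of the self-adaptation idea: each particle is a candidate regularization coefficient and its fitness is validation error. All hyperparameters and the `val_error` callback are illustrative assumptions; the paper's APSO additionally adapts the swarm's own parameters:

```python
import numpy as np

def pso_tune_lambda(val_error, n_particles=8, iters=30, bounds=(1e-4, 1.0),
                    w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO over a scalar regularization coefficient: particles
    move under inertia plus pulls toward personal and global bests."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, n_particles)            # candidate lambdas
    v = np.zeros(n_particles)                       # velocities
    pbest = x.copy()
    pbest_f = np.array([val_error(l) for l in x])   # personal-best fitness
    g = pbest[pbest_f.argmin()]                     # global-best lambda
    for _ in range(iters):
        r1, r2 = rng.random(n_particles), rng.random(n_particles)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([val_error(l) for l in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        g = pbest[pbest_f.argmin()]
    return g
```

In practice `val_error` would retrain (or warm-start) the regularized LF model at each candidate coefficient and return held-out error.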
- Research Article
7
- 10.1097/aud.0000000000001430
- Oct 26, 2023
- Ear and hearing
The link between memory ability and speech recognition accuracy is often examined by correlating summary measures of performance across various tasks, but interpretation of such correlations critically depends on assumptions about how these measures map onto underlying factors of interest. The present work presents an alternative approach, wherein latent factor models are fit to trial-level data from multiple tasks to directly test hypotheses about the underlying structure of memory and the extent to which latent memory factors are associated with individual differences in speech recognition accuracy. Latent factor models with different numbers of factors were fit to the data and compared to one another to select the structures that best explained vocoded sentence recognition in a two-talker masker across a range of target-to-masker ratios, performance on three memory tasks, and the link between sentence recognition and memory. Young adults with normal hearing (N = 52 for the memory tasks, of whom 21 also completed the sentence recognition task) completed three memory tasks and one sentence recognition task: reading span, auditory digit span, visual free recall of words, and recognition of 16-channel vocoded Perceptually Robust English Sentence Test Open-set sentences in the presence of a two-talker masker at target-to-masker ratios between +10 and 0 dB. Correlations between summary measures of memory task performance and sentence recognition accuracy were calculated for comparison to prior work, and latent factor models were fit to trial-level data and compared against one another to identify the number of latent factors that best explains the data. Models with one or two latent factors were fit to the sentence recognition data, and models with one, two, or three latent factors were fit to the memory task data. 
Based on findings with these models, full models that linked one speech factor to one, two, or three memory factors were fit to the full data set. Models were compared via Expected Log pointwise Predictive Density and post hoc inspection of model parameters. Summary measures were positively correlated across memory tasks and sentence recognition. Latent factor models revealed that sentence recognition accuracy was best explained by a single factor that varied across participants. Memory task performance was best explained by two latent factors, of which one was generally associated with performance on all three tasks and the other was specific to digit span recall accuracy at lists of six digits or more. When these models were combined, the general memory factor was closely related to the sentence recognition factor, whereas the factor specific to digit span had no apparent association with sentence recognition. Comparison of latent factor models enables testing hypotheses about the underlying structure linking cognition and speech recognition. This approach showed that multiple memory tasks assess a common latent factor that is related to individual differences in sentence recognition, although performance on some tasks was associated with multiple factors. Thus, while these tasks provide some convergent assessment of common latent factors, caution is needed when interpreting what they tell us about speech recognition.
- Research Article
18
- 10.1016/j.knosys.2017.02.010
- Feb 14, 2017
- Knowledge-Based Systems
Performance of latent factor models with extended linear biases
- Research Article
118
- 10.1109/tnnls.2022.3200009
- Mar 1, 2024
- IEEE Transactions on Neural Networks and Learning Systems
Performing highly accurate representation learning on a high-dimensional and sparse (HiDS) matrix is of great significance in big data-related applications such as recommender systems. A latent factor (LF) model is one of the most efficient approaches to HiDS matrix representation. However, an LF model's representation learning ability relies heavily on an HiDS matrix's known data density, which is extremely low due to the numerous missing data entries. To address this issue, this work proposes a prediction-sampling-based multilayer-structured LF (PMLF) model with two ideas: 1) constructing a loosely connected multilayered LF architecture that increases the known data density of an input HiDS matrix by generating synthetic data layer by layer and 2) constraining this synthetic-data generation process through a random prediction-sampling strategy and nonlinear activations to avoid overfitting. In the experiments, PMLF is compared with six state-of-the-art LF- and deep neural network (DNN)-based models on four HiDS matrices from industrial applications. The results demonstrate that PMLF outperforms its peers in well balancing prediction accuracy and computational efficiency.
- Conference Article
2
- 10.1109/icnsc48988.2020.9238055
- Oct 30, 2020
High-dimensional and sparse (HiDS) matrices generated by recommender systems (RSs) contain rich knowledge. A latent factor (LF) model can address such data effectively. Stochastic gradient descent (SGD) is an efficient algorithm for building an LF model on an HiDS matrix; however, it suffers from slow convergence. To address this issue, this study proposes to implement an LF model with a proportional-integral-derivative (PID) controller. The main idea is to continuously apply a correction to SGD to accelerate the training process. Based on this design, a PID-based LF (PLF) model is proposed. Empirical studies on two HiDS matrices from RSs indicate that a PLF model outperforms an LF model in terms of both convergence rate and prediction accuracy for missing data.
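The PID correction idea can be sketched as an SGD step built from proportional, integral, and derivative terms of the gradient; the gains, learning rate, and scalar-state form below are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def pid_sgd_step(grad, state, lr=0.01, Kp=1.0, Ki=0.1, Kd=0.2):
    """PID-style SGD update: P = current gradient, I = accumulated past
    gradients, D = gradient change. The I and D terms act as the
    'correction' that can speed up plain SGD's convergence."""
    state['i'] = state.get('i', 0.0) + grad      # integral of past gradients
    d = grad - state.get('prev', 0.0)            # derivative (gradient change)
    state['prev'] = grad
    return -lr * (Kp * grad + Ki * state['i'] + Kd * d)
```

Plain SGD is the special case Ki = Kd = 0; the integral term behaves much like momentum, while the derivative term damps oscillation.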
- Research Article
3
- 10.1007/s11222-014-9540-7
- Dec 9, 2014
- Statistics and Computing
We consider the problem of estimating covariance matrices of a particular structure that is a summation of a low-rank component and a sparse component. This is a general covariance structure encountered in multiple statistical models including factor analysis and random effects models, where the low-rank component relates to the correlations among variables coming from the latent factors or random effects and the sparse component displays the correlations of the remaining residuals. We propose a Bayesian method for estimating the covariance matrices of such structures by representing the covariance model in the form of a factor model with an unknown number of latent factors. We introduce binary indicators for factor selection and rank estimation for the low-rank component, combined with a Bayesian lasso method for the estimation of the sparse component. Simulation studies show that our method can recover the rank as well as the sparsity of the two respective components. We further extend our method to a latent-factor Markov graphical model, with a focus on the sparse conditional graphical model of the residuals as well as selecting the number of factors. We show through simulations that our Bayesian model can successfully recover both the number of latent factors and the Markov graphical model of the residuals.
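The covariance structure described (low-rank from latent factors plus a sparse residual) can be illustrated constructively in numpy; a diagonal residual is used below for simplicity, and the eigenvalue count merely shows how the rank of the low-rank component surfaces in the spectrum:

```python
import numpy as np

def factor_cov(Lambda, psi):
    """Covariance with the low-rank-plus-sparse structure of factor
    analysis: Sigma = Lambda @ Lambda.T + diag(psi). Lambda holds the
    loadings of k latent factors; diag(psi) stands in for the sparse
    residual component (the paper allows a general sparse matrix)."""
    return Lambda @ Lambda.T + np.diag(psi)
```

The number of latent factors appears as the number of eigenvalues of Sigma standing clearly above the residual level, which is the quantity the paper's binary factor-selection indicators aim to recover.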
- Research Article
556
- 10.1198/016214508000000869
- Dec 1, 2008
- Journal of the American Statistical Association
We describe studies in molecular profiling and biological pathway analysis that use sparse latent factor and regression models for microarray gene expression data. We discuss breast cancer applications and key aspects of the modeling and computational methodology. Our case studies aim to investigate and characterize heterogeneity of structure related to specific oncogenic pathways, as well as links between aggregate patterns in gene expression profiles and clinical biomarkers. Based on the metaphor of statistically derived “factors” as representing biological “subpathway” structure, we explore the decomposition of fitted sparse factor models into pathway subcomponents and investigate how these components overlay multiple aspects of known biological activity. Our methodology is based on sparsity modeling of multivariate regression, ANOVA, and latent factor models, as well as a class of models that combines all components. Hierarchical sparsity priors address questions of dimension reduction and multiple comparisons, as well as scalability of the methodology. The models include practically relevant non-Gaussian/nonparametric components for latent structure, underlying often quite complex non-Gaussianity in multivariate expression patterns. Model search and fitting are addressed through stochastic simulation and evolutionary stochastic search methods that are exemplified in the oncogenic pathway studies. Supplementary supporting material provides more details of the applications, as well as examples of the use of freely available software tools for implementing the methodology.
- Conference Article
25
- 10.1109/icdm50108.2020.00076
- Nov 1, 2020
A latent factor (LF) model can implement efficient analysis of a high-dimensional and sparse (HiDS) matrix from recommender systems (RSs). However, an LF model's representation learning ability on a targeted HiDS matrix depends heavily on the matrix's known data density. Unfortunately, an HiDS matrix's known data are limited due to users' activity limitations in RSs. Motivated by this observation, this paper proposes a Prediction-sampling-based Multilayer-structured Latent Factor (PMLF) model. Following the principle of Deep Forest [1], PMLF implements a loosely connected multilayered LF structure in which each layer generates synthetic ratings to enrich the input for the next layer. This injection process is carefully monitored through a random sampling process and nonlinear activations to avoid overfitting. Thus, PMLF's representation learning ability on an HiDS matrix is significantly enhanced owing to the carefully injected estimates and its generalized multilayer structure. Experimental results on four HiDS matrices from industrial RSs indicate that, compared with six state-of-the-art LF-based and deep neural network-based models, PMLF well balances prediction accuracy and computational efficiency, satisfying the demands of fast and accurate industrial applications.
- Research Article
13
- 10.1016/j.eswa.2017.05.058
- Jun 7, 2017
- Expert Systems with Applications
Modelling socially-influenced conditional preferences over feature values in recommender systems based on factorised collaborative filtering
- Conference Article
6
- 10.1109/smc42975.2020.9283344
- Oct 11, 2020
A recommender system (RS) commonly describes its user-item preferences with a high-dimensional and sparse (HiDS) matrix. A latent factor (LF) model relying on stochastic gradient descent (SGD) is frequently adopted to extract useful information from such an HiDS matrix. In spite of its efficiency, an SGD-based LF model commonly takes many iterations to converge. When processing a large-scale HiDS matrix, its computational efficiency should be improved by accelerating its convergence rate while maintaining its learning ability. To address this issue, this paper proposes a novel SGD algorithm that incorporates a nonlinear proportional-integral-derivative (NPID) controller into its learning scheme for building an LF model. The main idea is to adopt an NPID controller to model the learning residual of past iterations, thereby adjusting the learning direction and step size of the current iteration so that the resulting model converges fast. With the NPID-incorporated SGD algorithm, this study proposes an NPID-SGD-based LF (NSLF) model. Experimental results on two HiDS matrices demonstrate that, compared with a standard SGD-based LF model, the proposed model achieves higher computational efficiency and prediction accuracy for the missing data of an HiDS matrix.