Abstract

Multivariate binary data are increasingly common in practice. Although some adaptations of principal component analysis are used to reduce dimensionality for this kind of data, none of them provides a simultaneous representation of rows and columns (a biplot). Recently, a technique named logistic biplot (LB) has been developed to represent the rows and columns of a binary data matrix simultaneously, but the algorithm used to fit its parameters is too computationally demanding to be useful when the matrix is sparse or large. We propose fitting the LB model using nonlinear conjugate gradient (CG) or majorization–minimization (MM) algorithms, and we introduce a cross-validation procedure to select the hyperparameter that represents the number of dimensions in the model. A Monte Carlo study that considers scenarios with several sparsity levels and different dimensions of the binary data set shows that the cross-validation procedure selects the model successfully for all algorithms studied. A comparison of running times shows that the CG algorithm is more efficient in the presence of sparsity and when the matrix is not very large, while the MM algorithm performs better when the binary matrix is balanced or large. To complement the proposed methods and give practical support, an R package called BiplotML has been written. Finally, real binary data on gene expression methylation are used to illustrate the proposed methods.
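
To make the idea of choosing the number of dimensions by cross-validation concrete, the following is a minimal Python sketch of one possible cell-wise hold-out scheme. It is an illustration under stated assumptions, not the procedure of the paper or the BiplotML interface: the function names (`fit_masked`, `cv_error`, `select_dimensions`), the random fold assignment, the 0.5 classification threshold, and the simple SVD-based majorization used for fitting are all hypothetical choices made here.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_masked(X, observed, k, n_iter=200):
    """Rank-k logistic fit that only uses cells where `observed` is True.

    Assumption: a simple quadratic majorization (logistic curvature <= 1/4)
    solved by a truncated SVD, chosen here for brevity.
    """
    theta = np.zeros(X.shape)
    for _ in range(n_iter):
        # Working matrix: observed cells move toward the data,
        # held-out cells keep their current fitted value.
        Z = np.where(observed, theta + 4.0 * (X - sigmoid(theta)), theta)
        mu = Z.mean(axis=0)                      # column intercepts
        U, s, Vt = np.linalg.svd(Z - mu, full_matrices=False)
        theta = mu + (U[:, :k] * s[:k]) @ Vt[:k]
    return theta

def cv_error(X, k, n_folds=5, n_iter=200, seed=0):
    """Mean misclassification rate on held-out cells for a rank-k fit."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=X.shape)  # one fold label per cell
    errs = []
    for f in range(n_folds):
        held_out = fold == f
        theta = fit_masked(X, ~held_out, k, n_iter=n_iter)
        pred = (theta > 0).astype(int)             # fitted probability > 0.5
        errs.append(np.mean(pred[held_out] != X[held_out]))
    return float(np.mean(errs))

def select_dimensions(X, candidates=(1, 2, 3, 4)):
    """Return the candidate k with the smallest cross-validated error."""
    return min(candidates, key=lambda k: cv_error(X, k))
```

Calling `select_dimensions(X)` on a 0/1 NumPy array compares the candidate dimensions by their held-out error; other error measures or fold designs could be substituted without changing the overall structure.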

Highlights

  • In many studies, researchers have a binary multivariate data matrix and aim to reduce dimensions to investigate the structure of the data

  • We propose estimating the parameters of the Logistic Biplot (LB) model in two ways: with conjugate gradient (CG) methods, or with a coordinate descent MM algorithm

  • The Logistic Biplot (LB) model is a dimensionality reduction technique that generalizes principal component analysis (PCA) to binary variables and has the advantage of representing individuals and variables simultaneously
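
As an illustration of the model behind these highlights, the sketch below fits a low-rank logistic model with a basic majorization–minimization scheme in Python. It is a simplified stand-in under stated assumptions, not the coordinate descent MM or CG algorithms of the paper and not the BiplotML implementation: the parameterization `theta[i, j] = b0[j] + A[i] @ B[j]`, the function name `fit_logistic_biplot_mm`, and the truncated-SVD update are choices made here for brevity. Rows of `A` play the role of row markers and rows of `B` the role of column markers in the biplot.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def neg_log_lik(X, theta):
    # Bernoulli negative log-likelihood with logit link:
    # sum over cells of log(1 + exp(theta)) - x * theta.
    return float(np.sum(np.logaddexp(0.0, theta) - X * theta))

def fit_logistic_biplot_mm(X, k=2, n_iter=100, seed=0):
    """Fit theta = b0 + A @ B.T by a quadratic-majorization MM (sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = 0.01 * rng.standard_normal((n, k))   # row coordinates
    B = 0.01 * rng.standard_normal((p, k))   # column coordinates
    b0 = np.zeros(p)                         # column intercepts
    losses = []
    for _ in range(n_iter):
        theta = b0 + A @ B.T
        losses.append(neg_log_lik(X, theta))
        # Majorize: the logistic loss has curvature at most 1/4, so the
        # surrogate is a least-squares problem with working matrix Z.
        Z = theta + 4.0 * (X - sigmoid(theta))
        # Minimize: column means give the intercepts, and a rank-k SVD of
        # the centered working matrix gives the low-rank part.
        b0 = Z.mean(axis=0)
        U, s, Vt = np.linalg.svd(Z - b0, full_matrices=False)
        A = U[:, :k] * s[:k]
        B = Vt[:k].T
    losses.append(neg_log_lik(X, b0 + A @ B.T))
    return A, B, b0, losses
```

Because each update minimizes an upper bound that touches the loss at the current estimate, the recorded `losses` should not increase from one iteration to the next, which is the defining property of an MM algorithm. Plotting the rows of `A` as points and the rows of `B` as directions then gives a two-dimensional biplot when `k=2`.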


Summary

Introduction

In many studies, researchers have a binary multivariate data matrix and aim to reduce its dimensions to investigate the structure of the data. In biological research, and in particular in the analysis of genetic and epigenetic alterations, the amount of binary data has been increasing over time [5]. In these cases, classical methods to reduce dimensionality, such as principal component analysis (PCA), are not appropriate. Collins et al. [6] provide a generalization of PCA to exponential family data using the generalized linear model framework. This approach suggests using likelihood-based loss functions suited to the type of data.
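
For binary data, the likelihood loss in this framework can be written out explicitly. The following is a sketch in notation assumed here for illustration (the symbols x_ij, π_ij, θ_ij, a_i, b_j and b_j0 do not appear in the text above): each entry x_ij is treated as a Bernoulli variable whose success probability π_ij is linked to a low-rank natural parameter θ_ij through the logit link, and the loss is the negative log-likelihood.

```latex
\pi_{ij} = \frac{1}{1 + e^{-\theta_{ij}}}, \qquad
\theta_{ij} = b_{j0} + \mathbf{a}_i^{\top}\mathbf{b}_j, \qquad
\mathcal{L}(\Theta) = -\sum_{i=1}^{n}\sum_{j=1}^{p}
  \left[ x_{ij}\log \pi_{ij} + (1 - x_{ij})\log\left(1 - \pi_{ij}\right) \right]
```

Applying the same construction to Gaussian data with the identity link gives, up to constants, the squared-error loss of classical PCA, which is the sense in which the approach generalizes PCA.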
