Abstract
Background: High throughput data are complex and methods that reveal structure underlying the data are most useful. Principal component analysis, frequently implemented as a singular value decomposition, is a popular technique in this respect. Nowadays the challenge is often to reveal structure in several sources of information (e.g., transcriptomics, proteomics) that are available for the same biological entities under study. Simultaneous component methods are most promising in this respect. However, the interpretation of the principal and simultaneous components is often daunting because the contributions of each of the biomolecules (transcripts, proteins) have to be taken into account.
Results: We propose a sparse simultaneous component method that makes many of the parameters redundant by shrinking them to zero. It includes principal component analysis, sparse principal component analysis, and ordinary simultaneous component analysis as special cases. Several penalties can be tuned that account in different ways for the block structure present in the integrated data. This yields known sparse approaches such as the lasso, the ridge penalty, the elastic net, the group lasso, the sparse group lasso, and the elitist lasso. In addition, the algorithmic results can easily be transposed to the context of regression. Metabolomics data obtained with two measurement platforms for the same set of Escherichia coli samples are used to illustrate the proposed methodology and the properties of the different penalties with respect to sparseness across and within data blocks.
Conclusion: Sparse simultaneous component analysis is a useful method for data integration. First, simultaneous analyses of multiple blocks offer advantages over sequential and separate analyses; second, interpretation of the results is greatly facilitated by their sparseness. The approach offered is flexible and allows the block structure to be taken into account in different ways. As such, structures can be found that are exclusively tied to one data platform (group lasso approach) as well as structures that involve all data platforms (elitist lasso approach).
Availability: The additional file contains a MATLAB implementation of the sparse simultaneous component method.
Highlights
High throughput data are complex and methods that reveal structure underlying the data are most useful
Algorithm (Notation): We will make use of the following formal notation: matrices are denoted by bold uppercase letters, vectors by bold lowercase letters, the transpose by the superscript T, and the cardinality by the capital of the letter used to run the index (see [15]).
Model: Simultaneous component analysis is an extension of principal component analysis (PCA) to the case of multiple coupled data matrices.
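To make the model highlight concrete, the following is a minimal sketch of the usual least-squares formulation of simultaneous component analysis. The symbols (K blocks X_k of size I x J_k, shared component scores T with R components, block-specific loadings P_k, residuals E_k) and the orthonormality constraint on T are assumptions chosen to match the notation convention stated above; they are not quoted from the paper.

```latex
% Assumed notation (not taken verbatim from the source):
%   X_k : I x J_k data block, k = 1, ..., K (same I rows in every block)
%   T   : I x R matrix of component scores shared by all blocks
%   P_k : J_k x R matrix of loadings for block k
%   E_k : I x J_k matrix of residuals
\begin{align}
  \mathbf{X}_k &= \mathbf{T}\mathbf{P}_k^{T} + \mathbf{E}_k,
    \qquad k = 1, \dots, K, \\
  \min_{\mathbf{T},\,\mathbf{P}_1, \dots, \mathbf{P}_K}
    &\;\sum_{k=1}^{K} \bigl\lVert \mathbf{X}_k - \mathbf{T}\mathbf{P}_k^{T} \bigr\rVert_F^{2}
    \quad \text{subject to } \mathbf{T}^{T}\mathbf{T} = \mathbf{I}.
\end{align}
```

With a single block (K = 1) the sum has one term and the problem reduces to ordinary PCA, which is the sense in which the simultaneous method extends it.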
Summary
High throughput data are complex and methods that reveal structure underlying the data are most useful. Where several measurement platforms are available for the same samples, such data are best analyzed simultaneously instead of separately or sequentially, as this yields an aggregated view. In this respect, simultaneous component methods, which are an extension of principal component analysis (PCA) to the case of multiple coupled data blocks, were proposed and successfully used [4,5,6,7]. The problem of interpretation holds even more for simultaneous component methods, as these involve multiple large sets of variables. To deal with this issue, sparse approaches have been proposed, mainly within the context of regression analysis (e.g., [9,10]) and for principal component analysis [8,11,12,13,14]. These select a limited number of variables by shrinking many of the weights to zero, which is accomplished by proper penalization of these (regression) weights. The estimation procedure used is efficient, and the associated MATLAB code can be found in the additional file.
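As an illustration of how penalization shrinks weights to zero, the sketch below applies two standard shrinkage operators to a toy set of weights split into two blocks. It is not the estimation procedure of the paper or its MATLAB code. The function names, the toy weights, the penalty value 0.2, and the square-root-of-block-size weighting in the group lasso are all assumptions made for this example.

```python
import math


def soft_threshold(w, lam):
    """Lasso shrinkage: move w toward zero by lam and set it to zero if |w| <= lam."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def lasso_shrink(block, lam):
    """Apply the lasso shrinkage to every weight individually."""
    return [soft_threshold(w, lam) for w in block]


def group_lasso_shrink(blocks, lam):
    """Shrink each block as a whole; a block with a small norm is zeroed entirely.

    The threshold lam * sqrt(block size) is a common convention (an assumption here).
    """
    shrunk = []
    for block in blocks:
        norm = math.sqrt(sum(w * w for w in block))
        threshold = lam * math.sqrt(len(block))
        scale = max(1.0 - threshold / norm, 0.0) if norm > 0 else 0.0
        shrunk.append([scale * w for w in block])
    return shrunk


# Hypothetical weights for two data blocks (e.g., two measurement platforms).
blocks = [[0.9, -0.7, 0.05], [0.25, -0.1]]

lasso_result = [lasso_shrink(block, 0.2) for block in blocks]
group_result = group_lasso_shrink(blocks, 0.2)

print("lasso:      ", lasso_result)
print("group lasso:", group_result)
```

In this toy example the lasso zeroes individual small weights inside both blocks (sparseness within blocks), whereas the group lasso keeps every weight of the first block and removes the second block as a whole (sparseness across blocks).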