Abstract

Background: In systems biology it is common to obtain information from multiple sources for the same set of biological entities. Examples include expression data for the same set of orthologous genes screened in different organisms and data on the same set of culture samples obtained with different high-throughput techniques. A major challenge is to find the important biological processes underlying the data and to disentangle processes common to all data sources from processes distinctive for a specific source. Recently, two promising simultaneous data integration methods have been proposed to attain this goal, namely generalized singular value decomposition (GSVD) and simultaneous component analysis with rotation to common and distinctive components (DISCO-SCA).

Results: Both theoretical analyses and applications to biologically relevant data show that: (1) straightforward applications of GSVD yield unsatisfactory results, (2) DISCO-SCA performs well, (3) provided proper pre-processing and algorithmic adaptations, GSVD reaches a performance level similar to that of DISCO-SCA, and (4) DISCO-SCA is directly generalizable to more than two data sources. The biological relevance of DISCO-SCA is illustrated with two applications. First, in a setting of comparative genomics, it is shown that DISCO-SCA recovers a common theme of cell cycle progression and a yeast-specific response to pheromones. The biological annotation was obtained by applying Gene Set Enrichment Analysis in an appropriate way. Second, in an application of DISCO-SCA to metabolomics data for Escherichia coli obtained with two different chemical analysis platforms, it is illustrated that the metabolites involved in some of the biological processes underlying the data are detected by only one of the two platforms; therefore, platforms for microbial metabolomics should be tailored to the biological question.

Conclusions: Both DISCO-SCA and properly applied GSVD are promising integrative methods for finding common and distinctive processes in multisource data. Open source code for both methods is provided.

Highlights

  • In biology several important research questions focus on the integration of data that come from different sources but that are gathered under the same set of conditions or for the same set of biomolecules

  • Examples where different measurement platforms form the different sources are the integration of ChIP-chip, motif, and expression data collected for the same set of genes [5] and metabolomics data obtained for the same set of Escherichia coli samples using either gas chromatography mass spectrometry (GC-MS) or liquid chromatography mass spectrometry (LC-MS) as a chemical analysis method

  • The performance of generalized singular value decomposition (GSVD) and DISCO-SCA is compared first on simulated data and then on two empirical data sets: the first is on comparative genomics, using synchronized cell cycle experiments for the human and yeast genomes; the second is on coupled metabolomics data obtained for the same samples of E. coli using different chemical analysis methods

Summary

Introduction

In biology several important research questions focus on the integration of data that come from different sources (e.g., organisms, measurement platforms) but that are gathered under the same set of conditions or for the same set of biomolecules (e.g., genes, metabolites). Examples where different measurement platforms form the different sources are the integration of ChIP-chip, motif, and expression data collected for the same set of genes [5] and metabolomics data obtained for the same set of Escherichia coli samples using either gas chromatography mass spectrometry (GC-MS) or liquid chromatography mass spectrometry (LC-MS) as a chemical analysis method. In all these examples, the use of multiple sources to collect data on the same set of entities leads to data consisting of multiple data blocks; this introduces a problem of data fusion, in which the processes common to all blocks must be separated from those distinctive for a single block. Two promising simultaneous data integration methods have been proposed to address this problem, namely generalized singular value decomposition (GSVD) and simultaneous component analysis with rotation to common and distinctive components (DISCO-SCA).
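
To make the multi-block setting concrete, the following is a minimal sketch of a plain simultaneous component analysis on two blocks that share their rows, computed as an SVD of the column-wise concatenated data. It is not the authors' DISCO-SCA implementation and it omits the rotation step entirely. The toy data, block sizes, and variable names are all assumptions made for illustration. For each component, the sketch reports which fraction of its explained sum of squares falls in block 1; values near 0 or 1 would suggest a distinctive component and values in between a common one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols1, n_cols2, n_comp = 200, 10, 12, 3  # hypothetical sizes

# Hypothetical toy data: orthonormal score vectors for one common process
# and one distinctive process per block (rows are the shared entities).
T, _ = np.linalg.qr(rng.standard_normal((n_rows, n_comp)))
common, only1, only2 = T.T

# Each block contains the common process, its own distinctive process,
# and a small amount of noise.
X1 = (np.outer(common, rng.standard_normal(n_cols1))
      + np.outer(only1, rng.standard_normal(n_cols1))
      + 0.01 * rng.standard_normal((n_rows, n_cols1)))
X2 = (np.outer(common, rng.standard_normal(n_cols2))
      + np.outer(only2, rng.standard_normal(n_cols2))
      + 0.01 * rng.standard_normal((n_rows, n_cols2)))

# Centre every column (a common pre-processing choice, assumed here).
X1 = X1 - X1.mean(axis=0)
X2 = X2 - X2.mean(axis=0)

# Simultaneous components: SVD of the concatenated blocks.
X = np.hstack([X1, X2])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt[:n_comp].T * s[:n_comp]  # shape (n_cols1 + n_cols2, n_comp)

# Sum of squares that each component explains within each block.
ss1 = (loadings[:n_cols1] ** 2).sum(axis=0)
ss2 = (loadings[n_cols1:] ** 2).sum(axis=0)
share1 = ss1 / (ss1 + ss2)

for k in range(n_comp):
    print(f"component {k + 1}: fraction of explained SS in block 1 = {share1[k]:.2f}")
```

The unrotated components of such a decomposition generally mix common and distinctive information, so the reported fractions need not be close to 0, 0.5, or 1. A rotation towards a common/distinctive target structure, as the name DISCO-SCA indicates, is intended to make that separation explicit.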

