Abstract

Background: The availability of parallel, high-throughput microarray and sequencing experiments poses the challenge of how best to arrange and analyze the resulting mass of multidimensional data in a concerted way. Self-organizing maps (SOM), a machine learning method, enable a parallel sample- and gene-centered view of the data, combined with strong visualization and second-level analysis capabilities. The paper addresses aspects of the method with practical impact in the context of expression analysis of complex data sets.
Results: The method was applied to generate a SOM characterizing the whole-genome expression profiles of 67 healthy human tissues selected from ten tissue categories (adipose, endocrine, homeostasis, digestion, exocrine, epithelium, sexual reproduction, muscle, immune system and nervous tissues). SOM mapping reduces the dimension of expression data from tens of thousands of genes to a few thousand metagenes, where each metagene acts as a representative of a minicluster of co-regulated single genes. Tissue-specific properties and common properties shared between groups of tissues emerge as a handful of localized spots in the tissue maps, each collecting a group of co-regulated and co-expressed metagenes. The functional context of the spots was determined using overrepresentation analysis with respect to pre-defined gene sets of known functional impact. We found that tissue-related spots typically contain enriched populations of gene sets that correspond well to molecular processes in the respective tissues. Analysis techniques normally used at the gene level, such as two-way hierarchical clustering, provide a better signal-to-noise ratio and better representativeness when applied to the metagenes. Metagene-based clustering analyses aggregate the tissues into essentially three clusters containing nervous, immune system and the remaining tissues.
Conclusions: The global view on the behavior of a few well-defined modules of correlated and differentially expressed genes is more intuitive and more informative than the separate discovery of the expression levels of hundreds or thousands of individual genes. The metagene approach is less sensitive to a priori selection of genes. It can detect a coordinated expression pattern whose components would not pass single-gene significance thresholds and it is able to extract context-dependent patterns of gene expression in complex data sets.
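The reduction from single genes to metagenes described in the Results can be illustrated with a minimal, pure-Python sketch of SOM training. It is not the authors' implementation: the grid size, the learning-rate and neighborhood schedules, the toy profiles and all function names (`train_som`, `best_matching_unit`) are assumptions chosen for illustration only.

```python
import math
import random

def best_matching_unit(nodes, x):
    """Index of the node (metagene) with the smallest squared distance to x."""
    return min(range(len(nodes)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(nodes[k], x)))

def train_som(profiles, rows, cols, epochs=20, seed=0):
    """Train a small rectangular SOM; returns one metagene profile per grid node."""
    rng = random.Random(seed)
    dim = len(profiles[0])
    # Initialize every node with a randomly chosen gene profile.
    nodes = [list(rng.choice(profiles)) for _ in range(rows * cols)]
    total_steps = epochs * len(profiles)
    step = 0
    for _ in range(epochs):
        order = list(range(len(profiles)))
        rng.shuffle(order)
        for i in order:
            frac = step / total_steps
            lr = 0.5 * (1.0 - frac)                             # decaying learning rate
            radius = max(rows, cols) / 2 * (1.0 - frac) + 0.5   # shrinking neighborhood
            x = profiles[i]
            b_row, b_col = divmod(best_matching_unit(nodes, x), cols)
            for k, w in enumerate(nodes):
                row, col = divmod(k, cols)
                d2 = (row - b_row) ** 2 + (col - b_col) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))           # Gaussian neighborhood weight
                for j in range(dim):
                    w[j] += lr * h * (x[j] - w[j])              # pull node toward the gene
            step += 1
    return nodes

# Toy data: 10 "genes" measured in 4 "tissues" (values are made up).
profiles = [[1.0 + 0.1 * i, 1.0, -1.0, -1.0 - 0.1 * i] for i in range(5)]
profiles += [[-1.0 - 0.1 * i, -1.0, 1.0, 1.0 + 0.1 * i] for i in range(5)]

metagenes = train_som(profiles, rows=2, cols=2)
# Each gene is assigned to its closest metagene, forming the miniclusters.
assignment = [best_matching_unit(metagenes, x) for x in profiles]
```

Here ten gene profiles are summarized by four metagene profiles; in the setting of the paper the same idea compresses tens of thousands of genes into a few thousand metagenes.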

Highlights

  • The availability of parallel, high-throughput microarray and sequencing experiments poses the challenge of how best to arrange and analyze the resulting mass of multidimensional data in a concerted way

  • The color gradient of the map was chosen to visualize over- or underexpression of the metagenes in the particular tissue compared with the mean expression level of each metagene in the pool of all samples studied: maroon codes the highest level of gene expression; red, yellow and green indicate intermediate levels; and blue corresponds to the lowest level of gene expression

  • The physiology of tongue tissue as a ‘mucosa covered muscle’ is reflected in the expression profile. Another example is the pituitary gland, an endocrine gland located near the hypothalamus: its self-organizing map (SOM) landscape shows, in the left upper corner, the upregulated spot found in other nervous system tissues, as well as a unique spot in the right lower area that is not found in the profiles of other tissues. This unique spot evidently collects genes that are overexpressed in the pituitary gland, whereas the first spot represents a common signature typically found in nervous system samples


Summary

Background

High-throughput biological experiments that simultaneously monitor thousands of molecular observables provide an opportunity for investigating cellular behavior at multiple levels of resolution. We apply gene set overrepresentation analysis in visualization space at two different levels of data compression, given by the metagenes and by spots of metagenes, respectively. This grouping of coexpressed genes makes it possible to significantly reduce the dimensionality of expression data from tens of thousands of single genes to a handful of representative features. The samples are well classified in terms of distinct tissues and tissue categories, allowing the clear assignment of expression patterns. Despite these methodological issues, the discovery of the human body index data set in this study is motivated by the argument that tissue-specific RNA expression patterns provide important clues to the physiological function of the coding genes and are suitable as a reference for comparison with diseased tissues, as well as a basis for identifying molecular markers of injury to specific organs and tissues. Our analysis provides a first step towards a SOM atlas of gene activity in normal human tissues, which complements previous work on the diversity of gene expression in human tissues [20,21,22].
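Gene set overrepresentation analysis of a spot can be sketched as a one-sided hypergeometric test: given the total number of genes, the size of a pre-defined gene set and the number of genes in the spot, it asks how likely an overlap at least as large as the observed one would be by chance. The choice of the hypergeometric test, the function name `overrepresentation_p` and all numbers below are assumptions for illustration; the summary does not state which test statistic the authors used.

```python
from math import comb

def overrepresentation_p(n_total, n_set, n_spot, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_total, n_set, n_spot).

    n_total:   number of genes on the map (the background)
    n_set:     number of background genes belonging to the gene set
    n_spot:    number of genes collected in the spot
    n_overlap: number of spot genes that belong to the gene set
    """
    upper = min(n_set, n_spot)
    denom = comb(n_total, n_spot)
    numer = sum(comb(n_set, k) * comb(n_total - n_set, n_spot - k)
                for k in range(n_overlap, upper + 1))
    return numer / denom

# Hypothetical numbers: 20000 genes, a gene set of 200, a spot of 100 genes,
# 10 of which belong to the set (about 1 would be expected by chance).
p_value = overrepresentation_p(20000, 200, 100, 10)
```

A small p-value marks the gene set as enriched in the spot. When many gene sets are tested against each spot, the resulting p-values would normally need a multiple-testing correction.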

Results and discussion
Metagene characteristics and overexpression spots
Filtering metagenes and single genes
Metagene- and single genes-based clustering analysis
Metagene- and single gene-based correlation analyses
Sample cartography
Summary and Conclusions
Microarray Data
Preprocessing of microarray intensities
SOM-mapping of gene expression profiles
Supporting maps
Gene set overrepresentation analysis
Grouping samples
Estimating similarities
Clustering metagenes and single genes
