Abstract

DNA microarray technologies are used extensively to profile the expression levels of thousands of genes under various conditions, yielding extremely large data-matrices. Thus, analyzing this information and extracting biologically relevant knowledge becomes a considerable challenge. A classical approach for tackling this challenge is to use clustering (also known as one-way clustering) methods where genes (or respectively samples) are grouped together based on the similarity of their expression profiles across the set of all samples (or respectively genes). An alternative approach is to develop biclustering methods to identify local patterns in the data. These methods extract subgroups of genes that are co-expressed across only a subset of samples and may feature important biological or medical implications. In this study we evaluate 13 biclustering and 2 clustering (k-means and hierarchical) methods. We use several approaches to compare their performance on two real gene expression data sets. For this purpose we apply four evaluation measures in our analysis: (1) we examine how well the considered (bi)clustering methods differentiate various sample types; (2) we evaluate how well the groups of genes discovered by the (bi)clustering methods are annotated with similar Gene Ontology categories; (3) we evaluate the capability of the methods to differentiate genes that are known to be specific to the particular sample types we study and (4) we compare the running time of the algorithms. In the end, we conclude that as long as the samples are well defined and annotated, the contamination of the samples is limited, and the samples are well replicated, biclustering methods such as Plaid and SAMBA are useful for discovering relevant subsets of genes and samples.
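Evaluation measure (2) above asks whether a discovered gene group is annotated with a Gene Ontology category more often than chance would predict. One common way to quantify this is a one-sided hypergeometric test. The sketch below illustrates that general idea only; the abstract does not say which test the study used, and the function name and the toy numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, drawn, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of genes on the array (assumed background)
    annotated:  background genes carrying the GO term of interest
    drawn:      size of the gene group returned by a (bi)clustering method
    hits:       genes in that group carrying the GO term
    """
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 1000 genes, 50 annotated with the term,
# a group of 20 genes of which 5 carry the term.
p_value = enrichment_p_value(1000, 50, 20, 5)
print(p_value)
```

A small p-value suggests the group is enriched for the term. In practice many GO terms are tested at once, so a multiple-testing correction would normally follow.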

Highlights

  • Modern high-throughput measurement technologies, such as microarrays, are able to quantify expression levels for tens of thousands of genes in various organisms

  • We found that gene-lists discovered by CTWC, FABIA, ISA, Plaid, SAMBA, and hierarchical clustering were significantly enriched with the GO terms cell cycle, M phase of the cell cycle, mitosis, cell division, proliferation, and response to stress

  • Our results show that Plaid, SAMBA, CTWC, hierarchical clustering, constant MSBE, and FABIA methods best distinguished the various sample-types in the multi-tissue type gene expression matrix

Introduction

Modern high-throughput measurement technologies, such as microarrays, are able to quantify expression levels for tens of thousands of genes in various organisms. Hierarchical clustering with heatmap visualization [4], k-means clustering, and self-organizing maps [5,6] have been successful in finding biologically important groups of genes or samples. These methods, however, do not take full advantage of the data, as clustering is done first for genes and then for samples (or vice versa). Wang et al. [8] used a biclustering algorithm (CMonkey [9]) to group breast tumors from 437 individuals based on the expression profiles of specific genes. They reported that it is possible to identify co-expressed gene-sets in the subgroups of breast tumor samples using biclustering methods.
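To make the limitation of one-way clustering concrete, the sketch below compares the similarity of two genes measured across all samples with their similarity across a subset of samples only. The expression values, gene names, and sample subset are invented for illustration and are not taken from the study; Pearson correlation is used here as one plausible similarity measure, not necessarily the one any of the evaluated methods uses.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical expression profiles of two genes over eight samples.
# The genes move together in samples 0-3 and unrelatedly in samples 4-7.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 4.0, 2.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 4.0, 9.0, 1.0, 7.0]
subset = [0, 1, 2, 3]  # assumed sample subset forming the local pattern

# Similarity over all samples, as a one-way clustering method would see it.
r_all = pearson(gene_a, gene_b)

# Similarity restricted to the sample subset, as a bicluster would see it.
r_subset = pearson([gene_a[i] for i in subset], [gene_b[i] for i in subset])

print(r_all, r_subset)
```

Over all eight samples the two genes look only weakly related, so a one-way method would be unlikely to group them, while within the subset they are almost perfectly correlated. Biclustering methods search for exactly such gene and sample subsets jointly rather than fixing one dimension in advance.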

