Abstract

Background

The analysis of massive high throughput data via clustering algorithms is very important for elucidating gene functions in biological systems. However, traditional clustering methods have several drawbacks. Biclustering overcomes these limitations by grouping genes and samples simultaneously. It discovers subsets of genes that are co-expressed in certain samples. Recent studies showed that biclustering has a great potential in detecting marker genes that are associated with certain tissues or diseases. Several biclustering algorithms have been proposed. However, it is still a challenge to find biclusters that are significant based on biological validation measures. Besides that, there is a need for a biclustering algorithm that is capable of analyzing very large datasets in reasonable time.

Results

Here we present a fast biclustering algorithm called DeBi (Differentially Expressed BIclusters). The algorithm is based on a well known data mining approach called frequent itemset. It discovers maximum size homogeneous biclusters in which each gene is strongly associated with a subset of samples. We evaluate the performance of DeBi on a yeast dataset, on synthetic datasets and on human datasets.

Conclusions

We demonstrate that the DeBi algorithm provides functionally more coherent gene sets compared to standard clustering or biclustering algorithms using biological validation measures such as Gene Ontology term and Transcription Factor Binding Site enrichment. We show that DeBi is a computationally efficient and powerful tool in analyzing large datasets. The method is also applicable on multiple gene expression datasets coming from different labs or platforms.

Highlights

  • The analysis of massive high throughput data via clustering algorithms is very important for elucidating gene functions in biological systems

  • DeBi (Differentially Expressed BIclusters) yields functionally more coherent gene sets than standard clustering or biclustering algorithms

  • We evaluated the performance of DeBi on a yeast dataset [13]; on synthetic datasets [10]; on the Connectivity Map dataset, a reference collection of gene expression profiles from human cells treated with a variety of drugs [14]; on gene expression profiles of 2158 human tumor samples published by expO (Expression Project for Oncology); on a diffuse large B-cell lymphoma (DLBCL) dataset [15]; and on gene sets from the Molecular Signatures Database (MSigDB) C2 category


Introduction

The analysis of massive high throughput data via clustering algorithms is very important for elucidating gene functions in biological systems. The most common approach for detecting functionally related gene sets from such high throughput data is clustering [1]. Traditional clustering methods, like hierarchical clustering [2] and k-means [3], have several limitations. They assume that a cluster of genes behaves similarly in all samples, and they assign each gene to a single cluster. However, some genes may not be active in any of the samples, and some genes may participate in multiple processes. Biclustering overcomes these limitations by grouping genes and samples simultaneously: it discovers subsets of genes that are co-expressed in certain samples.
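To make the idea of grouping genes and samples simultaneously concrete, the following is a minimal sketch of how a frequent-itemset view of an expression matrix can surface biclusters. It is not the DeBi algorithm: the fixed-threshold binarization, the brute-force enumeration, and the names `binarize`, `frequent_sample_sets`, `min_genes` and `max_size` are all assumptions made for illustration. Each gene is treated as a "transaction" whose "items" are the samples in which it is highly expressed; a set of samples shared by enough genes is a candidate bicluster.

```python
from itertools import combinations


def binarize(matrix, threshold):
    """Turn an expression matrix (genes x samples) into 0/1 values.

    Assumption for illustration: an entry counts as 'expressed' when it
    exceeds a single fixed threshold.
    """
    return [[1 if value > threshold else 0 for value in row] for row in matrix]


def frequent_sample_sets(binary, min_genes, max_size):
    """Enumerate sample sets supported by at least `min_genes` genes.

    Brute force over all sample combinations up to `max_size`; real
    frequent itemset miners prune this search, which is omitted here.
    Returns a dict mapping a tuple of sample indices to the list of gene
    indices that are expressed in every one of those samples.
    """
    n_samples = len(binary[0]) if binary else 0
    result = {}
    for size in range(1, max_size + 1):
        for samples in combinations(range(n_samples), size):
            genes = [
                g for g, row in enumerate(binary)
                if all(row[s] for s in samples)
            ]
            if len(genes) >= min_genes:
                result[samples] = genes
    return result


# Toy data (made up): 4 genes (rows) x 4 samples (columns).
expr = [
    [5.0, 4.8, 0.2, 0.1],  # gene 0: high in samples 0 and 1
    [4.9, 5.1, 0.3, 0.2],  # gene 1: high in samples 0 and 1
    [0.1, 0.2, 4.7, 5.2],  # gene 2: high in samples 2 and 3
    [5.2, 4.9, 5.0, 0.1],  # gene 3: high in samples 0, 1 and 2
]

biclusters = frequent_sample_sets(binarize(expr, threshold=2.0),
                                  min_genes=2, max_size=2)
for samples, genes in biclusters.items():
    print("samples", samples, "-> genes", genes)
```

In this toy matrix, gene 3 appears both in the group tied to samples 0 and 1 and in the group tied to sample 2, which a single-assignment clustering of genes could not express.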
