Abstract

Background: Collective analysis of the increasingly emerging gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify, in a tuneable manner, the subsets of genes which are consistently co-expressed in all of the provided datasets. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets.

Results: Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method can identify the subsets of genes consistently co-expressed in one subset of datasets while being poorly co-expressed in another subset of datasets, and can also identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth knowledge of synthetic data with the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those synthesised datasets, we validate UNCLES while comparing it with conventional clustering methods and, of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets, where it produces focused clusters that conform well to known biological facts. Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn.

Conclusions: The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0614-0) contains supplementary material, which is available to authorized users.

Highlights

  • Collective analysis of the increasingly emerging gene expression datasets is required

  • Some studies have used a core subset of genes that are well known to participate in the target pathway as a template, and many microarray datasets were mined for the genes that are consistently co-expressed with that template of genes [2, 6]

  • In contrast, we have recently proposed the binarisation of consensus partition matrices (Bi-CoPaM) method [10], which has the unique ability to address, in an unsupervised way, the research question: which are the subsets of genes that are consistently co-expressed over a set of genome-wide datasets? Those datasets could have been generated under different conditions and biological contexts, and even from different species [11]


Introduction

Collective analysis of the increasingly emerging gene expression datasets is required. Some studies have used a core subset of genes that are well known to participate in the target pathway as a template, and many microarray datasets were mined for the genes that are consistently co-expressed with that template of genes [2, 6]. One drawback of this approach is that it cannot be applied without the availability of a starting template of co-expressed genes. In contrast, we have recently proposed the binarisation of consensus partition matrices (Bi-CoPaM) method [10], which has the unique ability to address, in an unsupervised way, the research question: which are the subsets of genes that are consistently co-expressed over a set of genome-wide (or filtered) datasets? Those datasets could have been generated under different conditions and biological contexts, and even from different species [11].
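
To make the research question above more concrete, the following is a minimal toy sketch of the general idea of combining per-dataset clustering results into a consensus matrix and then binarising it. It is not the published Bi-CoPaM procedure: the function names, the simple threshold rule, the example matrices, and the assumption that cluster labels are already aligned across datasets are all hypothetical and chosen only for illustration.

```python
def consensus_partition(partitions):
    """Average aligned K x N membership matrices (one per dataset).

    Assumption: row k refers to the same cluster in every dataset.
    """
    n_datasets = len(partitions)
    n_clusters = len(partitions[0])
    n_genes = len(partitions[0][0])
    return [
        [sum(p[k][g] for p in partitions) / n_datasets for g in range(n_genes)]
        for k in range(n_clusters)
    ]


def binarise(consensus, threshold):
    """Keep a gene in a cluster only if its consensus membership >= threshold.

    Illustrative rule only; a higher threshold gives tighter clusters.
    """
    return [[m >= threshold for m in row] for row in consensus]


# Three hypothetical datasets: 2 clusters (rows) x 5 genes (columns).
p1 = [[1, 1, 1, 0, 0],
      [0, 0, 0, 1, 1]]
p2 = [[1, 1, 0, 0, 0],
      [0, 0, 1, 1, 1]]
p3 = [[1, 1, 1, 0, 1],
      [0, 0, 0, 1, 0]]

consensus = consensus_partition([p1, p2, p3])

# threshold 1.0: only genes placed in the same cluster in every dataset survive
strict = binarise(consensus, 1.0)
# threshold 0.6: genes agreeing in at least two of the three datasets survive
relaxed = binarise(consensus, 0.6)

print(strict)
print(relaxed)
```

In this toy setting the threshold plays the role of a tuning knob: at 1.0 some genes are left unassigned to any cluster, while at 0.6 every gene is assigned to the cluster that most datasets agree on.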
