Abstract

Background

The joint study of multiple datasets has become a common technique for increasing statistical power in detecting biomarkers obtained from smaller studies. The approach generally followed is based on the fact that as the total number of samples increases, we expect to have greater power to detect associations of interest. This methodology has been applied to genome-wide association and transcriptomic studies due to the availability of datasets in the public domain. While this approach is well established in biostatistics, the introduction of new combinatorial optimization models to address this issue has not been explored in depth. In this study, we introduce a new model for the integration of multiple datasets and we show its application in transcriptomics.

Methods

We propose a new combinatorial optimization problem that addresses the core issue of biomarker detection in integrated datasets. Optimal solutions for this model deliver a feature selection from a panel of prospective biomarkers. The model we propose is a generalised version of the (α,β)-k-Feature Set problem. We illustrate the performance of this new methodology via a challenging meta-analysis task involving six prostate cancer microarray datasets. The results are then compared to the popular RankProd meta-analysis tool and to what can be obtained by analysing the individual datasets by statistical and combinatorial methods alone.

Results

Application of the integrated method resulted in a more informative signature than the rank-based meta-analysis or individual dataset results, and overcame problems arising from real-world datasets. The set of genes identified is highly significant in the context of prostate cancer. The method used does not rely on homogenisation or transformation of values to a common scale, and at the same time is able to capture markers associated with subgroups of the disease.

Highlights

  • The extraction of information arising from the integration of multiple datasets and its translation into domain knowledge is a significant problem in several fields

  • L-2695, L-3044 and L-3289 datasets are available in Gene Expression Omnibus under accession number GSE3933

  • The structure of the article is as follows: the materials and methods employed in this paper are explained in detail in Section 2, and in Section 3 we present the results of applying the proposed integration and feature selection method to prostate cancer datasets


Summary

Introduction

The extraction of information arising from the integration of multiple datasets and its translation into domain knowledge is a significant problem in several fields. More and more biology and health related studies around the world are engaging in the useful policy of leaving their raw results available for the common good via public domain databases. This open sharing has benefitted the reproducibility of other researchers’ findings. The existing online datasets are becoming very useful for the development of new mathematical and computational approaches for pattern recognition, machine learning and artificial intelligence methods. This healthy practice of sharing data is being increasingly adopted by governments and scientific journals. The approach generally followed is based on the fact that as the total number of samples increases, we expect to have greater power to detect associations of interest. This methodology has been applied to genome-wide association and transcriptomic studies due to the availability of datasets in the public domain. We introduce a new model for the integration of multiple datasets and we show its application in transcriptomics.
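To make the (α,β)-k-Feature Set problem named in the abstract more concrete, the sketch below checks the condition as it is commonly stated for binary (discretised) data: a set of k features qualifies if every pair of samples from different classes differs on at least α of the chosen features, and every pair from the same class agrees on at least β of them. This is an illustration only. It does not reproduce the generalised multi-dataset model proposed in the paper, the exhaustive search is a stand-in that is only practical for tiny inputs, and the function names and the toy data are hypothetical.

```python
from itertools import combinations


def is_alpha_beta_feature_set(samples, labels, features, alpha, beta):
    """Check the (alpha, beta) conditions for a chosen tuple of feature indices.

    samples: list of equal-length tuples of binary values (one tuple per sample)
    labels: class label of each sample
    """
    for i, j in combinations(range(len(samples)), 2):
        agree = sum(samples[i][f] == samples[j][f] for f in features)
        differ = len(features) - agree
        # Pairs from different classes need at least alpha differing features.
        if labels[i] != labels[j] and differ < alpha:
            return False
        # Pairs from the same class need at least beta agreeing features.
        if labels[i] == labels[j] and agree < beta:
            return False
    return True


def smallest_feature_set(samples, labels, alpha, beta):
    """Brute-force search for a smallest qualifying feature set (illustrative only)."""
    n_features = len(samples[0])
    for k in range(1, n_features + 1):
        for subset in combinations(range(n_features), k):
            if is_alpha_beta_feature_set(samples, labels, subset, alpha, beta):
                return subset
    return None  # no subset satisfies the conditions


# Toy data, invented for illustration: 4 samples, 3 binary features.
samples = [(1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 1, 0)]
labels = ["tumour", "tumour", "normal", "normal"]

print(smallest_feature_set(samples, labels, alpha=1, beta=1))  # (0,)
print(smallest_feature_set(samples, labels, alpha=2, beta=1))  # (0, 2)
```

Raising α asks for redundancy in the signature: with α = 2 a single discriminating feature is no longer enough, so the search returns two features that both separate the classes.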

