Abstract

Background: An exponential growth of high-throughput biological information and data has occurred in the past decade, supported by technologies such as microarrays and RNA-Seq. Most data generated using such methods encode large amounts of rich information and are used to determine diagnostic and prognostic biomarkers. Although data storage costs have fallen, the process of capturing data using these technologies is still expensive. Moreover, the time required for the assay, from sample preparation to raw value measurement, is excessive (on the order of days). There is an opportunity to reduce both the cost and the time of generating such expression datasets.

Results: We propose a framework in which complete gene expression values can be reliably predicted in silico from partial measurements. This is achieved by modelling expression data as a low-rank matrix and then applying recently discovered matrix completion techniques based on nonlinear convex optimisation. We evaluated the prediction of gene expression data on 133 studies, sourced from a combined total of 10,921 samples. We show that such datasets can be constructed with a low relative error even at high missing value rates (>50%), and that the predicted datasets can be reliably used as surrogates for further analysis.

Conclusion: This method has potentially far-reaching applications, including how biomedical data is sourced and generated, and transcriptomic prediction by optimisation. We show that gene expression data can be computationally constructed, thereby potentially reducing the costs of gene expression profiling. In conclusion, this method shows great promise for opening new avenues in research on low-rank matrix completion in the biological sciences.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1106-6) contains supplementary material, which is available to authorized users.
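To make the idea of low-rank matrix completion concrete, the following is a minimal sketch on synthetic data. It is not the authors' implementation: it assumes a generic soft-thresholded SVD iteration (often called Soft-Impute) as a stand-in for nuclear-norm-based completion, and the function name `soft_impute`, the regularisation weight `lam`, the matrix sizes, and the 50% sampling rate are all illustrative choices.

```python
import numpy as np


def soft_impute(observed, mask, lam=0.5, n_iter=500):
    """Estimate a low-rank matrix from the entries where mask is True.

    Hypothetical helper for illustration only; it repeatedly fills the
    unmeasured entries with the current estimate and soft-thresholds
    the singular values.
    """
    estimate = np.zeros_like(observed, dtype=float)
    for _ in range(n_iter):
        # Keep measured values, fill the gaps with the current estimate.
        filled = np.where(mask, observed, estimate)
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        # Shrinking singular values is the proximal step for the nuclear
        # norm, a convex surrogate for matrix rank.
        s = np.maximum(s - lam, 0.0)
        estimate = (u * s) @ vt
    return estimate


# Synthetic "expression" matrix (genes x samples) with exact rank 2.
rng = np.random.default_rng(0)
n_genes, n_samples, rank = 60, 40, 2
truth = rng.normal(size=(n_genes, rank)) @ rng.normal(size=(rank, n_samples))

# Assumption: about half of the entries are measured, the rest are missing.
mask = rng.random(truth.shape) < 0.5
observed = np.where(mask, truth, 0.0)

completed = soft_impute(observed, mask)

# Relative error computed only on the entries that were never measured.
missing = ~mask
rel_err = np.linalg.norm((completed - truth)[missing]) / np.linalg.norm(truth[missing])
print(f"relative error on held-out entries: {rel_err:.3f}")
```

Real expression matrices are only approximately low rank and contain measurement noise, so the regularisation weight would need tuning (for example by holding out a subset of the measured entries), and the error on real data may differ substantially from this synthetic case.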

Highlights

  • An exponential growth of high-throughput biological information and data has occurred in the past decade, supported by technologies, such as microarrays and RNA-Seq

  • We show that sparse gene expression measurements (“known” expression levels) could be used to artificially construct the gene expression dataset using non-linear convex optimisation, and report prediction results on diverse expression datasets sourced from multiple experiments

  • Methods: we introduce the principles involved in modelling low-rank matrix completion and the artificial construction of the gene expression dataset from known sparse expression levels

Introduction

An exponential growth of high-throughput biological information and data has occurred in the past decade, supported by technologies such as microarrays and RNA-Seq. This growth is supported by the development of new technologies that profile gene expression in an automated manner. Such technologies have evolved significantly in the past 20 years, from initially monitoring fewer than 50 features per slide [2] to whole-genome expression analysis with new-generation microarrays having more than 10^6 features, such as GeneChip oligonucleotide probe-based arrays and high-density bead arrays [3]. This evolution has continued with next-generation sequencing (NGS) methods used to quantify RNA in a sample [4], which have proven advantageous for discovery-based experiments and offer a larger dynamic range.
