Abstract

Motivation: Successful science often involves not only performing experiments well, but also choosing well among many possible experiments. In a hypothesis generation setting, choosing an experiment well means choosing an experiment whose results are interesting or novel. In this work, we formalize this selection procedure in the context of genomics and epigenomics data generation. Specifically, we consider the task faced by a scientific consortium such as the National Institutes of Health ENCODE Consortium, whose goal is to characterize all of the functional elements in the human genome. Given a list of possible cell types or tissue types (‘biosamples’) and a list of possible high-throughput sequencing assays, where at least one experiment has been performed in each biosample and for each assay, we ask ‘Which experiments should ENCODE perform next?’

Results: We demonstrate how to represent this task as a submodular optimization problem, where the goal is to choose a panel of experiments that maximize the facility location function. A key aspect of our approach is that we use imputed data, rather than experimental data, to directly answer the posed question. We find that, across several evaluations, our method chooses a panel of experiments that span a diversity of biochemical activity. Finally, we propose two modifications of the facility location function, including a novel submodular–supermodular function, that allow incorporation of domain knowledge or constraints into the optimization procedure.

Availability and implementation: Our method is available as a Python package at https://github.com/jmschrei/kiwano and can be installed using the command pip install kiwano. The source code used here and the similarity matrix can be found at http://doi.org/10.5281/zenodo.3708538.

Supplementary information: Supplementary data are available at Bioinformatics online.
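To make the Results statement concrete, the following is a minimal sketch of greedy maximization of the facility location function, f(S) = sum over i of max over j in S of sim[i, j]. It is not the kiwano package's API: the function name `greedy_facility_location`, the `already_performed` argument, and the toy similarity matrix are all hypothetical, and the sketch assumes a square matrix of non-negative pairwise similarities between candidate experiments (in the paper, similarities are derived from imputed data).

```python
import numpy as np


def greedy_facility_location(similarity, k, already_performed=()):
    """Greedily choose k experiments under the facility location objective.

    Hypothetical helper for illustration only. Assumes `similarity` is a
    square array of non-negative similarities between candidate experiments.
    """
    similarity = np.asarray(similarity, dtype=float)
    n = similarity.shape[0]
    available = np.ones(n, dtype=bool)

    # best[i] = similarity of experiment i to its closest selected experiment.
    best = np.zeros(n)
    for j in already_performed:
        available[j] = False
        best = np.maximum(best, similarity[:, j])

    chosen = []
    for _ in range(k):
        # Objective value if candidate column j were added, for every j at once.
        covered = np.maximum(similarity, best[:, None]).sum(axis=0)
        gains = covered - best.sum()
        gains[~available] = -np.inf  # never re-select an experiment
        j = int(np.argmax(gains))
        chosen.append(j)
        available[j] = False
        best = np.maximum(best, similarity[:, j])
    return chosen, float(best.sum())


# Toy, made-up similarities: experiments 0 and 1 resemble each other,
# as do experiments 2 and 3.
toy = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.1, 0.0, 1.0, 0.7],
    [0.0, 0.0, 0.7, 1.0],
])
panel, value = greedy_facility_location(toy, k=2)
print(panel, value)
```

Because each step picks the candidate with the largest marginal gain, the second choice tends to come from a group that the first choice does not already represent, which is the diversity property the abstract describes. Seeding the selection with `already_performed` is one possible way to model experiments that already exist; it is an assumption of this sketch rather than a description of the authors' implementation.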

Highlights

  • Experimental characterization of the genomic and epigenomic landscape of a human cell line or tissue (“biosample”) is expensive but can potentially yield valuable insights into the molecular basis for development and disease.

  • Several approaches have been proposed to address this challenge. Some scientific consortia, such as GTEx and ENTEX, aim to completely fill in a submatrix of selected assays and selected biosamples. Other consortia, such as the Roadmap Epigenomics Mapping Consortium [1] and ENCODE [2], adopted a roughly “L”-shaped strategy, in which consortium members focused on carrying out many assays in a small set of high-priority biosamples, and some assays were carried out over a much larger set of biosamples.

  • We first generated imputations of epigenomic and transcriptomic experiments using a recently developed imputation approach based on deep tensor factorization, named Avocado.


Summary

Introduction

Experimental characterization of the genomic and epigenomic landscape of a human cell line or tissue (“biosample”) is expensive but can potentially yield valuable insights into the molecular basis for development and disease. We cannot afford to completely fill in an experimental data matrix in which rows correspond to types of assays and columns correspond to biosamples. Several approaches have been proposed to address this challenge. Some scientific consortia, such as GTEx and ENTEX, aim to completely fill in a submatrix of selected assays and selected biosamples. Although imputation can fill in the matrix relatively completely, a drawback is that imputed data are potentially less trustworthy than actual experimental data.

