Abstract

Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome, either through computational predictions, such as genomic conservation, or through high-throughput experiments, such as the ENCODE project. These efforts have produced a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations, thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. Its ability to predict functional regions, together with its generalizable statistical framework, makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu

Highlights

  • Annotating functional elements in the human genome is a major goal in human genetics

  • We present GenoCanyon, a whole-genome annotation tool based on unsupervised statistical learning

  • Prediction results show that GenoCanyon can detect functional regions in the human genome, a capability that most existing whole-genome annotation tools lack

Summary

Introduction

Annotating functional elements in the human genome is a major goal in human genetics. High-throughput experiments, e.g. the ENCODE project[7], suggest that a large fraction of the human genome is functionally relevant. All of this evidence points to the need to extend annotation tools from the coding regions to the entire human genome. Prediction of deleteriousness does not cover every aspect of functional annotation. The potential of these variant classifiers for understanding genomic architecture on a large scale, and for detecting regulatory elements such as cis-regulatory modules, remains to be thoroughly investigated. A supervised approach needs gold-standard datasets to train the model, whereas an unsupervised approach uses no labeled data; in this article we focus on developing an unsupervised learning method. This is because current supervised-learning-based annotation tools suffer from highly biased training data, largely owing to our limited knowledge of non-coding regions. GenoCanyon's flexible and generalizable statistical framework could benefit future applications.
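To make the idea of unsupervised annotation concrete, the sketch below fits a two-class mixture model to binary annotation flags with the EM algorithm and reports, for each position, the posterior probability of belonging to the "functional" class. It is an illustration only and not GenoCanyon's implementation: the choice of a two-component Bernoulli mixture with independent annotations, the function name `fit_two_class_mixture`, the starting values, and the toy data with three annotations are all assumptions made here.

```python
import math


def fit_two_class_mixture(rows, n_iter=50):
    """EM for a two-component mixture of independent Bernoulli annotations.

    rows: list of equal-length lists of 0/1 annotation flags, one per position.
    Returns (prior, p_func, p_non, posteriors), where posteriors[i] is the
    estimated probability that position i belongs to the "functional" class.
    No labels are used; the classes are inferred from the flags alone.
    """
    n_feat = len(rows[0])
    eps = 1e-6                   # keeps probabilities away from 0 and 1
    prior = 0.5                  # assumed starting P(functional)
    p_func = [0.7] * n_feat      # assumed starting P(flag = 1 | functional)
    p_non = [0.3] * n_feat       # assumed starting P(flag = 1 | non-functional)
    post = []

    def clamp(p):
        return min(max(p, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability of the functional class per position.
        post = []
        for row in rows:
            log_f = math.log(prior)
            log_n = math.log(1.0 - prior)
            for x, pf, pn in zip(row, p_func, p_non):
                log_f += math.log(pf if x else 1.0 - pf)
                log_n += math.log(pn if x else 1.0 - pn)
            diff = min(log_n - log_f, 700.0)  # guard against overflow in exp
            post.append(1.0 / (1.0 + math.exp(diff)))

        # M-step: re-estimate the class prior and per-annotation rates.
        total = sum(post)
        rest = len(rows) - total
        prior = clamp(total / len(rows))
        for j in range(n_feat):
            on_f = sum(w * row[j] for w, row in zip(post, rows))
            on_n = sum((1.0 - w) * row[j] for w, row in zip(post, rows))
            p_func[j] = clamp(on_f / max(total, eps))
            p_non[j] = clamp(on_n / max(rest, eps))

    return prior, p_func, p_non, post


# Toy data (hypothetical): 12 positions, 3 binary annotations each.
toy_rows = [[1, 1, 1]] * 4 + [[1, 1, 0]] + [[0, 0, 0]] * 6 + [[0, 0, 1]]
prior, p_func, p_non, posteriors = fit_two_class_mixture(toy_rows)
for row, score in zip(toy_rows, posteriors):
    print(row, round(score, 3))
```

Which class ends up labeled "functional" depends on the starting values; here the class initialized with the higher annotation rates plays that role. A real analysis would use many more annotations, continuous as well as binary, over every position in the genome.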

