Abstract

Background: Many genetic diseases are caused by mutations in non-coding regions of the genome. These mutations are frequently found in enhancer sequences, where they disrupt the regulatory program of the cell. Enhancers are short regulatory sequences in the non-coding part of the genome that are essential for the proper regulation of transcription. While experimental methods for identifying such sequences improve every year, our understanding of the rules behind enhancer activity has not progressed much in the last decade. This is especially true of tissue-specific enhancers, where predicting the specificity of enhancer activity remains a clear problem.

Results: We present a random-forest-based machine learning approach that matches the performance of current state-of-the-art methods for enhancer prediction. We then show that, like other published methods, it frequently cross-predicts enhancers as active in different tissues, which makes it less useful for predicting tissue-specific activity. We further show that this problem is related to a bias of enhancer-predicting models towards predicting gene promoters as active enhancers. Finally, we show that a two-step classifier can lower cross-prediction between tissues.

Conclusions: We provide whole-genome predictions of human heart and brain enhancers obtained with the two-step classifier.
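To make the notions of cross-prediction and a two-step classifier more concrete, the following is a minimal sketch and not the authors' implementation. It assumes one possible reading of the two-step idea: a region is called an active heart enhancer only if a first stage scores it as enhancer-like in heart and a second stage scores it as unlikely to be a promoter. The scorer functions, the feature names `heart_signal` and `promoter_signal`, the 0.5 threshold, and all numeric values are hypothetical placeholders chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical representation: one genomic region as a dict of toy scores in [0, 1].
Region = Dict[str, float]
Scorer = Callable[[Region], float]


def two_step_predict(region: Region, step_one: Scorer, step_two: Scorer,
                     threshold: float = 0.5) -> bool:
    """Call a region active only if both stages score at or above the threshold."""
    if step_one(region) < threshold:
        return False
    return step_two(region) >= threshold


def cross_prediction_rate(regions: List[Region],
                          predict: Callable[[Region], bool]) -> float:
    """Fraction of regions from a different tissue that the model calls active."""
    if not regions:
        return 0.0
    return sum(1 for r in regions if predict(r)) / len(regions)


# Stand-in scorers (assumptions): in practice these would be trained classifiers,
# for example random forests, rather than simple lookups.
def heart_enhancer_score(region: Region) -> float:
    return region["heart_signal"]


def not_promoter_score(region: Region) -> float:
    return 1.0 - region["promoter_signal"]


# Made-up toy values standing in for brain enhancers; NOT data from the paper.
brain_regions: List[Region] = [
    {"heart_signal": 0.7, "promoter_signal": 0.9},
    {"heart_signal": 0.6, "promoter_signal": 0.8},
    {"heart_signal": 0.2, "promoter_signal": 0.1},
    {"heart_signal": 0.8, "promoter_signal": 0.2},
]

# One-step heart model applied to brain enhancers.
one_step_rate = cross_prediction_rate(
    brain_regions, lambda r: heart_enhancer_score(r) >= 0.5
)
# Two-step heart model applied to the same brain enhancers.
two_step_rate = cross_prediction_rate(
    brain_regions,
    lambda r: two_step_predict(r, heart_enhancer_score, not_promoter_score),
)
print(one_step_rate, two_step_rate)
```

In this toy setup the second stage filters out the promoter-like regions that the one-step heart model would otherwise call active, so the cross-prediction rate on brain enhancers drops. The numbers only illustrate the bookkeeping and say nothing about the performance reported in the paper.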

Highlights

  • Many genetic diseases are caused by mutations in non-coding regions of the genome

  • The data collected by the ENCODE or Epigenome Roadmap [5] projects are invaluable as a source for computational attempts at building models that would be predictive beyond the collected data and perhaps eventually help define the principles of tissue-specific action of regulatory elements

  • In this work we report our findings based on applying Random Forest classification to the problem of enhancer prediction in the human genome

Summary

Introduction

Many genetic diseases are caused by mutations in non-coding regions of the genome. These mutations are frequently found in enhancer sequences, causing disruption to the regulatory program of the cell. While the exact molecular mechanism of enhancer-promoter interaction remains a field of active study, we have accumulated a large body of examples. Mapping all these elements using experimental techniques is currently unfeasible, as many cell types are too difficult to obtain in the large quantities required for experimental assessment of enhancer activity. This leads to a situation where we have hundreds of well-documented examples of regulatory elements functional in a certain context (i.e. cell type, developmental time) determined by a certain method (enhancer reporter assays [3], STARR-Seq [4], luciferase assays, in-situ hybridization, etc.). The data collected by the ENCODE or Epigenome Roadmap [5] projects are invaluable as a source for computational attempts at building models that would be predictive beyond the collected data and perhaps eventually help define the principles of tissue-specific action of regulatory elements.

The Author(s) BMC Medical Genomics 2017, 10(Suppl 1):
