Abstract
Transcription regulation in multicellular eukaryotes is orchestrated by a number of DNA functional elements located at gene regulatory regions. Some regulatory regions (e.g. enhancers) are located far away from the gene they affect. Identifying distal regulatory elements is a challenge for bioinformatics research. Although existing methodologies have increased the number of computationally predicted enhancers, several key issues require further examination: the inconsistent performance of computational models across cell lines, class imbalance within the learning sets, and ad hoc rules for selecting enhancer candidates for supervised learning. In this study we developed DEEP, a novel ensemble prediction framework. DEEP integrates three components with diverse characteristics that streamline the analysis of enhancer properties in a great variety of cellular conditions. In our method we train many individual classification models, which we combine to classify DNA regions as enhancers or non-enhancers. DEEP uses features derived from histone modification marks or attributes derived from sequence characteristics. Experimental results indicate that DEEP performs better than four state-of-the-art methods on the ENCODE data. We report the first computational enhancer prediction results on FANTOM5 data, where DEEP achieves 90.2% accuracy and a 90% geometric mean (GM) of specificity and sensitivity across 36 different tissues. We further present results obtained using in vivo-derived enhancer data from the VISTA database. DEEP-VISTA, when tested on an independent test set, achieved a GM of 80.1% and an accuracy of 89.64%. The DEEP framework is publicly available at http://cbrc.kaust.edu.sa/deep/.
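The abstract reports both accuracy and the geometric mean (GM) of specificity and sensitivity. As a minimal sketch of how these two metrics relate, the snippet below computes them from confusion-matrix counts. The function name and the counts are hypothetical, chosen only to illustrate an imbalanced test set; they are not taken from the DEEP experiments.

```python
import math


def enhancer_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, gm, accuracy) from confusion-matrix counts.

    tp/fn: enhancers predicted correctly / missed.
    tn/fp: non-enhancers predicted correctly / wrongly called enhancers.
    """
    sensitivity = tp / (tp + fn)   # fraction of true enhancers recovered
    specificity = tn / (tn + fp)   # fraction of non-enhancers rejected
    gm = math.sqrt(sensitivity * specificity)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, gm, accuracy


# Hypothetical imbalanced test set: 100 enhancers, 900 non-enhancers.
sens, spec, gm, acc = enhancer_metrics(tp=60, fn=40, tn=880, fp=20)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} GM={gm:.3f} accuracy={acc:.3f}")
```

With these illustrative counts the accuracy is high because the majority class dominates, while the GM is pulled down by the weaker sensitivity. This is one reason a GM is often reported alongside accuracy when the learning sets are imbalanced.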
Highlights
Transcription regulation in human genes is a complex process [1,2]
To explore how well individual models trained on information from one cell line predict enhancers in other cell lines, we tested the performance of the Gm12878, H1hesc, Hep and Huvec ensemble classifiers on data from Hela and K562
A more thorough analysis of the generalization capabilities of individually deployed models revealed that a few cell lines share many common properties, and generalization becomes easier in such cases
Summary
Promoters are cis-regulatory regions that serve as anchor points for recruiting the multiprotein complexes required for transcription. Although these regions have been extensively studied, their underlying transcriptional mechanism is not yet fully understood [3]. In contrast to proximal elements, distal elements are not located near the genes whose activity they affect; they can lie 20 kb or more away, or even on different chromosomes. Their functional mechanism appears to be independent of whether they lie upstream or downstream of the genes they target. Silencers, repressors and insulators effectively have negative effects on the cellular transcriptional output, either through recruitment of transcriptional repressor proteins [7] or by preventing the spread of heterochromatin [8].