Annotation of gene promoters by integrative data-mining of ChIP-seq Pol-II enrichment data

Ravi Gupta, Anirban Bhattacharyya, Sharmistha Pal, Ramana V Davuluri, Priyankara Wikramasinghe, Francisco A Perez

doi:10.1186/1471-2105-11-s1-s65

Open Access

https://doi.org/10.1186/1471-2105-11-s1-s65

Abstract

Background: Use of alternative gene promoters that drive widespread cell-type, tissue-type or developmental gene regulation in mammalian genomes is a common phenomenon. Chromatin immunoprecipitation methods coupled with DNA microarray (ChIP-chip) or massively parallel sequencing (ChIP-seq) are enabling genome-wide identification of active promoters in different cellular conditions using antibodies against Pol-II. However, these methods produce enrichment not only near the gene promoters but also inside the genes and in other genomic regions, owing to the non-specificity of the antibodies used in ChIP. Further, the use of these methods is limited by their high cost and strong dependence on cellular type and context.

Methods: We trained and tested different state-of-the-art ensemble and meta classification methods for identification of Pol-II enriched promoter and Pol-II enriched non-promoter sequences, each of length 500 bp. The classification models were trained and tested on a benchmark dataset, using a set of 39 different feature variables that are based on chromatin modification signatures and various DNA sequence features. The best performing model was applied to seven published ChIP-seq Pol-II datasets to provide genome-wide annotation of mouse gene promoters.

Results: We present a novel algorithm based on supervised learning methods to discriminate promoter-associated Pol-II enrichment from enrichment elsewhere in the genome in ChIP-chip/seq profiles. We accumulated a dataset of 11,773 promoter and 46,167 non-promoter sequences, each of length 500 bp, generated from RNA Pol-II ChIP-seq data of five tissues (Brain, Kidney, Liver, Lung and Spleen). We evaluated the classification models in building the best predictor and found that Bagging and Random Forest based approaches give the best accuracy. We applied the algorithm to seven different published ChIP-seq datasets to provide a comprehensive set of promoter annotations for both protein-coding and non-coding genes in the mouse genome. The resulting annotations contain 13,413 (4,747) protein-coding (non-coding) genes with single promoters and 9,929 (1,858) protein-coding (non-coding) genes with two or more alternative promoters, and a significant number of unassigned novel promoters.

Conclusion: Our new algorithm can successfully predict promoters from the genome-wide profile of Pol-II bound regions. In addition, our algorithm performs significantly better than existing promoter prediction methods and can be applied for genome-wide predictions of Pol-II promoters.

Highlights

Use of alternative gene promoters that drive widespread cell-type, tissue-type or developmental gene regulation in mammalian genomes is a common phenomenon
The major challenge in annotating promoters based on RNA polymerase II (Pol-II) enriched regions/peaks is the spread of the transcribing polymerase throughout the gene; as a result, all genomic regions bound by RNA Pol-II are enriched in these experiments, producing a significantly large number of enriched peaks after the initial statistical analysis [9]
Bagging, LogitBoost and Random Forest perform similarly to one another and slightly better than Rotational Forest, with overall positive predictive value greater than 95 and correlation coefficient greater than 0.9
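The two evaluation measures named in the highlights can be computed from a confusion matrix. The sketch below is illustrative only: it assumes that "correlation coefficient" refers to the Matthews correlation coefficient, as is common in promoter prediction work, and the counts are hypothetical rather than taken from the paper. A measure of this kind is useful when classes are imbalanced, as with the 11,773 promoter versus 46,167 non-promoter sequences in the benchmark dataset.

```python
import math


def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: share of predicted promoters that are true promoters."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient (assumed meaning of 'correlation coefficient')."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when any marginal total is zero and the ratio is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical confusion counts, chosen only to exercise the functions.
tp, fn = 90, 10   # true promoters: correctly called / missed
tn, fp = 380, 20  # non-promoters: correctly rejected / wrongly called

print(f"PPV = {ppv(tp, fp):.3f}")
print(f"CC  = {matthews_cc(tp, tn, fp, fn):.3f}")
```

Unlike plain accuracy, the correlation coefficient uses all four cells of the confusion matrix, so a classifier that labels everything as non-promoter scores 0 rather than benefiting from the majority class.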

Summary

Introduction

Use of alternative gene promoters that drive widespread cell-type, tissue-type or developmental gene regulation in mammalian genomes is a common phenomenon. The development of chromatin immunoprecipitation methods coupled with DNA microarray (ChIP-chip) technology or massively parallel sequencing (ChIP-seq) has enabled genome-wide identification of active promoters in different cells or tissues, using antibodies against RNA polymerase II (Pol-II) [6,7]. However, these methods produce enrichment not only near gene promoters but also inside genes and in other genomic regions, owing to the non-specificity of the antibodies used in ChIP. It is therefore not possible to identify promoters with high confidence based on RNA Pol-II ChIP-chip/seq enrichment data alone, which warrants the development of better classification algorithms for accurate identification of promoter-related Pol-II enriched regions.
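To make the idea of an ensemble classifier concrete, the following is a minimal sketch of bagging (bootstrap aggregation), one of the approaches the abstract reports as most accurate. It is not the authors' implementation: the decision-stump base learner, the two toy features (a hypothetical Pol-II-independent signal and a noise feature) and all function names are assumptions made for illustration, and the data are synthetic.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Pick the (feature, threshold, polarity) rule with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [1 if polarity * (row[j] - t) >= 0 else 0 for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, polarity)
    return best[1:]


def predict_stump(stump, row):
    j, t, polarity = stump
    return 1 if polarity * (row[j] - t) >= 0 else 0


def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    ensemble = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return ensemble


def predict_bagging(ensemble, row):
    """Majority vote over the ensemble (an odd n_estimators avoids ties)."""
    votes = Counter(predict_stump(s, row) for s in ensemble)
    return votes.most_common(1)[0][0]


def make_toy_data(n_per_class=100, seed=1):
    """Synthetic two-feature data: label 1 = 'promoter', label 0 = 'non-promoter'."""
    rng = random.Random(seed)
    X, y = [], []
    for label, mean in ((1, 3.0), (0, 0.0)):
        for _ in range(n_per_class):
            # Feature 0 is informative (hypothetical signal); feature 1 is pure noise.
            X.append([rng.gauss(mean, 0.5), rng.gauss(0.0, 1.0)])
            y.append(label)
    return X, y


X, y = make_toy_data()
ensemble = fit_bagging(X, y)
accuracy = sum(predict_bagging(ensemble, r) == t for r, t in zip(X, y)) / len(y)
print(f"training accuracy on toy data: {accuracy:.2f}")
```

Averaging many learners trained on resampled data reduces the variance of any single learner, which is the general motivation for Bagging and Random Forest. A real application would use the study's feature variables and a held-out test set rather than training accuracy.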

Methods

Results

Discussion

Conclusion

Journal: BMC Bioinformatics	Publication Date: Jan 1, 2010
Citations: 68	License type: cc-by
