Abstract
Genome-wide, high-throughput methods for transcription start site (TSS) detection have shown that most promoters have an array of neighboring TSSs, some of which are used more than others, forming a distribution of initiation propensities. TSS distributions (TSSDs) vary widely between promoters, and earlier studies have shown that TSSDs have biological implications for both regulation and function. However, no systematic study has explored how many types of TSSDs, and by extension core promoters, exist, or which biological features distinguish them. In this study, we developed a new non-parametric dissimilarity measure and clustering approach to explore the similarities and stabilities of clusters of TSSDs. Previous studies have used arbitrary thresholds to arrive at two general classes: broad and sharp. We demonstrated that, in addition to the previous broad/sharp dichotomy, an additional category of promoters exists. Unlike typical TATA-driven sharp TSSDs, where the TSS position can vary by a few nucleotides, in this category virtually all TSSs originate from the same genomic position. These promoters lack the epigenetic signatures of typical mRNA promoters, and a substantial subset of them map upstream of ribosomal protein pseudogenes. We present evidence that these are likely mapping errors, which have confounded earlier analyses, caused by the high similarity of ribosomal gene promoters in combination with the known G addition bias in CAGE libraries. Thus, previous two-class separations of promoters based on TSS distributions are motivated, but the ultra-sharp TSS distributions will confound downstream analyses if not removed.
Highlights
The recruitment of the pre-initiation complex (PIC) to the transcription start site (TSS) is a complex interplay of many factors, including binding of transcription factors and epigenetic signals such as nucleosome occupancy and modification of histone tails [1].
We focus on the distribution of Cap Analysis of Gene Expression (CAGE) tags since this is the largest data set to date from multiple tissues: in particular, we use the FANTOM3 CAGE data from 22 tissues in mouse, provided by Carninci et al. [10].
In this study we systematically investigated TSS distributions (TSSDs) to see how many stable groupings of such distributions the data support, and compared these groups to previous classifications.
Summary
The recruitment of the pre-initiation complex (PIC) to the transcription start site (TSS) is a complex interplay of many factors, including binding of transcription factors and epigenetic signals such as nucleosome occupancy and modification of histone tails [1]. The completion of several genomes of higher eukaryotes has prompted the development of accurate genome-wide methods based on capturing capped transcripts and sequencing the first 20–30 nt from their 5′ ends using high-throughput DNA sequencers. Examples of these include Cap Analysis of Gene Expression (CAGE) [2], massively parallel Paired End Tag (PET)-tagging [3] and Oligocapping [4]. The number of tags mapping to a certain genomic region can be regarded as a measure of the amount of transcription initiation from this region, and these techniques can be used to identify promoters that are only used in certain tissues [6].
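To make the idea of tag counts as a measure of initiation concrete, the sketch below turns the 5′ end positions of tags mapped to one promoter region into a normalized TSS distribution. It is only an illustration under stated assumptions: the function name `tss_distribution` and the example coordinates are hypothetical, and it does not reproduce the dissimilarity measure or clustering approach developed in the study.

```python
from collections import Counter


def tss_distribution(tag_positions):
    """Convert 5' end positions of mapped tags into initiation frequencies.

    Hypothetical helper for illustration: each position's frequency is its
    tag count divided by the total number of tags in the region.
    """
    counts = Counter(tag_positions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pos: n / total for pos, n in sorted(counts.items())}


# Hypothetical 5' end coordinates of tags mapped to one promoter region.
tags = [100, 101, 101, 101, 102, 105]

tssd = tss_distribution(tags)

# The most frequently used start site in this toy region.
dominant = max(tssd, key=tssd.get)
```

In this toy region half of the tags start at the same position, so the distribution is fairly concentrated; a region where all tags share a single position would correspond to the ultra-sharp case described in the abstract.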