Abstract
We assembled the sequences from deep RNA sequencing experiments by the Genotype-Tissue Expression (GTEx) project, to create a new catalog of human genes and transcripts, called CHESS. The new database contains 42,611 genes, of which 20,352 are potentially protein-coding and 22,259 are noncoding, and a total of 323,258 transcripts. These include 224 novel protein-coding genes and 116,156 novel transcripts. We detected over 30 million additional transcripts at more than 650,000 genomic loci, nearly all of which are likely nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells. The CHESS database is available at http://ccb.jhu.edu/chess.
Highlights
Scientists have been attempting to estimate the number of human genes for more than 50 years, dating back to 1964 [1].
In the decade preceding the initial publication of the human genome, multiple estimates were made based on sequencing of short messenger RNA fragments, and most of these estimates fell in the range of 50,000–100,000 genes [2,3,4,5].
To validate the coding potential of novel loci identified in this study, we searched the unmatched spectra from 30 human tissue/cell types against the novel predicted open reading frames (ORFs) described in this study.
Summary
Scientists have been attempting to estimate the number of human genes for more than 50 years, dating back to 1964 [1]. Novel transcripts may in some cases represent novel combinations of exons (e.g., exon-skipping events), but in many cases they include novel splice sites that create new exons and introns. To assess how well the catalogs agree, we compared all of the protein-coding and lncRNA transcripts in CHESS (version 2.1), RefSeq (release 108), and GENCODE (v28) to determine the number of (a) introns and (b) transcripts that were shared among all combinations of the three databases. To validate the coding potential of novel loci identified in this study, we searched the unmatched spectra from 30 human tissue/cell types (see the “Methods” section) against the novel predicted ORFs described in this study. Peptides identified in this search that were either identical to annotated proteins or mapped to them with a single mismatch were discarded. We note that the abundance of these novel transcripts was very low and the ORFs are relatively short, both of which may explain the small number of identified peptides.
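The three-way database comparison described above amounts to counting set intersections over every combination of catalogs. The following is a minimal sketch of that idea, not the authors' actual pipeline: the function name `shared_counts`, the choice of keying an intron by (chromosome, strand, start, end), and all coordinates are assumptions made up for illustration.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

# Hypothetical intron key: (chromosome, strand, start, end).
Intron = Tuple[str, str, int, int]


def shared_counts(catalogs: Dict[str, Set[Intron]]) -> Dict[Tuple[str, ...], int]:
    """Count the introns common to every non-empty combination of catalogs."""
    counts = {}
    names = sorted(catalogs)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Introns present in all catalogs of this combination.
            common = set.intersection(*(catalogs[name] for name in combo))
            counts[combo] = len(common)
    return counts


# Toy data with invented coordinates; real input would be parsed from annotation files.
a = ("chr1", "+", 100, 200)
b = ("chr1", "+", 300, 400)
c = ("chr2", "-", 50, 90)
d = ("chr3", "+", 10, 20)

catalogs = {
    "CHESS": {a, b, c},
    "RefSeq": {a, b},
    "GENCODE": {a, c, d},
}

counts = shared_counts(catalogs)
for combo, n in counts.items():
    print(" & ".join(combo), n)
```

The same function applies to transcripts if each transcript is reduced to a hashable key, for example the ordered tuple of its introns. Note that the counts are inclusive: an intron found in all three catalogs is also counted in each pairwise intersection.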