Abstract
Eukaryotic genes are typically interrupted by intragenic, noncoding sequences termed introns. However, some genes lack introns in their coding sequence (CDS) and are generally known as ‘single exon genes’ (SEGs). In this work, a SEG is defined as a nuclear, protein-coding gene that lacks introns in its CDS. Whereas many public databases of eukaryotic multi-exon genes are available, there are only two specialized databases for SEGs. The present work addresses the need for a more extensive and diverse database by creating SinEx DB, a publicly available, searchable database of predicted SEGs from 10 completely sequenced mammalian genomes, including human. SinEx DB houses the DNA and protein sequence information of these SEGs and includes their functional predictions (KOG) and the relative distribution of these functions within species. The information is stored in a relational database built with MySQL Server 5.1.33, and the complete dataset of SEG sequences and their functional predictions is available for download. SinEx DB can be interrogated through: (i) a browsable phylogenetic schema, (ii) BLAST searches against the in-house SinEx DB of SEGs and (iii) an advanced search mode in which the database can be searched by keywords and by any combination of species and predicted functions. SinEx DB provides a rich source of information for advancing our understanding of the evolution and function of SEGs.
Database URL: www.sinex.cl
Highlights
In most eukaryotic genes, the coding sequence (CDS) is interrupted by noncoding introns that are removed by splicing to generate mRNA.
There is no statistical correlation between the total number of genes with 5′- and/or 3′-untranslated regions (UTRs) [29] and the percentage of single exon genes (SEGs) (R² = 0.5009) among the analyzed mammalian genomes (Supplementary Figure S1).
SinEx DB complements existing databases such as retrogene DB [11] and pseudogene DB [27]. It could be used as a comparative platform for annotating single exon coding sequences in mammalian genomes.
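The R² value quoted in the highlights is a coefficient of determination. As a minimal sketch of how such a value can be obtained for two per-genome quantities (for example, the number of genes with annotated UTRs and the SEG percentage), the function below squares the Pearson correlation coefficient. The function name `r_squared` and the input values are illustrative assumptions; they are not the paper's data, and the source does not state which tool the authors used.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs.

    For one predictor this equals the squared Pearson correlation.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Placeholder values only (hypothetical, NOT taken from the paper):
utr_gene_counts = [1.0, 2.0, 3.0, 4.0]
seg_percentages = [1.0, -1.0, 1.0, -1.0]
print(r_squared(utr_gene_counts, seg_percentages))
```

In practice, one value per genome would be supplied in each list, in the same species order.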
Summary
In most eukaryotic genes, the coding sequence (CDS) is interrupted by noncoding introns that are removed by splicing to generate mRNA. We extend and complement the IGD and PIGD databases by creating SinEx DB, a publicly available, searchable database of predicted SEGs from 10 completely sequenced mammalian genomes, namely: human, chimpanzee, rhesus macaque, mouse, rat, dog, horse, pig, cow and opossum. The sequences of annotated mammalian genomes, assembled at the chromosome level, were downloaded from GenBank [23] at the FTP site on the NCBI web page (ftp://ftp.ncbi.nlm.nih.gov/genomes/), including: human (ref_GRCh37.p5), chimpanzee (ref_Pan_troglodytes-2.1.4), rhesus macaque (ref_Mmul_051212), mouse (ref_MGSCv37), rat (ref_RGSC_v3.4), dog (ref_CanFam2.0), horse (ref_EquCab2.0), pig (ref_Sscrofa10), cow (ref_Btau_4.2) and opossum (ref_MonDom5).
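The working definition of a SEG (a protein-coding gene whose CDS is not interrupted by introns) can be illustrated with a minimal sketch. It assumes that CDS segments have already been extracted from a genome annotation as `(gene_id, start, end)` tuples and that each gene is represented by a single transcript. The function name `find_single_exon_genes` and the example records are hypothetical, and this is not the prediction pipeline used to build SinEx DB, which this section does not describe.

```python
from collections import defaultdict


def find_single_exon_genes(cds_records):
    """Return sorted gene IDs whose CDS is one uninterrupted interval.

    cds_records: iterable of (gene_id, start, end), one tuple per CDS segment.
    Assumes one transcript per gene; identical duplicate segments are ignored.
    """
    segments = defaultdict(set)
    for gene_id, start, end in cds_records:
        segments[gene_id].add((start, end))
    # A gene with exactly one distinct CDS segment has no intron in its CDS.
    return sorted(g for g, segs in segments.items() if len(segs) == 1)


# Hypothetical annotation records, for illustration only
records = [
    ("geneA", 100, 900),   # single CDS segment
    ("geneB", 100, 250),   # CDS split across two segments
    ("geneB", 400, 700),
    ("geneC", 50, 1250),   # single CDS segment
]
print(find_single_exon_genes(records))
```

Because the test is applied to CDS segments only, a gene with introns confined to its UTRs would still count as a SEG under this sketch, in line with the definition given in the abstract.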