Abstract
Objective: A well-known limit of genome browsers is that the large amount of genome and gene data they provide is not organized in the form of a searchable database, hampering full management of numerical data and free calculations. Given the continuous increase of data deposited in genomic repositories, revision and analysis of their content is recommended. Using GeneBase, a software tool with a graphical interface able to import and elaborate National Center for Biotechnology Information (NCBI) Gene database entries, we provide tabulated spreadsheets, updated to 2019, for the human nuclear protein-coding gene data set, ready to be used for any type of analysis of genes, transcripts and gene organization.
Results: Comparison with previous reports reveals substantial change in the number of known nuclear protein-coding genes (now 19,116), the protein-coding non-redundant transcriptome space [now 59,281,518 base pairs (bp), 10.1% increase] and the number of exons (now 562,164, 36.2% increase), owing to a considerable increase in the RNA isoforms recorded. Other parameters, such as gene, exon or intron mean and extreme length, appear to have reached a stability that is unlikely to be substantially modified by human genome data updates, at least regarding protein-coding genes. Finally, we confirm that there are no human introns shorter than 30 bp.
Highlights
A well-known limit of genome browsers [1,2,3] is that the large amount of data they provide about the human genome and genes is not organized in the form of a searchable database [4], hampering full management of numerical data and free calculations on data subsets.
We have previously shown that GeneBase, a software tool with a graphical interface able to import and elaborate data available in the National Center for Biotechnology Information (NCBI) Gene database, allows users to perform original searches, calculations and analyses of the main gene-associated meta-information [5]; since the release of GeneBase 1.1, it can also provide descriptive statistical summaries such as median, mean, standard deviation and total for many quantitative parameters.
Database searching and export. To provide a curated set of updated statistics regarding human nuclear protein-coding genes and transcripts through GeneBase 1.1 Human, we considered only NCBI Gene records retrieved by searching for protein-coding gene type, with REVIEWED or VALIDATED Reference Sequence (RefSeq) gene status and with at least one REVIEWED or VALIDATED transcript, excluding records annotated as “not in current annotation release” (Genome_Annotation_Status field).
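The selection criteria above, followed by the kind of descriptive summary mentioned for GeneBase 1.1, can be illustrated with a minimal Python sketch. This is not GeneBase code and does not reproduce its interface: the dictionary keys (`gene_type`, `refseq_status`, `transcript_statuses`, `genome_annotation_status`, `gene_length`), the value `"current"`, and the toy records are all assumptions made for illustration only.

```python
from statistics import mean, median, stdev

# RefSeq status values accepted by the selection criteria described above.
ACCEPTED_STATUS = {"REVIEWED", "VALIDATED"}
EXCLUDED_ANNOTATION = "not in current annotation release"


def keep_record(rec):
    """Return True if a gene record passes all four selection criteria.

    The field names are hypothetical; they are not GeneBase column names.
    """
    return (
        rec["gene_type"] == "protein-coding"
        and rec["refseq_status"] in ACCEPTED_STATUS
        # at least one transcript must itself be REVIEWED or VALIDATED
        and any(s in ACCEPTED_STATUS for s in rec["transcript_statuses"])
        and rec["genome_annotation_status"] != EXCLUDED_ANNOTATION
    )


def summarize(values):
    """Descriptive statistics: count, total, mean, median, sample SD."""
    return {
        "n": len(values),
        "total": sum(values),
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
    }


def make(symbol, gene_type, status, transcripts, annotation, length):
    """Build one toy record (made-up values, not real NCBI Gene data)."""
    return {
        "symbol": symbol,
        "gene_type": gene_type,
        "refseq_status": status,
        "transcript_statuses": transcripts,
        "genome_annotation_status": annotation,
        "gene_length": length,
    }


records = [
    make("GENE_A", "protein-coding", "REVIEWED", ["REVIEWED", "MODEL"], "current", 1000),
    make("GENE_B", "protein-coding", "VALIDATED", ["VALIDATED"], "current", 3000),
    # excluded: gene status is neither REVIEWED nor VALIDATED
    make("GENE_C", "protein-coding", "PROVISIONAL", ["REVIEWED"], "current", 5000),
    # excluded: not a protein-coding gene
    make("GENE_D", "ncRNA", "REVIEWED", ["REVIEWED"], "current", 700),
    # excluded: flagged as not in the current annotation release
    make("GENE_E", "protein-coding", "REVIEWED", ["REVIEWED"], EXCLUDED_ANNOTATION, 900),
    # excluded: no REVIEWED or VALIDATED transcript
    make("GENE_F", "protein-coding", "REVIEWED", ["PREDICTED"], "current", 2000),
    make("GENE_G", "protein-coding", "VALIDATED", ["REVIEWED"], "current", 2000),
]

selected = [r for r in records if keep_record(r)]
stats = summarize([r["gene_length"] for r in selected])
print([r["symbol"] for r in selected], stats)
```

Expressing the criteria as a single predicate keeps each condition visible and independently testable; the same pattern would apply to any tabular export, provided the actual column names are substituted for the hypothetical ones used here.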
Summary
Comparison with previous reports reveals substantial change in the number of known nuclear protein-coding genes (now 19,116), the protein-coding non-redundant transcriptome space [now 59,281,518 base pairs (bp), 10.1% increase] and the number of exons (now 562,164, 36.2% increase), owing to a considerable increase in the RNA isoforms recorded. Other parameters, such as gene, exon or intron mean and extreme length, appear to have reached a stability that is unlikely to be substantially modified by human genome data updates, at least regarding protein-coding genes. We confirm that there are no human introns shorter than 30 bp.