Comparison of RefSeq protein-coding regions in human and vertebrate genomes.

Jessica H Fong,Kim D Pruitt,Terence D Murphy

doi:10.1186/1471-2164-14-654

Open Access

Abstract

Background: Advances in high-throughput sequencing technology have yielded a large number of publicly available vertebrate genomes, many of which are selected for inclusion in NCBI’s RefSeq project and subsequently processed by NCBI’s eukaryotic annotation pipeline. Genome annotation results are affected by differences in available support evidence and may be impacted by annotation pipeline software changes over time. The RefSeq project has not previously assessed annotation trends across organisms or over time. To address this deficiency, we have developed a comparative protocol which integrates analysis of annotated protein-coding regions across a data set of vertebrate orthologs in genomic sequence coordinates, protein sequences, and protein features.

Results: We assessed an ortholog dataset that includes 34 annotated vertebrate RefSeq genomes including human. We confirm that RefSeq protein-coding gene annotations in mammals exhibit considerable similarity. Over 50% of the orthologous protein-coding genes in 20 organisms are supported at the level of splicing conservation with at least three selected reference genomes. Approximately 7,500 ortholog sets include at least half of the analyzed organisms, show highly similar sequence and conserved splicing, and may serve as a minimal set of mammalian “core proteins” for initial assessment of new mammalian genomes. Additionally, 80% of the proteins analyzed pass a suite of tests to detect proteins that lack splicing conservation and have unusual sequence or domain annotation. We use these tests to define an annotation quality metric that is based directly on the annotated proteins and thus operates independently of other quality metrics such as availability of transcripts or assembly quality measures. Results are available on the RefSeq FTP site [http://ftp.ncbi.nlm.nih.gov/refseq/supplemental/ProtCore/SM1.txt].

Conclusions: Our multi-factored analysis demonstrates a high level of consistency in RefSeq protein representation among vertebrates. We find that the majority of the RefSeq vertebrate proteins for which we have calculated orthology are good as measured by these metrics. The process flow described provides specific information on the scope and degree of conservation for the analyzed protein sequences and annotations and will be used to enrich the quality of RefSeq records by identifying targets for further improvement in the computational annotation pipeline, and by flagging specific genes for manual curation.

Highlights

Advances in high-throughput sequencing technology have yielded a large number of publicly available vertebrate genomes, many of which are selected for inclusion in National Center for Biotechnology Information (NCBI)’s Reference Sequences (RefSeq) project and subsequently processed by NCBI’s eukaryotic annotation pipeline
Requiring only sequence data allows the process to be applied to the largest number of proteins, transcripts, and genome annotations available in NCBI databases
Our results provide a summary of specific, consistent differences in particular proteins, which may be valuable for internal review to improve the manually curated dataset and to identify targets for improvement of NCBI’s genome annotation pipeline

Summary

Introduction

Advances in high-throughput sequencing technology have yielded a large number of publicly available vertebrate genomes, many of which are selected for inclusion in NCBI’s RefSeq project and subsequently processed by NCBI’s eukaryotic annotation pipeline. The RefSeq project has not previously assessed annotation trends across organisms or over time. To address this deficiency, we have developed a comparative protocol which integrates analysis of annotated protein-coding regions across a data set of vertebrate orthologs in genomic sequence coordinates, protein sequences, and protein features. Gnomon is the core computational tool which integrates analysis of transcript and protein alignments and ab initio data to generate a set of annotation models which are further filtered before being selected for final RefSeq genomic annotation [8]. Although this method tends to produce gene and protein annotations that match known ones, it results in annotation differences that are supported by additional data available for one genome over another, or differences that are influenced by insufficient same-species transcript evidence, genome assembly issues, inexact exon definitions based on protein alignments, or limitations of the prediction method.
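
To make the "core protein" criteria from the abstract concrete, the following is a minimal sketch of how such a filter might look: an ortholog set is kept if it covers at least half of the analyzed organisms, shows highly similar sequence, and has conserved splicing. This is not the authors' implementation. The `OrthologSet` record, its field names, and the 0.8 identity cutoff are all hypothetical and chosen only for illustration; the source does not state a numeric similarity threshold.

```python
from dataclasses import dataclass


@dataclass
class OrthologSet:
    """Hypothetical summary record for one set of orthologous proteins."""
    set_id: str                # illustrative identifier, not a RefSeq accession
    organisms: frozenset       # organisms contributing a protein to this set
    min_identity: float        # lowest protein identity to the reference (0..1)
    splicing_conserved: bool   # True if splice positions agree with the reference


def is_core_candidate(s, n_organisms, identity_cutoff=0.8):
    """Apply the three criteria named in the abstract.

    identity_cutoff is an assumed value; the source gives no number.
    """
    covers_half = 2 * len(s.organisms) >= n_organisms
    similar = s.min_identity >= identity_cutoff
    return covers_half and similar and s.splicing_conserved


def select_core(sets, n_organisms, identity_cutoff=0.8):
    """Return the IDs of ortholog sets that pass all three criteria."""
    return [
        s.set_id
        for s in sets
        if is_core_candidate(s, n_organisms, identity_cutoff)
    ]


if __name__ == "__main__":
    example = [
        OrthologSet("A", frozenset({"human", "mouse", "dog"}), 0.95, True),
        OrthologSet("B", frozenset({"human"}), 0.99, True),           # too few organisms
        OrthologSet("C", frozenset({"human", "mouse"}), 0.50, True),  # low similarity
        OrthologSet("D", frozenset({"human", "mouse"}), 0.90, False), # splicing differs
    ]
    print(select_core(example, n_organisms=4))
```

Keeping the three criteria as separate boolean checks makes it easy to report which test a given ortholog set fails, which is the kind of per-protein information the abstract describes using to flag genes for manual curation.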

Methods

Results

Conclusion