National Center for Biotechnology Information Viral Genomes Project.

Scott Federhen, Vyvy Pham, Roman Tatusov, Sergei Resenchuk, Yiming Bao, Detlef Leipe, Mikhail Rozanov, Tatiana Tatusova

doi:10.1128/jvi.78.14.7291-7298.2004

Abstract

The Viral Genomes Project aims to provide molecular standards for viral genomic research. The project has produced over 1,600 records for more than 1,200 different species. The National Center for Biotechnology Information (NCBI) provides access to these data through the Entrez search and retrieval engine and offers visualization of the sequence information at various levels of detail. Taxonomically organized displays, precomputed sequence comparison data, and direct access to analytical tools allow researchers to analyze and compare viral genomes and proteomes quickly and conveniently. The Viral Genomes Project is a collaborative effort between NCBI staff and many dedicated scientists worldwide. The URL for the database is http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html.

As the number of viral records in the public sequence databases (GenBank, EMBL, and DDBJ) grows, retrieving a viral genomic sequence of interest with its associated information is becoming increasingly complex. High redundancy in the databases is a common problem for all organisms; in the case of viruses, however, the large number of available strains, isolates, and mutants further exacerbates the problem. For example, a search of Entrez Nucleotide currently retrieves more than 95,500 records for Human immunodeficiency virus 1 (HIV-1) and more than 22,500 records for Hepatitis C virus (HCV) alone; the total number of viral nucleotide records exceeds 220,000. Among these are both partial and complete genomic sequences, including partial sequences marked as complete genomes by submitters. Historically, sequence databases were merely archives of sequences directly submitted by users. Although the stricter submission procedure applied in recent years has greatly improved the quality of sequence records, a significant number of records are still underannotated, and the information in the old sequence records is often outdated.
Furthermore, viral genomes are remarkably variable, consisting of either single-stranded or double-stranded DNA or RNA in either linear or circular form and comprising one or more segments. This variability makes viral records especially prone to inaccuracies in the annotation of molecular information. To cope with these problems, NCBI has created the Viral Genomes Project as a part of the NCBI Genomes Project (19). Only complete or, occasionally, nearly complete viral genomic sequences missing only nontranslated portions (usually the ends of a genomic molecule) are being collected for this project, thereby greatly reducing redundancy. All available complete viral genomic sequences are being collected in order to faithfully represent the great genome variability found in many viruses. For example, 314 complete genome sequences of HIV-1 from various strains and isolates are included in the Entrez Genome collection, but only one sequence (NC_001802) has been selected as a reference (RefSeq) to serve as a molecular standard. RefSeq records are manually curated to correct and update content in the original sequence records, which often involves consultations with the original submitters and/or other outside experts. The collection of preselected reference sequences greatly facilitates comparison of the genomes of different viruses. As of December 2003, the Viral Genomes Project contained 1,677 viral reference genomic sequences representing 1,223 virus species, which make a significant contribution to the NCBI RefSeq collection (13). Figure 1 shows the growth of the viral RefSeq collection during the past 3 years.
FIG. 1. The growth of NCBI's Viral Genomes Project. The bars represent the numbers of new and all viral genome reference sequences in each quarter.
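As an illustration of how a reference record such as NC_001802 might be retrieved programmatically, the following minimal Python sketch builds a request URL for NCBI's E-utilities `efetch` service. The article itself describes only the interactive Entrez interface, so the use of E-utilities here is an assumption, as are the helper names `efetch_url` and `fetch_record`, which are hypothetical and introduced only for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of NCBI's E-utilities (an assumption of this sketch;
# the article describes only the interactive Entrez pages).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, rettype="fasta"):
    """Build an efetch URL for one nucleotide record (hypothetical helper)."""
    params = {
        "db": "nuccore",      # Entrez Nucleotide database
        "id": accession,      # e.g. the HIV-1 RefSeq accession NC_001802
        "rettype": rettype,   # "fasta" for sequence only, "gb" for a GenBank flat file
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_record(accession, rettype="fasta"):
    """Download the record as text; this function performs a network request."""
    with urlopen(efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call fetch_record("NC_001802") to download the record.
    print(efetch_url("NC_001802"))
```

Building the URL separately from the download keeps the request easy to inspect before any network traffic occurs; passing `rettype="gb"` would request the annotated GenBank flat file rather than the bare sequence.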
While a number of databases provide information on viral sequences, most of them are limited to certain families or groups (reviewed in references 4, 3, 6, and 12; http://www.dpvweb.net/index.php). The most comprehensive and well-established viral database, ICTVdB, provides “searchable descriptions of virus isolates, species, genera, families, orders; images of many viruses; and links to genomic and protein databanks” (2). ICTVdB has been a primary resource for information about biological properties of viruses. It plays a major role in viral taxonomic classification, on which our project relied heavily. ICTVdB does present links to viral sequences, but these sequences are original records from public sequence databases and therefore may contain inaccurate or outdated information. The Viral Genomes Project described here is the first comprehensive resource that provides access to the curated set of complete viral genomes in an easily navigable way and offers a collection of tools and precomputed results which greatly facilitate viral genome analysis. These precomputed analyses and tools include the global alignment of genome neighbors, available in both text and graphical forms; viral protein clusters (VOGs), (putative) functional and evolutionary groups of viral proteins derived from RefSeq genomes (which eliminates redundancy) and classified by sequence similarity; convenient VOG displays, including those integrated with the Conserved Domain Database (CDD), an NCBI collection of conserved protein domains; and a BLAST search against a selected set of viral proteins. To start exploring the Viral Genomes resources, go to http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html.

Full Text