Abstract
GenBank® is a comprehensive database that contains publicly available DNA sequences for more than 165 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in the UK and the DNA Data Bank of Japan helps to ensure worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at http://www.ncbi.nlm.nih.gov.
Highlights
GenBank [1] is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotation, built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, MD.
NCBI builds GenBank primarily from the submission of sequence data from authors and from the bulk submission of expressed sequence tag (EST), genome survey sequence (GSS) and other high-throughput data from sequencing
SEQUENCE-BASED TAXONOMY
Database sequences are classified and can be queried using a comprehensive sequence-based taxonomy developed by NCBI in collaboration with EMBL and the DNA Data Bank of Japan (DDBJ) and with the valuable assistance of external advisers and curators.
Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, bibliographic references and a table of features listing areas of biological significance, such as coding regions and their protein translations, transcription units, repeat regions, and sites of mutations or modifications.
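The parts of an entry listed above can be pictured as a small record type. The following is only an illustrative sketch: the class names `Entry` and `Feature`, their field names, and every value in `example` are hypothetical placeholders chosen for this illustration, and they do not reproduce the actual GenBank flat-file layout or any real record.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """One row of the feature table (hypothetical model)."""
    kind: str              # e.g. "CDS" for a coding region, "repeat_region"
    location: str          # span on the sequence, e.g. "1..9"
    translation: str = ""  # protein translation, only meaningful for coding regions


@dataclass
class Entry:
    """The components of an entry as described in the text (hypothetical model)."""
    description: str       # concise description of the sequence
    organism: str          # scientific name of the source organism
    taxonomy: List[str]    # lineage of the source organism, broadest rank first
    references: List[str] = field(default_factory=list)
    features: List[Feature] = field(default_factory=list)

    def coding_regions(self) -> List[Feature]:
        """Return only the features that mark coding regions."""
        return [f for f in self.features if f.kind == "CDS"]


# Placeholder values only; this is not a real GenBank record.
example = Entry(
    description="placeholder description",
    organism="Example organism",
    taxonomy=["Rank1", "Rank2"],
    references=["placeholder reference"],
    features=[Feature("CDS", "1..9", "MK"), Feature("repeat_region", "20..40")],
)

for feature in example.coding_regions():
    print(feature.location, feature.translation)
```

Keeping the feature table as a list of typed rows makes it straightforward to filter for one kind of biological annotation, as `coding_regions` does here.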
The data in dbEST are processed further to produce the UniGene database of more than 700 000 gene-oriented sequence clusters representing over 50 organisms, as described in detail previously [4].
Summary
GenBank continues to grow at an exponential rate with 7.9 million new sequences added over the past 12 months. As of Release 143 in August 2004, GenBank contained over 41.8 billion nucleotide bases from 37.3 million individual sequences. Complete genomes (http://www.ncbi.nlm.nih.gov/Genomes/index.html) represent a growing portion of the database, with over 50 of more than 180 complete microbial genomes in GenBank deposited over the past year. The number of eukaryote genomes for which coverage and assembly are good continues to increase as well, with over 20 such assemblies available, including that of the reference human genome.
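To put the Release 143 figures in proportion, the short calculation below derives two rough quantities from them: the mean sequence length and the share of all sequences that were added in the past 12 months. It is a back-of-the-envelope sketch that assumes the quoted totals can be taken at face value (the text gives the base count only as "over 41.8 billion") and that the 12-month window ends at Release 143; the variable names are illustrative.

```python
# Figures quoted in the text for Release 143 (August 2004).
total_bases = 41.8e9       # "over 41.8 billion nucleotide bases" (lower bound)
total_sequences = 37.3e6   # "37.3 million individual sequences"
new_sequences = 7.9e6      # "7.9 million new sequences" in the past 12 months

# Mean number of bases per sequence record.
mean_length = total_bases / total_sequences

# Fraction of the current sequence count that was added in the past year.
share_new = new_sequences / total_sequences

print(f"mean sequence length: about {round(mean_length)} bases")
print(f"share added in past 12 months: about {round(100 * share_new, 1)}%")
```

Under these assumptions the mean works out to roughly 1121 bases per sequence, and about 21.2% of the sequences in the release were added within the preceding year.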