DNA sequencing of museum specimens, also known as museomics, provides new insights into the study of biodiversity, including taxonomy, phylogeny, and environmental studies. Sequencing of specimens has also led to the rediscovery of extinct species (Suzuki et al. 2016), identification of related species (Waku et al. 2016), and analysis of ancient DNA (Kanzawa-Kiriyama et al. 2016). Nucleotide sequence data have been collected for more than 30 years under the framework of the International Nucleotide Sequence Database Collaboration (INSDC) by three institutes: the National Center for Biotechnology Information (NCBI) in the US, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ) (Arita et al. 2020). NCBI maintains the sequence database GenBank, which contained approximately 494 million sequences as of April 2022 (Sayers et al. 2021). GenBank provides qualifiers for describing various types of biodiversity information, such as "/specimen_voucher", "/lat_lon" (latitude and longitude), and "/collection_date". In addition, INSDC now requires that all submissions include the sampling location and date (INSDC 2023). I surveyed the biodiversity information assigned to GenBank records to determine the potential of GenBank as a biodiversity resource, using all GenBank data downloaded from the FTP site as of August 2023. The "/specimen_voucher" qualifier was introduced in Release 104 in December 1997 to describe specimen IDs. Its value was originally free text: for example, /specimen_voucher="Smith s. n. 4-IV-1995 (U. S. Natl. Herbarium)". From Release 162 in October 2007, a structured value of the form "[<institution-code>: [<collection-code>:]] <specimen_id>" was added (institution-code and collection-code are optional). The "/specimen_voucher" qualifier is present in 527,215 records (37.8%) for fish, 3,096,112 records (40.3%) for insects, and 1,505,556 records (39.0%) for flowering plants.
However, fewer than 10% of records list specimen IDs in this structured form. To make use of the ambiguous free-text specimen IDs in GenBank, they may need to be cleansed, using databases such as NCBI BioCollections and GRSciColl (Global Registry of Scientific Collections) or AI, and mapped to IDs in databases rich in specimen information, such as those of the Global Biodiversity Information Facility (GBIF) and the Barcode of Life Data System (BOLD). In GenBank, the BOLD ID is listed in the /db_xref qualifier in the "Features" field as an external database ID. Seventy percent of insect sequence records with a specimen ID in the /specimen_voucher qualifier are also assigned a BOLD ID (Nakazato and Jinbo 2022). Linking specimen IDs in biodiversity information databases such as GBIF with specimen IDs in GenBank is expected to further enhance the value of museum specimens. In addition, GenBank provides the /type_material qualifier for describing the type status of a voucher (e.g., holotype of Asphondylia bursicola). In GenBank insect data, over 2,000 records carried type material information, covering approximately 450 species, including 269 with holotypes. I found approximately 3,000 records with type information when the "/note" and "/specimen_voucher" qualifiers were searched in addition to "/type_material". Thus, GenBank has potential as a biodiversity information resource, but more effective use requires data mining and linkage with other specimen-based biodiversity databases.
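As an illustration of the distinction between free-text and structured "/specimen_voucher" values, the following minimal Python sketch separates the two forms and computes the structured share of a list of values. It is not the procedure used in this survey. The names `Voucher`, `parse_voucher`, and `structured_share` are hypothetical, and the rule that an institution or collection code is a short token beginning with a letter and containing no spaces is an assumed heuristic that would misclassify some free-text values containing a colon.

```python
import re
from typing import Iterable, NamedTuple, Optional


class Voucher(NamedTuple):
    institution_code: Optional[str]
    collection_code: Optional[str]
    specimen_id: str


# Assumed heuristic: a code starts with a letter and contains no whitespace.
_CODE = r"[A-Za-z][A-Za-z0-9<>\-]*"
_STRUCTURED = re.compile(
    rf"^(?P<inst>{_CODE}):(?:(?P<coll>{_CODE}):)?\s*(?P<spec>\S.*)$"
)


def parse_voucher(value: str) -> Voucher:
    """Split a /specimen_voucher value into its parts.

    Values that do not look structured are returned as free text,
    with both codes set to None.
    """
    value = value.strip()
    match = _STRUCTURED.match(value)
    if match is None:
        return Voucher(None, None, value)
    return Voucher(match.group("inst"), match.group("coll"),
                   match.group("spec").strip())


def structured_share(values: Iterable[str]) -> float:
    """Fraction of values that carry an institution code."""
    parsed = [parse_voucher(v) for v in values]
    if not parsed:
        return 0.0
    structured = sum(p.institution_code is not None for p in parsed)
    return structured / len(parsed)


if __name__ == "__main__":
    # The first value is the free-text example from the abstract;
    # the second is a made-up structured value for illustration.
    for v in ['Smith s. n. 4-IV-1995 (U. S. Natl. Herbarium)',
              'USNM:ENT:12345']:
        print(parse_voucher(v))
```

Free-text values that fall through the pattern are the ones that would need cleansing against a collection registry before they can be mapped to GBIF or BOLD records.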