Knowledge Extraction from Specimen-Derived Data from GenBank to Enrich Biodiversity Information

Takeru Nakazato

doi:10.3897/biss.5.73787

Abstract

DNA barcoding and environmental DNA (eDNA) studies are increasing the need to use gene sequences in biodiversity research. GBIF (Global Biodiversity Information Facility) and GGBN (Global Genome Biodiversity Network) are working on how gene sequences should be handled in this field (Finstad et al. 2020). Gene sequences have been collected and published by INSDC (International Nucleotide Sequence Database Collaboration) for over 30 years (Arita et al. 2020). Biodiversity information has been collected using standards such as Darwin Core (Wieczorek et al. 2012), but INSDC gene sequences are stored in their own format. In bioinformatics, researchers are also organizing the BioHackathon series, notably the NBDC/DBCLS BioHackathon and the spin-off Biohackathon Europe, to standardize data through the Semantic Web (Garcia Castro et al. 2021, Vos et al. 2020), but linkage with biodiversity information has only just begun. In this study, as an example of linking gene sequence information with biodiversity information, I attempted to construct an infrastructure for knowledge extraction using GenBank (Sayers et al. 2020) gene sequence entries derived from museum specimens. I have previously surveyed the BOLD (Barcode of Life Data System) (Ratnasingham and Hebert 2007) IDs listed in GenBank (Nakazato 2020). I downloaded the fish and insect data from the GenBank FTP (file transfer protocol) site. I then extracted the descriptions in the "specimen_voucher" field and obtained 749,627 specimen IDs for fish (28% of the fish entries in GenBank) and 1,621,890 for insects (13% of the insect entries). I also extracted from the "note" field approximately 1,000 entries describing the type of the specimen, such as "holotype", "lectotype", and "paratype". These extracts include descriptions written in natural language. NCBI (National Center for Biotechnology Information) publishes the BioCollections database (Sharma et al. 2019), and its data may help to refine these descriptions. In the future, I plan to map the extracted IDs to the collection IDs in biodiversity information databases. This will make it possible to enrich biodiversity information with GenBank descriptions, for example, by adding articles listed in GenBank as references to the specimen data.
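The extraction step described in the abstract can be illustrated with a minimal sketch. It is not the author's actual pipeline: it assumes GenBank flat-file input in which feature qualifiers are indented by 21 spaces, uses only the Python standard library, and the function names (extract_specimen_fields, type_status) and the sample record, including its accession, organism, and voucher, are made up for illustration.

```python
import re
from typing import Dict, Iterable, List

# Qualifier lines in a GenBank flat-file feature table are indented by
# 21 spaces, e.g.:  /specimen_voucher="..."
QUALIFIER_START = re.compile(r'^ {21}/(specimen_voucher|note)="(.*)$')
TYPE_TERMS = ("holotype", "lectotype", "paratype")


def extract_specimen_fields(lines: Iterable[str]) -> List[Dict]:
    """Collect accession, specimen_voucher and note values per record."""
    results = []
    record = None
    open_key = None  # qualifier whose quoted value continues on the next line
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("LOCUS"):
            record = {"accession": None, "specimen_voucher": [], "note": []}
            open_key = None
            continue
        if record is None:
            continue
        if line.startswith("//"):  # end-of-record marker
            results.append(record)
            record = None
            continue
        if line.startswith("ACCESSION"):
            fields = line.split()
            record["accession"] = fields[1] if len(fields) > 1 else None
            continue
        if open_key is not None:  # continuation of a wrapped value
            piece = line.strip()
            record[open_key][-1] += " " + piece.rstrip('"')
            if piece.endswith('"'):
                open_key = None
            continue
        match = QUALIFIER_START.match(line)
        if match:
            key, value = match.group(1), match.group(2)
            if value.endswith('"'):
                record[key].append(value[:-1])
            else:
                record[key].append(value)
                open_key = key
    return results


def type_status(notes: Iterable[str]) -> List[str]:
    """Return the type terms mentioned in free-text notes, in TYPE_TERMS order."""
    lowered = " ".join(notes).lower()
    return [term for term in TYPE_TERMS if term in lowered]


# Hypothetical record for illustration only; not a real GenBank entry.
PAD = " " * 21
SAMPLE = "\n".join([
    "LOCUS       XX000001     650 bp    DNA     linear   VRT 01-JAN-2020",
    "ACCESSION   XX000001",
    "FEATURES             Location/Qualifiers",
    "     source          1..650",
    PAD + '/organism="Examplefish exemplaris"',
    PAD + '/specimen_voucher="EXMUS-P 12345"',
    PAD + '/note="holotype; collected by a hypothetical',
    PAD + 'collector"',
    "ORIGIN",
    "//",
])

if __name__ == "__main__":
    for rec in extract_specimen_fields(SAMPLE.splitlines()):
        print(rec["accession"], rec["specimen_voucher"], type_status(rec["note"]))
```

Substring matching on the "note" text is deliberately crude. Because these notes are written in natural language, a real run would need manual review or a more careful matcher before counting type specimens.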

Journal: Biodiversity Information Science and Standards
Publication Date: Sep 1, 2021
Citations: 1
License type: CC BY 4.0

Similar Papers

Current situation of DNA Barcoding data in biodiversity and genomics databases and data integration for museomics
Takeru Nakazato
Biodiversity Information Science and Standards | VOL. 3
18 Jun 2019

Survey of Species Covered by DNA Barcoding Data in BOLD and GenBank for Integration of Data for Museomics
Takeru Nakazato
Biodiversity Information Science and Standards | VOL. 4
29 Sep 2020

A Challenge to Integrate Bioinformatics and Biodiversity Informatics Data as Museomics
Takeru Nakazato
Biodiversity Information Science and Standards | VOL. 2
22 May 2018

EDNAqua-Plan—Standardisation Overview for eDNA Sequencing of Aquatic Organisms and the Downstream Data Ecosystem
Joana Pauperio ... Jorge Moutinho
Biodiversity Information Science and Standards | VOL. 8
28 Aug 2024
