Abstract
The last 20 years of advancement in sequencing technologies have led to the sequencing of thousands of microbial genomes, creating mountains of genetic data. While the efficiency of data generation improves almost daily, establishing meaningful relationships between taxonomic and genetic entities on this scale requires a structured and integrative approach. Currently, knowledge is distributed across a fragmented landscape of resources, from government-funded institutions such as the National Center for Biotechnology Information (NCBI) and UniProt, to topic-focused databases like the ODB3 database of prokaryotic operons, to the supplemental table of a primary publication. A major drawback of large-scale, expert-curated databases is the expense of maintaining and extending them over time. No entity apart from a major institution with stable long-term funding can consider this, and even the scope of such institutions is limited given the magnitude of microbial data being generated daily. Wikidata is an openly editable, semantic web compatible framework for knowledge representation. It is a project of the Wikimedia Foundation and offers knowledge integration capabilities ideally suited to the challenge of representing the exploding body of information about microbial genomics. We are developing a microbial-specific data model, based on Wikidata’s semantic web compatibility, which represents bacterial species, strains, and the genes and gene products that define them. Currently, we have loaded 43,694 gene and 37,966 protein items for 21 species of bacteria, including the human pathogenic bacterium Chlamydia trachomatis. Using this pathogen as an example, we explore complex interactions between the pathogen, its host, associated genes, other microbes, diseases and drugs using the Wikidata SPARQL endpoint. In our next phase of development, we will add another 99 bacterial genomes and their genes and gene products, totaling ∼900,000 additional entities.
This aggregation of knowledge will be a platform for community-driven collaboration, allowing microbial genetic data to be networked through knowledge shared by both data experts and domain experts.
Highlights
The relatively small and non-repetitive nature of microbial genomes, coupled with the rapid advancement of sequencing technology in the last decade, has led to the generation of a staggering amount of bacterial genome records
Wikidata is a project of the Wikimedia Foundation and offers knowledge integration capabilities ideally suited to the challenge of representing the exploding body of information about microbial genomics
Project is in the early stages of analyzing and cataloguing over ~200,000 environmental samples from around the world, and estimates that this will result in the sequencing of ~500,000 reconstructed microbial genomes [1]
Summary
The relatively small and non-repetitive nature of microbial genomes, coupled with the rapid advancement of sequencing technology in the last decade, has led to the generation of a staggering amount of bacterial genome records. Our model follows a hierarchical taxonomic ranking scheme in which each microbial species is assigned a Wikidata item (e.g. Chlamydia trachomatis, Q131065) defined by core properties such as ‘NCBI Taxonomy ID’ (P685). This information can be accessed through the various APIs offered by Wikidata (https://www.wikidata.org/w/api.php, https://query.wikidata.org/). Revisiting the example question regarding organisms that are likely to be related to the persistence of chlamydial infections, we can ask which microbes are located in the female urogenital tract and are capable of generating indole (Figure 6).
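The query shown in Figure 6 is not reproduced here. As a simpler illustration of programmatic access, the following minimal sketch builds a SPARQL query for the Wikidata query service that looks up a species item, its NCBI Taxonomy ID (P685), and gene items attached to its strains. Apart from Q131065 and P685, every identifier is an assumption about how the data might be modelled rather than something stated in the text: P171 (parent taxon) to link strains to the species, P703 (found in taxon) to link genes to strains, and Q7187 (gene) as the item class. The helper names `build_gene_query` and `run_query`, and the User-Agent string, are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint behind https://query.wikidata.org/
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_gene_query(species_qid: str, limit: int = 10) -> str:
    """Build a SPARQL query listing genes for strains of one species.

    Assumed modelling (not confirmed by the text): strains point to the
    species via P171, genes point to a strain via P703, and gene items
    are instances (P31) of Q7187.
    """
    if not (species_qid.startswith("Q") and species_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {species_qid!r}")
    return f"""
SELECT ?taxid ?strain ?gene ?geneLabel WHERE {{
  wd:{species_qid} wdt:P685 ?taxid .
  ?strain wdt:P171 wd:{species_qid} .
  ?gene wdt:P703 ?strain ;
        wdt:P31 wd:Q7187 .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {int(limit)}
""".strip()


def run_query(query: str) -> list:
    """Send the query to the endpoint and return the JSON result bindings."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url,
        headers={
            # Hypothetical client name; Wikimedia asks clients to identify themselves.
            "User-Agent": "wikidata-microbe-sketch/0.1 (example)",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_query(build_gene_query("Q131065"))` would send the query over the network. The rows returned depend on the current state of Wikidata, and the result may be empty if the assumed properties do not match how the items are actually modelled.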