Abstract

Worldwide digitization efforts such as iDigBio in the U.S. and DiSSCo in Europe will provide mass biodiversity data from scientific collections. This opens up a growing volume of data on wild-type organisms and enables the building of large biodiversity knowledge graphs comprising, inter alia, sequence, trait and occurrence data. Knowledge graphs model information as entities and their relationships, which good practice expresses as ontology-based annotations. Based on ontological descriptions, semantic similarity analysis makes it possible to link wild-type data to genomic and proteomic data of model organisms and thus supports knowledge discovery on crop wild relatives and underutilized species of interest for medicine, breeding and agriculture. Because classical similarity measures focus on recording differences between character states (aiming to describe disease phenotypes) rather than on the character states themselves, in the sense of trait variations, new methods for similarity search are required. Machine learning algorithms operate on feature vectors, which are numeric representations of data (images, class labels, etc.) in n-dimensional vector space. We established a machine-learning-based workflow for similarity search on biodiversity entities that uses feature learning on ontologies and an associated RDF knowledge graph to project structured trait data into vector space. Vectors are then compared with a similarity function (e.g. cosine similarity) to determine similarity between taxa based on trait semantics. We will present an application example of machine learning on biodiversity knowledge graphs using a pipeline built upon OPA2Vec, a method to generate feature vectors from the logical content of ontologies (Smaili et al. 2018), to successfully cluster plant species by life form and ecotype (e.g. tree vs. perennial plant) on the basis of their annotations with the Flora Phenotype Ontology (Hoehndorf et al. 2016).
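
The vector comparison step can be illustrated with a minimal sketch. It assumes that each taxon has already been mapped to a feature vector by an embedding method; the taxon names and three-dimensional vectors below are hypothetical toy values, not output of OPA2Vec or data from this work, and the helper names `cosine_similarity` and `rank_similar` are chosen here for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (assumption): real feature vectors would be
# learned from ontology annotations and have many more dimensions.
taxon_vectors = {
    "taxon_tree_a": [0.9, 0.1, 0.0],
    "taxon_tree_b": [0.8, 0.2, 0.1],
    "taxon_herb_c": [0.1, 0.9, 0.3],
}


def rank_similar(query, vectors):
    """Rank all other taxa by cosine similarity to the query taxon."""
    query_vec = vectors[query]
    scores = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in vectors.items()
        if name != query
    ]
    # Highest similarity first.
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_similar("taxon_tree_a", taxon_vectors):
        print(f"{name}: {score:.3f}")
```

In this toy setup the two tree-like vectors point in nearly the same direction and therefore rank as most similar to each other. Cosine similarity depends only on vector direction, not magnitude, which is one reason it is a common choice for comparing learned embeddings.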

Introduction

Mass biodiversity data from scientific collections will be provided by worldwide digitization efforts like iDigBio in the U.S. and DiSSCo in Europe.

Corresponding author: Claus Weiland (cweiland@senckenberg.de)
Received: 10 Jun 2019 | Published: 13 Jun 2019
Citation: Weiland C, Kulmanov M, Schmidt M, Hoehndorf R (2019) A Machine Learning Based Approach for Similarity Search on Biodiversity Knowledge Graphs.
