Preserving sequence annotations across reference sequences.

Zuotian Tatum, Peter EM Taschner, Andrew P Gibson, Erik A Schultes, Jeroen FJ Laros, Marco Roos, Mark Thompson

doi:10.1186/2041-1480-5-s1-s6

Open Access

Abstract

Background: Matching and comparing sequence annotations of different reference sequences is vital to genomics research, yet many annotation formats do not specify the reference sequence types or versions used. This makes the integration of annotations from different sources difficult and error prone.

Results: As part of our effort to create linked data for interoperable sequence annotations, we present an RDF data model for sequence annotation using the ontological framework established by the OBO Foundry ontologies and the Basic Formal Ontology (BFO). We defined reference sequences as the common domain of integration for sequence annotations, and identified three semantic relationships between sequence annotations. In doing so, we created the Reference Sequence Annotation to compensate for gaps in the SO and in its mapping to BFO, particularly for annotations that refer to versions of consensus reference sequences. Moreover, we present three integration models for sequence annotations using different reference assemblies.

Conclusions: We demonstrated a working example of a sequence annotation instance, and how this instance can be linked to other annotations on different reference sequences. Sequence annotations in this format are semantically rich and can be integrated easily with different assemblies. We also identify other challenges of modeling reference sequences with the BFO.

Highlights

Matching and comparing sequence annotations of different reference sequences is vital to genomics research, yet many annotation formats do not specify the reference sequence types or versions used
We started by deriving our Resource Description Framework (RDF) model from the Browser Extensible Data (BED) format: (i) we identified the desired upper ontological framework for the domain of interest; (ii) we converted data in the BED track to RDF triples; (iii) we further transformed the resulting triples by adding class definitions and ontology mappings to the final model
We demonstrated a working data model of sequence annotations that can be preserved across different reference sequence assemblies
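
Step (ii) above, converting a BED track to RDF triples, can be illustrated with a minimal sketch. It is not the authors' implementation: the example.org namespaces, the predicate names (reference, start, end), the SequenceAnnotation class and the function bed_line_to_triples are all hypothetical placeholders, and the assembly label "hg19" is only an illustrative value. The sketch emits N-Triples strings using the Python standard library alone, and ties each annotation to an explicit, versioned reference sequence, which is the information plain BED records leave out.

```python
# Hypothetical namespaces and terms; none of these URIs come from the paper.
EX = "http://example.org/annotation/"
REF = "http://example.org/reference/"
VOCAB = "http://example.org/vocab#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"


def bed_line_to_triples(line, assembly):
    """Turn one tab-separated BED line into N-Triples strings.

    BED coordinates are 0-based and half-open; they are copied unchanged.
    `assembly` names the reference assembly the coordinates refer to.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # The fourth BED column (name) is optional; fall back to the coordinates.
    name = fields[3] if len(fields) > 3 else f"{chrom}_{start}_{end}"

    subject = f"<{EX}{assembly}/{name}>"
    reference = f"<{REF}{assembly}/{chrom}>"

    def literal(value):
        # Typed integer literal in N-Triples syntax.
        return f'"{value}"^^<{XSD_INT}>'

    return [
        # Placeholder class assertion, standing in for the ontology mapping of step (iii).
        f"{subject} <{RDF_TYPE}> <{VOCAB}SequenceAnnotation> .",
        # Explicit link to the versioned reference sequence.
        f"{subject} <{VOCAB}reference> {reference} .",
        f"{subject} <{VOCAB}start> {literal(start)} .",
        f"{subject} <{VOCAB}end> {literal(end)} .",
    ]


triples = bed_line_to_triples("chr1\t1000\t2000\tfeatureA", "hg19")
for triple in triples:
    print(triple)
```

Because the assembly is part of both the annotation URI and the reference URI in this sketch, the same BED interval converted against two assemblies yields two distinct resources, which could later be linked to each other rather than silently conflated.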

Summary

Introduction

Matching and comparing sequence annotations of different reference sequences is vital to genomics research, yet many annotation formats do not specify the reference sequence types or versions used. This makes the integration of annotations from different sources difficult and error prone.

Sequence annotations and their relationship with reference sequences

Sequence annotations are information artifacts that add biologically meaningful information to specific locations on genomic, gene, transcript or protein sequences. Variants are annotated with descriptions of sequence variations and positions according to the chosen transcript sequence. Disambiguation of the variant description is an essential step in the context of data integration and preservation.

Objectives

Results

Conclusion