Abstract

Background

Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. For querying biological annotations with Semantic Web technologies, there was no standard that described this potentially complex location information as subject-predicate-object triples.

Description

We have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned “omics” areas. Representing sequence positions in one data format that is independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations.

Conclusions

Our ontology allows users to uniformly describe, and potentially merge, sequence annotations from multiple sources. Data sources using FALDO can prospectively be retrieved using federated SPARQL queries against public SPARQL endpoints and/or local private triple stores.

Highlights

  • Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level

  • Data sources using the Feature Annotation Location Description Ontology (FALDO) can prospectively be retrieved using federated SPARQL Protocol and RDF Query Language (SPARQL) queries against public SPARQL endpoints and/or local private triple stores

  • There are many different conventions for storing genomic data and its annotations in plain-text flat file formats such as Generic Feature Format version 3 (GFF3), Genome Variation Format (GVF) [3], Gene Transfer Format (GTF) and Variant Call Format (VCF), and in more structured domain-specific formats such as those from the International Nucleotide Sequence Database Collaboration (INSDC) or UniProt, but none are flexible enough to discuss all aspects of genetics or proteomics

Introduction

Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Describing regions of biological sequences is a vital part of genome and protein sequence annotation, and of areas beyond this, such as describing modifications related to DNA methylation or glycosylation of proteins. Such regions range from one amino acid (e.g. phosphorylation sites in signalling cascades) to multi-megabase contigs mapped to a complete genome. Such annotation has been discussed in biological literature since at least 1949 [1] and recorded in biological databases since the first issue of the Atlas of Protein Sequence and Structure [2] in 1965. The fundamental designs of the formats used to record these annotations are inconsistent. For example, both zero-based and one-based counting standards exist, which is a regular source of off-by-one programming errors that experienced bioinformaticians learn to look out for.
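The off-by-one hazard mentioned above can be made concrete with a small sketch. It assumes two commonly encountered conventions, zero-based half-open intervals and one-based closed (inclusive) intervals. The `Interval` type and the function names are hypothetical, introduced only for illustration, and are not part of FALDO or of any particular file format.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A region on a sequence; the counting convention is tracked by the caller."""
    start: int
    end: int


def zero_half_open_to_one_closed(iv: Interval) -> Interval:
    """Convert [start, end) counted from 0 to [start, end] counted from 1.

    Only the start shifts by one: the exclusive zero-based end and the
    inclusive one-based end are the same number.
    """
    return Interval(iv.start + 1, iv.end)


def one_closed_to_zero_half_open(iv: Interval) -> Interval:
    """Inverse conversion: [start, end] counted from 1 to [start, end) from 0."""
    return Interval(iv.start - 1, iv.end)


def length_zero_half_open(iv: Interval) -> int:
    """Number of residues covered by a zero-based half-open interval."""
    return iv.end - iv.start


def length_one_closed(iv: Interval) -> int:
    """Number of residues covered by a one-based closed interval."""
    return iv.end - iv.start + 1


# The first ten residues of a sequence, written in both conventions.
first_ten_zero_based = Interval(0, 10)
first_ten_one_based = zero_half_open_to_one_closed(first_ten_zero_based)
```

Both intervals cover the same ten residues, but applying the wrong length formula, or forgetting to shift the start when moving between conventions, silently gains or drops one position.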
