Abstract

Genetics research increasingly focuses on mining fully sequenced genomes and their annotations to identify the causal genes associated with traits (phenotypes) of interest. However, a complex trait is typically associated with multiple quantitative trait loci (QTLs), each comprising many genes that can positively or negatively affect the trait. To help breeders rank candidate genes, we developed an analytical platform called pbg-ld that provides semantically integrated genotypic and phenotypic data on Solanaceae species. The platform combines unstructured data from the scientific literature with structured data from publicly available biological databases using the Linked Data approach. In particular, QTLs were extracted from tables in full-text articles from the Europe PubMed Central (PMC) repository using QTLTableMiner++ (QTM), while genomic annotations were obtained from the Sol Genomics Network (SGN), UniProt and Ensembl Plants databases. These datasets were transformed into Linked Data graphs, which include cross-references to other relevant databases such as Gramene, Plant Reactome, InterPro and KEGG Orthology (KO). Users can query and analyze the integrated data through a web interface or programmatically via SPARQL and RESTful services (APIs). We illustrate the usability of pbg-ld by querying genome annotations, by comparing genome graphs, and through two biological use cases in Jupyter Notebooks. In the first use case, we performed a comparative genomics study using pbg-ld to compare the genetic mechanisms underlying tomato fruit shape and potato tuber shape. In the second use case, we developed an integrated workflow that uses genomic data from pbg-ld knowledge graphs and prioritization pipelines to predict candidate genes within QTL regions for metabolic traits of tomato.

Highlights

  • This QTL is marked by flanking markers C2_At2g14260 and TG400 on chromosome 11; pbg-ld retrieves the list of all genes in this region

  • Genomic knowledge discovery is often hampered by the challenge of integrating data from many independent databases and research articles


Summary

Introduction

The availability of annotated reference genome assemblies for several crop species, including tomato [1], potato [2], brassica [3] and cucumber [4], has enabled plant breeders and researchers to link traits to genomic locations. Mining genome annotations can help identify candidate genes that positively or negatively affect a trait that plant breeders aim to improve. However, genome annotations are typically scattered across multiple databases and file formats (e.g., the Generic Feature Format [GFF]), which hampers integrated data analyses.
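To make the file-based starting point concrete, the following is a minimal sketch of mining a GFF3 annotation for genes that overlap a genomic interval, such as a QTL region bounded by two flanking markers. It is not the pbg-ld implementation: the function names (`parse_gff3`, `genes_in_region`), the sample records, and all coordinates are hypothetical and chosen only for illustration. The sketch relies solely on the standard GFF3 layout of nine tab-separated columns with 1-based, inclusive coordinates.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Feature(NamedTuple):
    seqid: str
    type: str
    start: int  # 1-based, inclusive (GFF3 convention)
    end: int    # 1-based, inclusive
    attributes: dict


def parse_gff3(text: str) -> Iterator[Feature]:
    """Yield features from GFF3 text, skipping comments, blanks and malformed rows."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a valid GFF3 feature row
        # Column 9 holds semicolon-separated key=value pairs.
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        yield Feature(cols[0], cols[2], int(cols[3]), int(cols[4]), attrs)


def genes_in_region(
    features: Iterable[Feature], seqid: str, start: int, end: int
) -> List[str]:
    """Return IDs of gene features overlapping [start, end] on the given sequence."""
    return [
        f.attributes.get("ID", "")
        for f in features
        if f.type == "gene"
        and f.seqid == seqid
        and f.start <= end
        and f.end >= start
    ]


# Hypothetical records; identifiers and coordinates are made up for illustration.
SAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    "\t".join(["chr11", "example", "gene", "1000", "2500", ".", "+", ".", "ID=geneA"]),
    "\t".join(["chr11", "example", "mRNA", "1000", "2500", ".", "+", ".", "ID=mrnaA;Parent=geneA"]),
    "\t".join(["chr11", "example", "gene", "5000", "7000", ".", "-", ".", "ID=geneB"]),
    "\t".join(["chr11", "example", "gene", "9000", "9500", ".", "+", ".", "ID=geneC"]),
    "\t".join(["chr02", "example", "gene", "2000", "3000", ".", "+", ".", "ID=geneD"]),
])

if __name__ == "__main__":
    # Hypothetical region bounded by two marker positions on chr11.
    print(genes_in_region(parse_gff3(SAMPLE_GFF3), "chr11", 2000, 6000))
```

Even this small case shows why file-based analysis scales poorly: each additional database or format needs its own parser and identifier mapping before records can be compared, which is the integration burden described above.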
