Building an Australian Reference Genome Atlas

Nigel Ward, Kathryn Hall, Sarah Richmond, Hamish Holewa, Jeff Christiansen, Nicholas dos Remedios

doi:10.3897/biss.6.91415

Abstract

Currently, genomics data for living species are stored in public and private repositories online. These repositories remain largely disconnected and only partially findable. The Australian Reference Genome Atlas (ARGA) Project is solving the problem of genomics data obscurity by creating an online platform where life sciences researchers can comprehensively and confidently search for data on taxa relevant to Australian research. At its most basic, ARGA is a tool for aggregating and indexing publicly available genomics (and genetics) data. We aim to improve the experience of discovering and accessing these data by building search functionality based on features such as phenotypic traits and predicted and observed species distributions, and by supporting data packaging and transfer to analysis environments. ARGA will index GenBank (National Institutes of Health (NIH), USA), the European Nucleotide Archive (EMBL-ENA), the database of Bioplatforms Australia, and selected DNA repositories in Australian faunal collections and herbaria. We will integrate these records with the occurrence records and taxonomic framework of the Atlas of Living Australia (ALA) to enrich the data and make them searchable using taxonomy, location, ecological characteristics and selected phenotypic data. The chief aims and outputs for the project are to: create a system to enable contextual metadata about a species to be used as a pointer to a variety of genomic data associated with that species; add functionality to that system to enable additional contextual information groupings, and community curation of these created groupings; create a user-facing web-accessible interface for the system; and devise a mechanism that allows researchers searching multiple genomic repositories via ARGA to select files for subsequent analysis and export them to other cloud-based analysis infrastructure.
Our approach to ARGA incorporates: ingesting species metadata from multiple sequence repositories into a consistent data format using Darwin Core Archive (DwC-A); processing metadata using the Pipelines system developed by the Global Biodiversity Information Facility (GBIF), and as implemented in the ALA and other Living Atlases; indexing metadata using a Solr search engine; and providing a front-end web interface for users to find, select and export sequence files to a number of cloud-based analysis platforms. Here we will present an overview of the ARGA infrastructure and demonstrate an early prototype of the platform. We will show how ARGA can be used to interrogate DNA sequence records for taxa relevant to Australian research questions, realising a vision where genomics-based solutions to biological questions in conservation, ecology, agriculture and biosecurity can be manifested.
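
The ingestion and indexing steps in this approach can be illustrated with a minimal sketch. It is not ARGA's implementation: the source field names, the `FIELD_TO_DWC` mapping, the helper functions, the placeholder accession and the Solr core name `arga` are all assumptions made for illustration. Only the Darwin Core term names (`scientificName`, `associatedSequences`, `country`, `datasetName`) and the standard Solr `q`, `rows` and `wt` query parameters are real.

```python
from urllib.parse import urlencode

# Hypothetical mapping from one repository's field names to Darwin Core terms.
# The left-hand names are illustrative; they are not GenBank or ENA field names.
FIELD_TO_DWC = {
    "organism": "scientificName",
    "accession": "associatedSequences",
    "country": "country",
    "source_db": "datasetName",
}


def to_dwc(record, mapping=FIELD_TO_DWC):
    """Rename mapped fields to Darwin Core terms; drop unmapped or empty ones."""
    return {
        mapping[field]: value
        for field, value in record.items()
        if field in mapping and value not in ("", None)
    }


def solr_select_url(base_url, filters, rows=10):
    """Build a Solr select URL that matches every field:"value" pair exactly."""
    clauses = []
    for field, value in sorted(filters.items()):
        # Escape backslashes and double quotes so the value stays one phrase.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"')
    query = " AND ".join(clauses) if clauses else "*:*"
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Made-up record; "ABC123" is a placeholder, not a real accession.
    raw = {
        "organism": "Phascolarctos cinereus",
        "accession": "ABC123",
        "country": "Australia",
        "notes": "not mapped, so dropped",
    }
    dwc = to_dwc(raw)
    print(dwc)
    # Assumed local Solr instance with a hypothetical core named "arga".
    print(solr_select_url(
        "http://localhost:8983/solr/arga/select",
        {"scientificName": dwc["scientificName"]},
    ))
```

Normalising each repository's records to shared Darwin Core terms before indexing means a single query field such as `scientificName` can, in principle, retrieve records regardless of which repository they came from.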

Full Text