From DNA sequences to operational reference databases: an opinionated approach using R

François Keck, Florian Altermatt

doi:10.3897/aca.4.e64936

Abstract

Reference databases of taxonomically assigned sequences are a key element of DNA-based identification of organisms. Accurate and complete reference databases are necessary to assign the correct taxonomic name to sequences obtained in metabarcoding studies. Today, many research projects using DNA metabarcoding include the development of a custom reference database, often derived from large repositories such as GenBank. At the same time, many projects focus on developing ready-to-use databases that are validated by experts and target specific markers and taxonomic groups. While mainstream tools such as spreadsheet software may be suitable for managing small databases, they quickly become insufficient as the amount of data increases and validation operations become more complex. There is a clear need for user-friendly and powerful tools to manipulate biological sequences and manage reference databases. The R language, which is free software and has already been adopted by many researchers for their analyses, is highly suitable for developing such tools.

In this talk, we will outline the approach we recommend for handling small to medium-sized reference databases, which still account for the majority of projects. We will argue that a simple tabular approach, in which each sequence constitutes an observation, may be the most adequate. While such a single table may be less flexible and less optimized than relational databases or more complex data structures, it is easy to maintain and allows the direct use of modern data-frame-centric tools.

We will specifically present and discuss two R packages that can be used jointly to make reference database development more accessible and more reproducible. First, we will briefly introduce bioseq (Keck 2020), which is dedicated to biological sequence manipulation and analysis. The package implements classes and functions that make analyses of complex datasets including DNA, RNA or protein sequences as simple as possible. The strength of bioseq is that it provides standard and more advanced functions for low-level operations through a simple and consistent programming interface. Then we will present refdb, which has been developed as an environment for semi-automatic and assisted construction of reference databases. The refdb package is a reference database manager offering a set of powerful functions to import, organize, clean, filter, audit and export the data. We will outline how these two packages together can speed up reference database generation and handling, and contribute to standardization and repeatability in metabarcoding studies.
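The single-table layout described above, with one sequence per row, can be illustrated with a minimal base R sketch. It does not use bioseq or refdb, whose dedicated functions are not shown here. The toy records, the column names and the `min_length` threshold are assumptions made purely for illustration.

```r
# Hypothetical toy reference table: one row per sequence (one observation),
# with taxonomy stored in ordinary columns next to the sequence itself.
ref <- data.frame(
  id       = c("seq1", "seq2", "seq3", "seq4"),
  genus    = c("Navicula", "Navicula", "Nitzschia", NA),
  species  = c("Navicula radiosa", "Navicula radiosa", "Nitzschia palea", NA),
  sequence = c("ACGTACGTAC", "ACGTACGTAC", "ACGTNNGT", "ACG-TACGTTAGC"),
  stringsAsFactors = FALSE
)

# Clean: strip alignment gaps so that lengths reflect actual bases.
ref$sequence <- gsub("-", "", ref$sequence, fixed = TRUE)

# Audit: compute simple quality flags as additional columns.
ref$seq_length   <- nchar(ref$sequence)
ref$has_n        <- grepl("N", ref$sequence, fixed = TRUE)
ref$missing_taxo <- is.na(ref$species)

# Filter: keep records with a species name, no ambiguous base (N)
# and a minimum length (hypothetical threshold).
min_length <- 10
clean <- ref[!ref$missing_taxo & !ref$has_n & ref$seq_length >= min_length, ]

# Deduplicate: drop identical sequences assigned to the same species.
clean <- clean[!duplicated(clean[, c("species", "sequence")]), ]

# Export: build FASTA-formatted records from the remaining rows.
fasta <- paste0(">", clean$id, " ", clean$species, "\n", clean$sequence)
cat(fasta, sep = "\n")
```

Because every step is an ordinary operation on a data frame, the same workflow can be expressed with any data-frame-oriented toolset, which is the main practical argument for the single-table design.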



Journal: ARPHA Conference Abstracts
Publication Date: Mar 4, 2021
License type: CC BY 4.0

