Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR.

Eric P Nawrocki

doi:10.1093/nargab/lqad002

Abstract

In 2020 and 2021, more than 1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package, which GenBank uses to automatically validate and annotate incoming viral sequences, was too slow and memory intensive to process many thousands of SARS-CoV-2 sequences in a reasonable amount of time. Additionally, long stretches of ambiguous N nucleotides, which are common in many SARS-CoV-2 sequences, prevented VADR from validating and annotating them accurately. VADR has been updated to annotate SARS-CoV-2 sequences more accurately and rapidly. Stretches of consecutive Ns are now identified and temporarily replaced with expected nucleotides to facilitate processing, and the slowest steps have been overhauled using blastn and glsearch, increasing speed, reducing the memory requirement from 64 Gb to 2 Gb per thread, and allowing simple, coarse-grained parallelization on multiple processors per host. VADR is now nearly 1000 times faster at processing SARS-CoV-2 sequences than it was in early 2020. It has been used to screen and annotate more than 1.5 million SARS-CoV-2 sequences since June 2020, and it is now efficient enough to cope with the current rate of hundreds of thousands of submitted sequences per month.
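
The idea of temporarily replacing stretches of Ns can be illustrated with a minimal Python sketch. This is not VADR's implementation: it assumes, for simplicity, that the input sequence has the same length as the reference and lines up with it position for position (no insertions or deletions), and the function names and the min_run threshold are hypothetical.

```python
import re


def replace_n_runs(seq, reference, min_run=5):
    """Fill runs of at least `min_run` uppercase Ns with reference bases.

    Assumption (illustration only): `seq` and `reference` have the same
    length and correspond position for position.
    Returns the filled sequence and the (start, end) span of each run.
    """
    if len(seq) != len(reference):
        raise ValueError("sketch assumes equal-length, pre-aligned sequences")
    chars = list(seq)
    runs = []
    for match in re.finditer(r"N{%d,}" % min_run, seq):
        start, end = match.span()
        # Same-length substitution, so later spans keep their coordinates.
        chars[start:end] = reference[start:end]
        runs.append((start, end))
    return "".join(chars), runs


def restore_n_runs(seq, runs):
    """Put the Ns back at the recorded spans once processing is finished."""
    chars = list(seq)
    for start, end in runs:
        chars[start:end] = "N" * (end - start)
    return "".join(chars)
```

Recording the spans is what makes the replacement temporary: downstream steps see a sequence without long N stretches, and the original Ns can be restored afterwards.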

Journal: NAR Genomics and Bioinformatics	Publication Date: Jan 10, 2023
Citations: 1	License type: cc-by-nc
