Semi-Supervised Pipeline for Autonomous Annotation of SARS-CoV-2 Genomes.

Kristen L Beck, Vandana Mukherjee, Edward Seabolt, James H Kaufman, Gowri Nayar, Akshay Agarwal, Simone Bianco, Harsha Krishnareddy, Timothy A Ngo, Mark Kunitomi

doi:10.3390/v13122426

Open Access

Journal: Viruses
Publication Date: Dec 3, 2021
Citations: 6
License type: CC BY 4.0

Affiliation: IBM Research - Almaden

Abstract

SARS-CoV-2 genomic sequencing efforts have scaled dramatically to address the current global pandemic and aid public health. However, autonomous genome annotation of SARS-CoV-2 genes, proteins, and domains is not readily accomplished by existing methods, which produce missing or incorrect sequences. To overcome this limitation, we developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that differentiates itself by not relying on a single reference genome and by overcoming atypical genomic traits that challenge traditional bioinformatic methods. Using our method, we analyzed an initial corpus of 66,000 SARS-CoV-2 genome sequences collected from labs across the world and identified the comprehensive set of known proteins, including Replicase polyprotein 1ab (with its transcriptional slippage site), with 98.5% set membership accuracy and 99.1% accuracy in length prediction compared to proteome references. Compared to other published tools, such as Prokka (base) and VAPiD, our method yielded 6.4- and 1.8-fold increases in protein annotations, respectively. Our method generated 13,000,000 gene, protein, and domain sequences, some conserved across time and geography and others representing emerging variants. We observed 3362 non-redundant sequences per protein on average within this corpus and described the key D614G and N501Y variants spatiotemporally in the initial genome corpus. For spike glycoprotein domains, we achieved greater than 97.9% sequence identity to references and characterized receptor binding domain variants. We further demonstrated the robustness and extensibility of our method on an additional 4000 variant-diverse genomes containing all named variants of concern and interest as of August 2021. In this cohort, we successfully identified all keystone spike glycoprotein mutations in our predicted protein sequences with greater than 99% accuracy and demonstrated high accuracy of the protein and domain annotations. This work comprehensively presents the molecular targets to refine biomedical interventions for SARS-CoV-2 with a scalable, high-accuracy method to analyze newly sequenced infections as they arise.
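As an illustration of what checking a predicted spike protein sequence for a named substitution such as D614G or N501Y can look like, here is a minimal Python sketch. It is not the pipeline described in the abstract: the function names `parse_substitution` and `call_substitution` are hypothetical, the sequences are toy placeholders rather than real spike proteins, and the check assumes the predicted protein is ungapped relative to the reference so that residue positions line up.

```python
import re


def parse_substitution(label):
    """Split a label such as 'D614G' into (reference residue, 1-based position, variant residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"not a single-residue substitution: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def call_substitution(protein, label):
    """Classify the residue at the labelled position.

    Returns 'variant', 'reference', 'other', or 'out_of_range'.
    Assumes no insertions or deletions shift the coordinates (an assumption
    of this sketch, not a property of real data).
    """
    ref, pos, alt = parse_substitution(label)
    if pos < 1 or pos > len(protein):
        return "out_of_range"
    residue = protein[pos - 1]  # convert 1-based position to 0-based index
    if residue == alt:
        return "variant"
    if residue == ref:
        return "reference"
    return "other"


# Toy placeholder sequences: 700 alanines with the two residues of interest set.
toy_reference = list("A" * 700)
toy_reference[500] = "N"  # position 501
toy_reference[613] = "D"  # position 614
toy_variant = toy_reference.copy()
toy_variant[613] = "G"    # carries D614G but not N501Y

toy_reference = "".join(toy_reference)
toy_variant = "".join(toy_variant)

for label in ("D614G", "N501Y"):
    print(label, call_substitution(toy_variant, label))
```

On real data, positions would first need to be mapped through an alignment to the reference protein, since insertions and deletions shift coordinates.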

Highlights

The ongoing SARS-CoV-2 pandemic has undoubtedly shaped our lives as one of the most significant global health challenges of the 21st century
With regard to pipeline accuracy, we evaluated our pipeline against VAPiD [7], which created a special release for annotating SARS-CoV-2 genomic data, and Prokka [8], a prokaryotic genome annotation tool for bacteria and viruses
Since the start of the SARS-CoV-2 global pandemic, there have been immense efforts globally to sequence with near real-time efficiency the viral genomes observed in infected patients

Summary

Introduction

The ongoing SARS-CoV-2 pandemic has undoubtedly shaped our lives as one of the most significant global health challenges of the 21st century. Unlike in previous pandemics, we now have sequencing technology with tremendous throughput to analyze the genomic content of SARS-CoV-2. The first sequenced SARS-CoV-2 genome [1] was submitted to NCBI on 17 January 2020 and has become the accepted reference standard, commonly referred to as the Wuhan reference genome (NCBI RefSeq ID: NC_045512.2). The sequencing of SARS-CoV-2 isolates has increased dramatically to tens of thousands of genomes a week. The SARS-CoV-2 genome is a 29,000 base pair (bp) single-stranded RNA (38% GC content) encoding four structural proteins, two large polyproteins that are cleaved to form non-structural proteins, and several accessory proteins [2,3]. Two overlapping open reading frames are responsible for Replicase polyprotein 1a (pp1a) and Replicase polyprotein 1ab (pp1ab), which yield the longest products from the genome and the majority of the non-structural proteins.
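The GC content figure quoted above is the fraction of G and C bases among all unambiguous bases in the genome. A minimal sketch of that calculation follows; the function name `gc_content` and the short example string are illustrative assumptions, and ambiguous bases such as N are simply skipped here, which is one of several reasonable conventions.

```python
def gc_content(sequence):
    """Return the fraction of G and C among unambiguous bases (A, C, G, T, U)."""
    counted = [base for base in sequence.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(base in "GC" for base in counted)
    return gc / len(counted)


# Toy example: the two N bases are ignored, leaving 2 of 4 bases as G or C.
print(gc_content("ATGCNN"))
```

Applied to a full genome sequence loaded from a FASTA file, the same function would give the genome-wide value.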

Objectives

Methods

Results

Conclusion