PASS: Protein Annotation Surveillance Site for Protein Annotation Using Homologous Clusters, NLP, and Sequence Similarity Networks.

Jin Tao, Shira L Broschat, Kelly A Brayton

doi:10.3389/fbinf.2021.749008

Abstract

Advances in genome sequencing have accelerated the growth of sequenced genomes but at a cost in the quality of genome annotation. At the same time, computational analysis is widely used for protein annotation, but a dearth of experimental verification has contributed to inaccurate annotation as well as to the propagation of annotation errors. Thus, a tool to help life scientists with accurate protein annotation would be useful. In this work we describe a website we have developed, the Protein Annotation Surveillance Site (PASS), which provides such a tool. The website consists of three major components: a database of homologous clusters of more than eight million protein sequences deduced from the representative genomes of bacteria, archaea, eukarya, and viruses, together with sequence information; a machine-learning software tool that periodically queries the UniProtKB database to determine whether protein function has been experimentally verified; and a queryable webpage that returns the FASTA headers of sequences from the cluster best matching an input sequence. The user can select from these sequences to create a sequence similarity network to assist in annotation, or can use their expert knowledge to choose an annotation from the cluster sequences. Illustrations demonstrating use of the website are presented.

Highlights

Recent advances in the development of high-throughput sequencing technologies and computing capacity have greatly improved the speed of genome sequencing and, as a consequence, have contributed to the exponential growth of genomes in public repositories (Tao et al., 2021).
Our annotation pipeline contains three major components: homologous clusters of close to 8.5 million protein sequences deduced from the representative genomes of bacteria, archaea, eukarya, and viruses; a software tool that periodically queries the UniProtKB database to determine whether protein function has been experimentally verified; and a queryable webpage.
TABLE 1 | Number of representative genomes of bacteria, archaea, eukarya, and viruses obtained using the Genome Information Browser by Organism available through the National Center for Biotechnology Information (NCBI) in July 2020.
To determine whether functional annotation has been confirmed for a protein sequence, we developed a smart natural language processing (NLP) tool that periodically and automatically queries the UniProtKB database to determine whether a protein function has been experimentally verified.

Summary

Introduction

Recent advances in the development of high-throughput sequencing technologies and computing capacity have greatly improved the speed of genome sequencing and, as a consequence, have contributed to the exponential growth of genomes in public repositories (Tao et al., 2021). In Lockwood et al. (2019), homologous clustering was performed using protein sequences downloaded from the National Center for Biotechnology Information (NCBI). Forty-four non-GroEL protein sequences were incorrectly annotated as chaperonin GroEL by the authors submitting the sequences. For this universally conserved protein, the preferred annotation in the UniProtKB/Swiss-Prot database is 60 kDa chaperonin, while NCBI RefSeq annotates it as molecular chaperone GroEL (Lockwood et al., 2019), demonstrating a clear inconsistency even between the two databases. Awareness of problems with protein annotation and, in particular, of error propagation indicates that a tool to help life scientists with accurate protein annotation would be useful.
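The naming inconsistency described above can be made concrete with a small sketch. It is not the PASS implementation; it only illustrates, under stated assumptions, how the FASTA headers of one homologous cluster could be tallied so that competing annotations for the same protein become visible. The function names (`parse_fasta_headers`, `tally_annotations`), the accessions, the organism tags, and the sequences are all hypothetical, and the header layout (accession, then description, then an optional bracketed organism) is assumed.

```python
from collections import Counter


def parse_fasta_headers(fasta_text):
    """Return (accession, description) pairs from the header lines of a FASTA string."""
    records = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumed layout: ">accession description [organism]"
            accession, _, description = line[1:].strip().partition(" ")
            records.append((accession, description))
    return records


def tally_annotations(records):
    """Count each description, ignoring case and any trailing [organism] tag."""
    counts = Counter()
    for _, description in records:
        name = description.split(" [")[0].strip().lower()
        counts[name] += 1
    return counts


# Hypothetical cluster: accessions, organisms, and sequences are made up.
example_cluster = """\
>seq1 molecular chaperone GroEL [Organism A]
MAAKDVKF
>seq2 60 kDa chaperonin [Organism B]
MAAKEVKF
>seq3 molecular chaperone GroEL [Organism C]
MAAKDIKF
"""

records = parse_fasta_headers(example_cluster)
counts = tally_annotations(records)

# List the annotations from most to least frequent within the cluster.
for name, n in counts.most_common():
    print(f"{n}\t{name}")
```

A tally like this does not say which name is correct. It only shows that members of one cluster carry more than one annotation, which is the point at which expert knowledge or experimental evidence is needed.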

Methods

Results

Conclusion