Abstract

Sir, Whole-genome sequencing (WGS) of bacteria will soon be incorporated into clinical microbiological laboratory workflows where rapid, accurate and automated data analysis will be key. Currently, two major strategies are used for the detection and identification of antimicrobial resistance genes in sequence data: the transfer of annotation from reference whole-genome sequences and, more commonly, the identification of resistance genes via comparison against a compiled reference ‘pseudomolecule’ of resistance genes using short-sequence-read mapping techniques such as BWA (Burrows-Wheeler Aligner). The pseudomolecule, or resistance gene database, can be held locally or ‘server side’ on web sites such as ResFinder or the Comprehensive Antimicrobial Resistance Database (CARD). Sequences in such databases are often sourced from GenBank and then subjected to varying amounts of curation. However, errors in the annotation of sequence entries in GenBank can be as high as 80% in some protein families, and the rate of these errors has increased over time, which highlights the risk that errors may propagate and self-inflate. β-Lactamase gene sequences have an unusually complex nomenclature in which each amino-acid-changing single nucleotide polymorphism (SNP) can change the activity profile of the enzyme and result in a new allele variant. To control the accuracy of β-lactamase nomenclature, allele designations are collated independently of GenBank at www.lahey.org/Studies. We assessed the frequencies, temporal distribution and interrelatedness of mis-annotations among the three most clinically worrisome metallo-β-lactamases: types IMP, VIM and NDM. We also considered the checks and measures that may be needed when the technology is used to detect resistance genes and other clinically relevant factors in clinical microbiology laboratory WGS tests.
From a BLASTp query of GenBank using a set of sequences for all of the IMP, NDM and VIM carbapenemases detailed on the curated database at the Lahey Clinic, we found 581 full-length metallo-β-lactamase protein entries, consisting of 202, 96 and 283 entries for the IMP, NDM and VIM families, respectively. Inspection of phylogenetic trees for each family, comparison against the curated reference database at the Lahey Clinic and confirmation of errors via multiple alignments using ClustalW revealed that erroneously annotated entries occurred at an overall error frequency of 6% (n = 35). This was at the lower end of the reported average range of annotation error frequencies (5%–63%) in GenBank, which, given the complex nomenclature of these enzymes, is likely to be due to workers referring to the expert curation web site. Classification of the incorrect annotations revealed that misannotations, characterized by the incorrect allocation of allele numbers within a protein family, were the most common error type (n = 22, consisting of 11, 1 and 10 errors in the IMP, NDM and VIM families, respectively). The remaining errors were split between incomplete annotations (n = 6, with 5, 0 and 1 in the IMP, NDM and VIM families, respectively), such as the lack of an allele number (e.g. annotated only as VIM), and under-annotations (n = 7, with 4, 2 and 1 in the IMP, NDM and VIM families, respectively), typified by annotation as a carbapenemase or metallo-β-lactamase (Figure 1, and summarized in Table S1, available as Supplementary data at JAC Online). There is concern over the transmission and inflation of errors as next-generation sequencing and automated annotation processes become more widely used. To test whether there was any evidence of errors transmitting, becoming self-replicating and/or otherwise accumulating in GenBank, we looked for chains of errors that were identical or closely related in the phylogenetic trees we generated.
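The three error categories and the overall error frequency can be illustrated with a minimal Python sketch. It is not the procedure used in this study, which relied on phylogenetic trees and ClustalW alignments; it assumes instead a simple exact-match lookup against a curated allele table. The `CURATED` table, its toy sequences and the function names `classify` and `error_frequency` are hypothetical and for illustration only. The counts are those reported in the text.

```python
import re

# Hypothetical curated reference: allele name -> protein sequence.
# The strings are toy placeholders, not real enzyme sequences.
CURATED = {
    "IMP-4": "MKKLSV",
    "IMP-26": "MKKLFV",
    "VIM-2": "MFKLLS",
}


def classify(annotation, sequence, curated=CURATED):
    """Classify a deposited annotation against the curated allele table."""
    # Find the curated allele whose sequence is identical to the entry.
    true_allele = next(
        (name for name, seq in curated.items() if seq == sequence), None
    )
    if true_allele is None:
        return "unmatched"  # no identical curated sequence; needs manual review

    match = re.search(r"(IMP|NDM|VIM)(?:-(\d+))?", annotation.upper())
    if match is None:
        # Only a generic label, e.g. "carbapenemase"
        return "under-annotation"
    family, number = match.groups()
    if number is None:
        # Family given without an allele number, e.g. "VIM"
        return "incomplete"
    if f"{family}-{number}" == true_allele:
        return "correct"
    return "misannotation"


# Error counts and entry totals as reported in the text.
REPORTED = {
    "misannotation": {"IMP": 11, "NDM": 1, "VIM": 10},
    "incomplete": {"IMP": 5, "NDM": 0, "VIM": 1},
    "under-annotation": {"IMP": 4, "NDM": 2, "VIM": 1},
}
ENTRIES = {"IMP": 202, "NDM": 96, "VIM": 283}


def error_frequency(reported=REPORTED, entries=ENTRIES):
    """Return (total errors, total entries, errors / entries)."""
    total_errors = sum(sum(by_family.values()) for by_family in reported.values())
    total_entries = sum(entries.values())
    return total_errors, total_entries, total_errors / total_entries


if __name__ == "__main__":
    # A sequence identical to the toy IMP-26 but deposited as IMP-4
    print(classify("IMP-4", "MKKLFV"))
    errors, entries, freq = error_frequency()
    print(f"{errors}/{entries} = {freq:.1%}")
```

Summing the reported counts gives 35 errors among 581 entries, which matches the overall error frequency of approximately 6% stated above.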
In all three families the number of errors occurring in each year remained very low and did not exceed a maximum of four (IMPs in 2010) (Figure S1, available as Supplementary data at JAC Online). Furthermore, we found no evidence of the systematic transmission of errors as a result of (for example) automated, unchecked annotations (including when we examined their first entry notes) (Table S2, available as Supplementary data at JAC Online). Diversity in the types of error observed suggested that the sources of the errors were varied and possibly human-induced; for example, two sequences that were identical to IMP-26 were deposited as IMP-4 (GI|83583501| and GI|410066876|; Figure 1a) by researchers in different countries 4 years apart, prior to the advent of high-throughput WGS and widely available automated annotations. Thus errors can, even without automated processes, become self-replicating in a non-curated and ‘live’ database such as GenBank. An alternative route by which errors can be transmitted is via the transferral of annotations directly from reference whole-genome sequences. The observation of a mis-annotated VIM-2 sequence in the GenBank RefSeq database (Figure 1c) served as
