Choice of transcripts and software has a large effect on variant annotation.

Peter Humburg, Peter Donnelly, Kyle Gaulton, Jean-Baptiste Cazier, Alexander Kanapin, Manuel A Rivas, Davis J McCarthy

doi:10.1186/gm543

Open Access

https://doi.org/10.1186/gm543

Abstract

Background: Variant annotation is a crucial step in the analysis of genome sequencing data. Functional annotation results can have a strong influence on the ultimate conclusions of disease studies. Incorrect or incomplete annotations can cause researchers both to overlook potentially disease-relevant DNA variants and to dilute interesting variants in a pool of false positives. Researchers are aware of these issues in general, but the extent of the dependency of final results on the choice of transcripts and software used for annotation has not been quantified in detail.

Methods: This paper quantifies the extent of differences in annotation of 80 million variants from a whole-genome sequencing study. We compare results using the RefSeq and Ensembl transcript sets as the basis for variant annotation with the software Annovar, and also compare the results from two annotation software packages, Annovar and VEP (Ensembl’s Variant Effect Predictor), when using Ensembl transcripts.

Results: We found only 44% agreement in annotations for putative loss-of-function variants when using the RefSeq and Ensembl transcript sets as the basis for annotation with Annovar. The rate of matching annotations for loss-of-function and nonsynonymous variants combined was 79% and for all exonic variants it was 83%. When comparing results from Annovar and VEP using Ensembl transcripts, matching annotations were seen for only 65% of loss-of-function variants and 87% of all exonic variants, with splicing variants revealed as the category with the greatest discrepancy. Using these comparisons, we characterised the types of apparent errors made by Annovar and VEP and discuss their impact on the analysis of DNA variants in genome sequencing studies.

Conclusions: Variant annotation is not yet a solved problem. Choice of transcript set can have a large effect on the ultimate variant annotations obtained in a whole-genome sequencing study. Choice of annotation software can also have a substantial effect. The annotation step in the analysis of a genome sequencing study must therefore be considered carefully, and a conscious choice made as to which transcript set and software are used for annotation.

Highlights

Variant annotation is a crucial step in the analysis of genome sequencing data
In comparing ANNOVAR and Variant Effect Predictor (VEP), we focused on exonic variants (especially loss-of-function (LoF) and nonsynonymous variants), as these are currently of the greatest interest in the majority of annotation applications in whole-genome sequencing (WGS) studies
The comparison of annotation results from ANNOVAR using either the REFSEQ or ENSEMBL transcript sets shows that the choice of transcript set has a large effect on the ultimate variant annotations

Summary

Introduction

Variant annotation is a crucial step in the analysis of genome sequencing data. Functional annotation results can have a strong influence on the ultimate conclusions of disease studies. There are many different types of information that could be associated with variants, from measures of sequence conservation [3] to predictions about the effect of a variant on protein structure and function [4,5,6]. The coding sequences of the genome are, broadly speaking, the genes: ‘gene’ has come to refer principally to a genomic region producing (through transcription) polyadenylated mRNAs that encode a protein [7]. We refer to these polyadenylated mRNAs as ‘transcripts’, although the term transcript can refer to any RNA produced from the transcription of a genomic DNA sequence. Many separate transcripts may overlap any given position in the genome, and it is not uncommon for genes to have many different transcripts (or ‘isoforms’), of which they tend to express many simultaneously [8].
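Because several transcripts can overlap one position, a single variant can receive a different consequence from each transcript, and a comparison between transcript sets needs one annotation per variant. The following is a minimal sketch of one way to do this, assuming each variant is collapsed to its most severe consequence before the two sets are compared. The consequence labels, the severity ordering, the variant identifiers and the function names `most_severe` and `match_rate` are all hypothetical choices made for illustration; they are not the orderings or interfaces used by ANNOVAR or VEP, and this section does not state the exact procedure used in the study.

```python
from typing import Dict, List

# Hypothetical severity ranking, most severe first. This ordering is an
# assumption for illustration, not the precedence used by ANNOVAR or VEP.
SEVERITY = [
    "stopgain",
    "frameshift",
    "splicing",
    "nonsynonymous",
    "synonymous",
    "intronic",
    "intergenic",
]
RANK = {label: i for i, label in enumerate(SEVERITY)}


def most_severe(consequences: List[str]) -> str:
    """Collapse per-transcript consequences to the single most severe one."""
    if not consequences:
        # No overlapping transcript: treat the variant as intergenic.
        return "intergenic"
    return min(consequences, key=lambda label: RANK[label])


def match_rate(set_a: Dict[str, List[str]], set_b: Dict[str, List[str]]) -> float:
    """Fraction of shared variants whose collapsed annotations agree."""
    shared = set_a.keys() & set_b.keys()
    if not shared:
        raise ValueError("no variants in common between the two annotation sets")
    matches = sum(most_severe(set_a[v]) == most_severe(set_b[v]) for v in shared)
    return matches / len(shared)


# Toy data (invented): per-variant consequences, one entry per overlapping
# transcript, under two hypothetical transcript sets.
toy_set_a = {
    "var1": ["synonymous", "stopgain"],
    "var2": ["nonsynonymous"],
    "var3": ["intronic"],
    "var4": ["nonsynonymous", "synonymous"],
}
toy_set_b = {
    "var1": ["synonymous"],
    "var2": ["nonsynonymous"],
    "var3": ["splicing", "intronic"],
    "var4": ["nonsynonymous"],
}

print(match_rate(toy_set_a, toy_set_b))
```

In this toy example, var1 and var3 disagree only because one set contains an extra transcript at that position, which illustrates how transcript choice alone can change the final annotation of a variant.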

Methods

Results

Discussion

Conclusion