Using multiple reference genomes to identify and resolve annotation inconsistencies

Patrick J Monnahan, Jean-Michel Michno, Christine O'Connor, Nathan M Springer, Candice N Hirsch, Suzanne E McGaugh, Alex B Brohammer

doi:10.1186/s12864-020-6696-8

Abstract

Background: Advances in sequencing technologies have led to the release of reference genomes and annotations for multiple individuals within more well-studied systems. While each of these new genome assemblies shares substantial synteny with the others, the annotated structure of gene models within these regions can differ. Of particular concern are split-gene misannotations, in which a single gene is incorrectly annotated as two distinct genes or two genes are incorrectly annotated as a single gene. These misannotations can have major impacts on functional prediction, estimates of expression, and many downstream analyses.

Results: We developed a high-throughput method based on pairwise comparisons of annotations that detects potential split-gene misannotations and quantifies support for whether the genes should be merged into a single gene model. We demonstrated the utility of our method using gene annotations of three reference genomes from maize (B73, PH207, and W22), a difficult system from an annotation perspective due to the size and complexity of the genome. On average, we found several hundred of these potential split-gene misannotations in each pairwise comparison, corresponding to 3–5% of gene models across annotations. To determine which state (i.e. one gene or multiple genes) is biologically supported, we utilized RNAseq data from 10 tissues throughout development along with a novel metric and simulation framework. The methods we have developed require minimal human interaction and can be applied to future assemblies to aid in annotation efforts.

Conclusions: Split-gene misannotations occur at appreciable frequency in maize annotations. We have developed a method to easily identify and correct these misannotations. Importantly, this method is generic in that it can utilize any type of short-read expression data. Failure to account for split-gene misannotations has serious consequences for biological inference, particularly for expression-based analyses.

Highlights

The annotation of a genome is a useful resource in many modern sequencing endeavors
Our classification method is based on the expectation that the difference in expression across the split genes should be greater if the split-gene annotation is correct than if the merged-gene annotation is correct
To evaluate this degree of difference in expression patterns across the split genes, we developed the M2f (‘Mean 2-fold split-gene expression difference’) metric (Fig. 2a-b)
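
The expectation behind the classification can be illustrated with a small sketch. The exact definition of M2f is not given in this section, so the function below is not the authors' metric; it is a hypothetical stand-in that simply reports the fraction of samples (for example, tissues) in which two adjacent split genes differ in expression by more than 2-fold. The function name, the pseudocount, and the use of a strict threshold are all assumptions made for illustration.

```python
import math


def fraction_two_fold_different(expr_a, expr_b, pseudocount=1.0):
    """Fraction of samples in which two genes differ by more than 2-fold.

    Illustrative only: this is NOT the published M2f definition.
    expr_a, expr_b: per-sample expression values (same order, same length).
    pseudocount: assumed offset that avoids division by zero for unexpressed genes.
    """
    if len(expr_a) != len(expr_b) or len(expr_a) == 0:
        raise ValueError("expression vectors must be non-empty and of equal length")

    n_different = 0
    for a, b in zip(expr_a, expr_b):
        # log2 ratio of the two genes in this sample; |ratio| > 1 means > 2-fold
        log2_ratio = math.log2((a + pseudocount) / (b + pseudocount))
        if abs(log2_ratio) > 1.0:
            n_different += 1
    return n_different / len(expr_a)


# Hypothetical expression values for two adjacent gene models across four tissues.
gene_left = [10, 20, 0, 5]
gene_right = [11, 100, 0, 5]
score = fraction_two_fold_different(gene_left, gene_right)
```

Under this reading, a pair of gene models that really is one gene would tend to score near zero, because both halves track each other across tissues, while two genuinely distinct genes would more often diverge.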

Summary

Introduction

The annotation of a genome is a useful resource in many modern sequencing endeavors. It provides the initial link connecting mapping studies to functional impact. Despite the importance of developing high-quality annotations, and the exponential increase in annotated sequences over time that have come from the assembly of many new genomes, the annotation process remains notoriously error-prone [1, 6, 7]. Expression and maturation of transcripts and proteins are highly dynamic processes that vary over time as well as across different tissues, making it hard to differentiate between functional and intermediate forms. Biological errors such as transcriptional read-through, as well as chimeric transcripts, provide conflicting evidence about the true underlying gene(s). Advances in sequencing technologies have led to the release of reference genomes and annotations for multiple individuals within more well-studied systems. While each of these new genome assemblies shares substantial synteny with the others, the annotated structure of gene models within these regions can differ. Of particular concern are split-gene misannotations, in which a single gene is incorrectly annotated as two distinct genes or two genes are incorrectly annotated as a single gene. These misannotations can have major impacts on functional prediction, estimates of expression, and many downstream analyses.

Monnahan et al. BMC Genomics (2020) 21:281

Methods

Results

Discussion

Conclusion