Abstract

Background: The location and modular structure of eukaryotic protein-coding genes in genomic sequences can be automatically predicted by gene annotation algorithms. These predictions are often used for comparative studies on gene structure, gene repertoires, and genome evolution. However, automatic annotation algorithms do not yet correctly identify all genes within a genome, and manual annotation is often necessary to obtain accurate gene models and gene sets. As manual annotation is time-consuming, only a fraction of the gene models in a genome is typically manually annotated, and this fraction often differs between species. To assess the impact of manual annotation efforts on genome-wide analyses of gene structural properties, we compared the structural properties of protein-coding genes in seven diverse insect species sequenced by the i5k initiative.

Results: Our results show that the subset of genes chosen for manual annotation by a research community (3.5–7% of gene models) may have structural properties (e.g., lengths and exon counts) that are not necessarily representative of a species’ gene set as a whole. Nonetheless, the structural properties of automatically generated gene models are only altered marginally (if at all) through manual annotation. Major correlative trends, for example a negative correlation between genome size and exonic proportion, can be inferred from the automatically predicted and the manually annotated gene models alike. Conversely, some previously reported trends did not appear in either the automatically predicted or the manually annotated gene sets, pointing towards insect-specific gene structural peculiarities.

Conclusions: In our analysis of gene structural properties, automatically predicted gene models proved to be sufficiently reliable to recover the same gene-repertoire-wide correlative trends that we found when focusing on manually annotated gene models only. We acknowledge that analyses on the individual gene level clearly benefit from manual curation. However, as genome sequencing and annotation projects often differ in the extent of their manual annotation and curation efforts, our results indicate that comparative studies analyzing gene structural properties in these genomes can nonetheless be justifiable and informative.

Highlights

  • The location and modular structure of eukaryotic protein-coding genes in genomic sequences can be automatically predicted by gene annotation algorithms

  • Eukaryotic protein-coding gene structure is characterized by a modular organization of introns and exons, which are commonly identified in genome sequences using automated in silico gene annotation procedures [2]

  • Structural properties of manually annotated gene models and their predecessors: we assessed five structural properties of protein-coding genes when comparing automatically generated and manually annotated gene models: (i) unspliced transcript length, (ii) protein length, (iii) exon count per transcript, (iv) median exon length per transcript, and (v) median intron length per transcript. These properties were analyzed in two gene sets: (1) the full set of automatically generated gene models (AUTO) and (2) the full official gene set (OGS; a non-redundant merge of gene models that were manually annotated or added and automatically generated models)
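
The five structural properties listed above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `transcript_stats`, the `TranscriptStats` container, and the input conventions (exons as 1-based inclusive `(start, end)` pairs in the style of GFF3, the protein as an amino-acid string with an optional trailing `*`) are assumptions made here for illustration only.

```python
from statistics import median
from typing import List, NamedTuple, Optional, Tuple


class TranscriptStats(NamedTuple):
    """Hypothetical container for the five per-transcript properties."""
    unspliced_length: int
    protein_length: int
    exon_count: int
    median_exon_length: float
    median_intron_length: Optional[float]  # None for single-exon transcripts


def transcript_stats(exons: List[Tuple[int, int]], protein_seq: str) -> TranscriptStats:
    """Compute structural properties for one transcript.

    Assumes exons are non-overlapping (start, end) pairs with 1-based,
    inclusive coordinates, as in GFF3.
    """
    exons = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in exons]
    # An intron spans the gap between the end of one exon and the start of the next.
    intron_lengths = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]
    return TranscriptStats(
        unspliced_length=exons[-1][1] - exons[0][0] + 1,
        protein_length=len(protein_seq.rstrip("*")),  # ignore a stop-codon symbol
        exon_count=len(exons),
        median_exon_length=median(exon_lengths),
        median_intron_length=median(intron_lengths) if intron_lengths else None,
    )


# Toy transcript with three exons (coordinates are made up).
stats = transcript_stats([(101, 200), (301, 450), (601, 700)], "MKT*")
print(stats)
```

Applying such a function to every transcript in the AUTO and OGS sets would yield per-gene-set distributions that can then be compared; how the study itself aggregated or filtered transcripts is not specified in this excerpt.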

Introduction

The location and modular structure of eukaryotic protein-coding genes in genomic sequences can be automatically predicted by gene annotation algorithms. A major goal in the field of comparative genomics is to elucidate the factors that explain the variance of gene structures within and between species. It has been hypothesized, for example, that differential GC content of exons and introns within regions of low GC content in the genomes of mammals constitutes a marker for exon recognition during splicing and is a factor that stabilizes exon-intron boundaries [3, 4]. Hypotheses on the evolution of gene structure organization state that introns are generated by the insertion of non-autonomous DNA transposons [5] or, in birds, that selection on intron size is driven by the evolution of powered flight [6]. Such hypotheses and observations are based on the structural description of protein-coding gene repertoires. These repertoires are typically derived from automated annotations, with only a fraction of the gene models having been refined by manual annotation and curation.
