Abstract

Background

Pseudogenes are non-functional copies of protein-coding genes that typically follow a different molecular evolutionary path from functional genes. Including pseudogene sequences in DNA barcoding and metabarcoding analyses can lead to misleading results. None of the most widely used bioinformatic pipelines for processing marker gene (metabarcode) high-throughput sequencing data specifically accounts for the presence of pseudogenes in protein-coding marker genes. The purpose of this study is to develop a method to screen for nuclear mitochondrial DNA segments (nuMTs) in large COI datasets. We do this by: (1) describing gene and nuMT characteristics from an artificial COI barcode dataset, (2) showing the impact of two different pseudogene removal methods on perturbed community datasets with simulated nuMTs, and (3) incorporating a pseudogene filtering step in a bioinformatic pipeline that can be used to process Illumina paired-end COI metabarcode sequences. Open reading frame length and sequence bit scores from hidden Markov model (HMM) profile analysis were used to detect pseudogenes.

Results

Our simulations showed that it was more difficult to identify nuMTs from shorter amplicon sequences, such as those typically used in metabarcoding, than from the full-length DNA barcodes used to construct barcode libraries. It was also more difficult to identify nuMTs in datasets containing a high percentage of nuMTs. Existing bioinformatic pipelines used to process metabarcode sequences already remove some nuMTs, especially in the rare sequence removal step, but adding a pseudogene filtering step can remove up to 5% of sequences even when other filtering steps are in place.

Conclusions

Open reading frame length filtering, alone or combined with hidden Markov model profile analysis, can be used to effectively screen out apparent pseudogenes from large datasets.
There is more to learn from COI nuMTs such as their frequency in DNA barcoding and metabarcoding studies, their taxonomic distribution, and evolution. Thus, we encourage the submission of verified COI nuMTs to public databases to facilitate future studies.
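
The open reading frame length filter mentioned above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `longest_orf_length` and `flag_short_orfs`, the restriction to the three forward frames, and the interquartile-range cutoff (`k = 1.5`) are all assumptions chosen for illustration. The only fixed biological input is the stop codon set of the invertebrate mitochondrial genetic code (NCBI translation table 5), in which TAA and TAG are stops and TGA encodes tryptophan.

```python
from statistics import quantiles

# Stop codons under the invertebrate mitochondrial code (NCBI table 5).
# TGA is NOT a stop here; it encodes tryptophan.
STOP_CODONS = {"TAA", "TAG"}


def longest_orf_length(seq, stops=STOP_CODONS):
    """Length in nucleotides of the longest stop-free run of codons.

    Assumption: reads are already oriented, so only the three forward
    frames are scanned. No start codon is required, because amplicons
    are internal gene fragments.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in stops:
                best = max(best, run)
                run = 0  # a stop codon ends the current open stretch
            else:
                run += 3
        best = max(best, run)
    return best


def flag_short_orfs(lengths, k=1.5):
    """Flag ORF lengths below Q1 - k * IQR as putative pseudogenes.

    Assumption: the IQR rule and k = 1.5 are illustrative choices, not
    thresholds taken from the study.
    """
    q1, _, q3 = quantiles(lengths, n=4)
    cutoff = q1 - k * (q3 - q1)
    return [length < cutoff for length in lengths]
```

A typical use would be to compute `longest_orf_length` for every sequence variant in a dataset and pass the resulting list to `flag_short_orfs`. Because a short amplicon can contain a stop-free frame by chance, a length filter of this kind is expected to be less sensitive on metabarcode-length reads than on full-length barcodes, in line with the simulation results reported above.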

Highlights

  • Pseudogenes are non-functional copies of protein coding genes that typically follow a different molecular evolutionary path as compared to functional genes

  • Are all the cytochrome c oxidase subunit I (COI) sequences filtered out using ORFfinder + hidden Markov model (HMM) profile analysis nuMTs? This method of pseudogene removal cannot distinguish between genuine pseudogenes and technical artifacts from PCR or sequencing that cause frameshifts and introduce premature stop codons

  • We show that in a freshwater benthos COI metabarcode dataset we can remove up to 5% of arthropod exact sequence variants (ESVs) as putative nuclear mitochondrial DNA segments (nuMTs) even when other filtering steps are in place


Summary

Introduction

Pseudogenes are non-functional copies of protein-coding genes that typically follow a different molecular evolutionary path from functional genes. The purpose of this study is to develop a method to screen for nuclear mitochondrial DNA segments (nuMTs) in large COI datasets. We do this by: (1) describing gene and nuMT characteristics from an artificial COI barcode dataset, (2) showing the impact of two different pseudogene removal methods on perturbed community datasets with simulated nuMTs, and (3) incorporating a pseudogene filtering step in a bioinformatic pipeline that can be used to process Illumina paired-end COI metabarcode sequences. In this paper we use the term pseudogene in the general sense and the term nuMT to refer to nuclear-encoded copies of mitochondrial DNA (mtDNA). More apparent pseudogenes may have been inserted into the nuclear genome in the past, followed by the divergence of the nuMT and mtDNA, each evolving at different rates and under different constraints [13]. Including pseudogenes in phylogenetic, biodiversity, or population analyses may introduce noise, leading to overestimates of haplotype or species richness or to misleading identifications or relationships [13, 16,17,18,19,20,21,22,23].

