Abstract

A standard practice in palaeogenome analysis is the conversion of mapped short read data into pseudohaploid sequences, frequently by selecting a single high-quality nucleotide at random from the stack of mapped reads. This controls for biases due to differential sequencing coverage, but it does not control for differential rates and types of sequencing error, which are frequently large and variable in datasets obtained from ancient samples. These errors have the potential to distort phylogenetic and population clustering analyses, and to mislead tests of admixture using D statistics. We introduce Consensify, a method for generating pseudohaploid sequences that controls for biases resulting from differential sequencing coverage while greatly reducing error rates. The error correction is derived directly from the data itself, without the requirement for additional genomic resources or simplifying assumptions such as contemporaneous sampling. For phylogenetic and population clustering analysis, we find that Consensify is less affected by artefacts than methods based on single read sampling. For D statistics, Consensify is more resistant to false positives and appears to be less affected by biases resulting from different laboratory protocols than other frequently used methods. Although Consensify was developed with palaeogenomic data in mind, it is applicable to any low to medium coverage short read dataset. We predict that Consensify will be a useful tool for future studies of palaeogenomes.
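
The single read sampling approach described above can be illustrated with a minimal Python sketch. It is not the Consensify algorithm and not the code of any published tool; the function names, the `(base, quality)` pileup representation, and the quality threshold of 30 are all assumptions made for illustration.

```python
import random

VALID_BASES = {"A", "C", "G", "T"}


def sample_pseudohaploid_base(read_stack, min_base_quality=30, rng=None):
    """Return one randomly chosen base from the reads covering a position.

    read_stack is a list of (base, base_quality) tuples (hypothetical format).
    Bases below the quality threshold or outside A/C/G/T are ignored.
    Returns "N" when no base passes the filters.
    """
    rng = rng or random
    candidates = [
        base
        for base, quality in read_stack
        if quality >= min_base_quality and base in VALID_BASES
    ]
    if not candidates:
        return "N"
    return rng.choice(candidates)


def pseudohaploidise(pileup, min_base_quality=30, seed=None):
    """Build a pseudohaploid sequence from a per-position list of read stacks."""
    rng = random.Random(seed)
    return "".join(
        sample_pseudohaploid_base(stack, min_base_quality, rng) for stack in pileup
    )


# Usage: four positions with differing coverage and base quality.
example_pileup = [
    [("A", 40), ("A", 35)],  # two agreeing high-quality reads
    [("C", 10)],             # only a low-quality read, so the call is N
    [],                      # no coverage, so the call is N
    [("G", 40), ("T", 5)],   # the low-quality T is filtered out
]
example_sequence = pseudohaploidise(example_pileup, seed=1)
```

Because exactly one base is drawn per position regardless of depth, the call does not depend on coverage, but a single erroneous read can still be the one selected. That is the source of error the abstract refers to.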

Highlights

  • The recovery of nuclear genomic data from ancient biological material—i.e., palaeogenomic data—is typically complicated by high levels of contamination, a low abundance of ancient nucleic acids, and the physical properties of the molecules themselves, such as short fragment length and the presence of miscoding and blocking lesions [1,2,3,4]

  • Because standard single nucleotide polymorphism (SNP) calling approaches involving the identification of heterozygous positions are likely to be error prone when applied to low coverage palaeogenome data, methods have been developed for bypassing these problems to some extent [14]

  • We demonstrate, using simulated palaeogenomic data, that Consensify is more resistant to false positives than other available methods, and that it is generally more conservative than other methods when applied to real-world empirical examples

Summary

Introduction

The recovery of nuclear genomic data from ancient biological material (i.e., palaeogenomic data) is typically complicated by high levels of contamination, a low abundance of ancient nucleic acids, and the physical properties of the molecules themselves, such as short fragment length and the presence of miscoding and blocking lesions [1,2,3,4]. A frequently used approach to account for this problem is to exclude transition sites. Although this is an effective means of dealing with transition errors resulting from cytosine deamination, extended terminal branches leading to ancient, relative to modern, samples are observed in published phylogenetic trees based on transversions only (e.g., [5]). The extended D statistic [22], rather than standard pseudohaploidisation, makes use of the complete read stack and can further apply a correction to error rates estimated by comparison to data from a high-quality “error-free” individual. This method assumes that an excess of singletons in the test dataset relative to the error-free individual is attributable to error, and uses this difference to correct the observed allele counts. Consensify represents a useful tool for future studies of palaeogenomes.
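
As a rough illustration of two ideas mentioned above, excluding transition sites and testing admixture with D statistics, the following Python sketch counts ABBA and BABA site patterns across four aligned pseudohaploid sequences and computes the standard D = (ABBA − BABA) / (ABBA + BABA). This is the basic D statistic, not the extended D statistic of [22] and not part of Consensify; the function names and the string-based input format are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID_BASES = PURINES | PYRIMIDINES


def is_transition(base_a, base_b):
    """True for A<->G or C<->T differences, the class affected by deamination."""
    pair = {base_a, base_b}
    return base_a != base_b and (pair <= PURINES or pair <= PYRIMIDINES)


def d_statistic(h1, h2, h3, outgroup, transversions_only=True):
    """Count ABBA/BABA patterns and return (D, n_abba, n_baba).

    The outgroup base is treated as ancestral. Sites with missing data,
    more than two alleles, or no derived allele in h3 are skipped.
    D is None when no informative site is found.
    """
    n_abba = n_baba = 0
    for p1, p2, p3, out in zip(h1, h2, h3, outgroup):
        bases = (p1, p2, p3, out)
        if any(base not in VALID_BASES for base in bases):
            continue  # missing or ambiguous call
        if p3 == out or len(set(bases)) != 2:
            continue  # not a biallelic site with the derived allele in h3
        if transversions_only and is_transition(p3, out):
            continue  # drop sites that could reflect cytosine deamination
        if p1 == out and p2 == p3:
            n_abba += 1
        elif p1 == p3 and p2 == out:
            n_baba += 1
    total = n_abba + n_baba
    d_value = (n_abba - n_baba) / total if total else None
    return d_value, n_abba, n_baba


# Usage: five aligned sites; the second is an A/G transition.
h1, h2, h3, out = "AGTNG", "CAAAC", "CGAAG", "AATAC"
d_tv, abba_tv, baba_tv = d_statistic(h1, h2, h3, out)
d_all, abba_all, baba_all = d_statistic(h1, h2, h3, out, transversions_only=False)
```

In this toy alignment, dropping the transition site changes the pattern counts and therefore D, which shows how the site filter interacts with the test. Real analyses would also need a significance estimate (for example, a block jackknife), which is omitted here.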

The Consensify Method
Test Datasets
Generation of the Consensify Sequences
Effect of Consensify on Phylogenetic and Clustering Analysis
Effect of Consensify on Admixture Tests
Statistical Properties of the Consensify Method
Effect
Comparisons
Discussion
Evolutionary