Abstract
High-throughput sequence data retrieved from ancient or other degraded samples has led to unprecedented insights into the evolutionary history of many species, but the analysis of such sequences also poses specific computational challenges. The most commonly used approach involves mapping sequence reads to a reference genome. However, this process becomes increasingly challenging as the genetic distance between target and reference grows, or when contaminant sequences with high sequence similarity to the target species are present. The evaluation and testing of mapping efficiency and stringency are thus paramount for the reliable identification and analysis of ancient sequences. In this paper, we present ‘TAPAS’ (Testing of Alignment Parameters for Ancient Samples), a computational tool that enables the systematic testing of mapping tools for ancient data by simulating sequence data reflecting the properties of an ancient dataset and performing test runs using the mapping software and parameter settings of interest. We showcase TAPAS by using it to assess and improve the mapping strategy for a degraded sample from a banded linsang (Prionodon linsang), for which no closely related reference is currently available. This enables a 1.8-fold increase in the number of mapped reads without sacrificing mapping specificity. This increase in mapped reads effectively reduces the need for additional sequencing, thus making more economical use of time, resources, and sample material.
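To make the idea of testing mapping efficiency and specificity on simulated data concrete, the following is a minimal sketch in Python. It is not the TAPAS implementation. It assumes each simulated read carries a known origin (endogenous or contaminant) and a flag recording whether the mapper placed it on the reference. The names `SimulatedRead`, `mapping_metrics`, and `fold_change` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulatedRead:
    """One simulated read with known origin (hypothetical structure)."""
    name: str
    endogenous: bool  # True if simulated from the target, False if contaminant
    mapped: bool      # True if the mapper aligned the read to the reference


def mapping_metrics(reads):
    """Return sensitivity and specificity for one mapper/parameter setting.

    sensitivity: fraction of endogenous reads that were mapped
    specificity: fraction of contaminant reads that were left unmapped
    """
    tp = sum(r.endogenous and r.mapped for r in reads)
    fn = sum(r.endogenous and not r.mapped for r in reads)
    fp = sum((not r.endogenous) and r.mapped for r in reads)
    tn = sum((not r.endogenous) and (not r.mapped) for r in reads)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity}


def fold_change(mapped_after, mapped_before):
    """Ratio of mapped read counts between two parameter settings."""
    if mapped_before <= 0:
        raise ValueError("mapped_before must be positive")
    return mapped_after / mapped_before


# Toy data, invented purely for illustration.
example_reads = [
    SimulatedRead("e1", endogenous=True, mapped=True),
    SimulatedRead("e2", endogenous=True, mapped=True),
    SimulatedRead("e3", endogenous=True, mapped=True),
    SimulatedRead("e4", endogenous=True, mapped=False),
    SimulatedRead("c1", endogenous=False, mapped=True),
    SimulatedRead("c2", endogenous=False, mapped=False),
]
metrics = mapping_metrics(example_reads)
```

Comparing these two numbers across parameter settings is one way to check whether a more permissive setting recovers more endogenous reads without admitting more contaminant reads.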
Highlights
DNA retrieved from ancient or historical specimens is typically highly degraded into small fragments with damage-derived nucleotide mis-incorporations that complicate sequence analysis [1,2].
Large amounts of contaminant molecules are often present in the sample, which can hamper the identification of endogenous DNA sequences [3].
The specific computational challenges posed by ancient DNA data were identified early on in the high-throughput sequencing era, which has led to a number of recommended tools and adjustments (e.g., [1,4,5,6,7,8]).
Summary
DNA retrieved from ancient or historical specimens is typically highly degraded into small fragments with damage-derived nucleotide mis-incorporations that complicate sequence analysis [1,2]. Large amounts of contaminant molecules are often present in the sample, which can hamper the identification of endogenous DNA sequences [3]. The specific computational challenges posed by ancient DNA (aDNA) data were identified early on in the high-throughput sequencing era, which has led to a number of recommended tools and adjustments (e.g., [1,4,5,6,7,8]). Since the introduction of high-throughput sequencing, a large number of mapping tools have been developed, each with its own repertoire of parameters for fine-tuning performance (see [9]). This multitude of mapping tools, together with potential interactions between specific mapping parameters, can make it difficult to select the most appropriate approach to maximize mapping performance for a specific dataset. A number of studies have addressed this problem by exploring and comparing the behavior of different mapping tools and