Abstract
Haploid high-quality reference genomes are an important resource in genomic research projects. A consequence of mapping against a single haploid reference is that DNA fragments carrying the reference allele are more likely to map successfully or to receive higher quality scores. This reference bias can affect downstream population genomic analyses when heterozygous sites are falsely considered homozygous for the reference allele. In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences. Ancient DNA studies usually operate with low sequencing coverage, and fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50 bp, which reduces the number of accepted mismatches and increases the probability of multiple matching sites in the genome. These properties specific to ancient DNA may exacerbate the impact of reference bias on downstream analyses, especially since most studies of ancient human populations use pseudo-haploid data, i.e. they randomly sample only one sequencing read per site. We show that reference bias is pervasive in published ancient DNA sequence data of prehistoric humans, with some differences between individual genomic regions. We illustrate that the strength of reference bias is negatively correlated with fragment length. Most genomic regions we investigated show little to no mapping bias, but even a small proportion of biased sites can impact analyses of those particular loci or slightly skew genome-wide estimates. Reference bias therefore has the potential to cause minor but significant differences in the results of downstream analyses such as population allele sharing, heterozygosity estimates and estimates of archaic ancestry. These spurious results highlight how important it is to be aware of such technical artifacts and the need for strategies to mitigate their effect. To this end, we suggest some post-mapping filtering strategies that substantially reduce the impact of reference bias.
Highlights
The possibility to sequence whole genomes in a cost-efficient way has revolutionized how we do genetic and population genetic research.
Mapping next-generation sequencing reads to a single linear reference genome comes with the inherent problem that alleles not found in the reference sequence will achieve lower mapping scores. This reference bias can cause heterozygous sites to be falsely called as homozygous, which affects downstream analysis of the data. We investigate this issue in published ancient DNA data from human populations and find that reference bias is a pervasive phenomenon across data sets.
We restrict our analysis to known biallelic single nucleotide polymorphisms (SNPs), as most population genomic analyses use SNPs and the allele frequencies at those positions.
Summary
The possibility to sequence whole genomes in a cost-efficient way has revolutionized how we do genetic and population genetic research. Resequencing studies usually align the sequences of all studied individuals to a linear haploid reference sequence originating from a single individual or a mosaic of several individuals. At each site, this haploid sequence represents only a single allele out of the entire genetic variation of the species. Sequencing reads carrying an alternative allele will naturally have mismatches in the alignment to the reference genome and therefore lower mapping scores than reads carrying the same allele as the reference. This effect increases with genetic distance from the reference genome, which is of particular interest when using a reference genome from a related species for mapping [1,2,3]. Reference bias can influence variant calling by missing alternative alleles or by wrongly calling heterozygous sites as homozygous for the reference allele [4,5], which is known to influence estimates of heterozygosity and allele frequencies [6,7,8].
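To make the mechanism concrete, the following is a minimal illustrative sketch, not the analysis pipeline of this study. It assumes a truly heterozygous site at which each read carries the reference or the alternative allele with equal probability, and it assigns each allele a hypothetical probability of mapping successfully (`p_map_ref`, `p_map_alt`). The function names, parameters and the example values are all assumptions introduced for illustration. Under these assumptions, a pseudo-haploid call (one randomly sampled mapped read) is expected to show the reference allele with probability `p_map_ref / (p_map_ref + p_map_alt)` rather than 0.5.

```python
import random


def expected_ref_fraction(p_map_ref: float, p_map_alt: float) -> float:
    """Expected share of reference alleles among mapped reads at a
    heterozygous site, assuming both alleles are sequenced equally often."""
    return p_map_ref / (p_map_ref + p_map_alt)


def simulate_pseudo_haploid(n_sites: int, coverage: int,
                            p_map_ref: float, p_map_alt: float,
                            seed: int = 0) -> float:
    """Simulate pseudo-haploid calls at heterozygous sites and return the
    fraction of called sites at which the sampled read is the reference allele.

    All parameters are hypothetical toy values, not estimates from real data.
    """
    rng = random.Random(seed)
    ref_calls = 0
    called_sites = 0
    for _ in range(n_sites):
        mapped = []
        for _ in range(coverage):
            # Each read carries either allele with probability 0.5.
            allele = "ref" if rng.random() < 0.5 else "alt"
            p_map = p_map_ref if allele == "ref" else p_map_alt
            # The read is retained only if it maps successfully.
            if rng.random() < p_map:
                mapped.append(allele)
        if not mapped:
            continue  # no mapped read, so the site cannot be called
        called_sites += 1
        # Pseudo-haploid call: sample exactly one mapped read at random.
        if rng.choice(mapped) == "ref":
            ref_calls += 1
    if called_sites == 0:
        raise ValueError("no site had a mapped read")
    return ref_calls / called_sites


if __name__ == "__main__":
    # Hypothetical example: alternative-allele reads map 10% less often.
    print(expected_ref_fraction(1.0, 0.9))
    print(simulate_pseudo_haploid(20000, 2, 1.0, 0.9))
```

In this toy model, even a modest difference in mapping success between the two alleles shifts the expected reference share away from 0.5, which is the kind of deviation at heterozygous sites that the summary above describes.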