Abstract

The 1000 Genomes Project (1000G) is one of the most widely used whole genome sequencing datasets in genomics and has boosted our knowledge in medical and population genomics, among other fields. Recent studies have reported the presence of ghost mutation signals in the 1000G. Furthermore, studies have shown that these mutations can influence the outcomes of follow-up studies based on the genetic variation of 1000G, such as single nucleotide variant (SNV) imputation. While the overall effect of these ghost mutations can be considered negligible for common genetic variants in many populations, the potential bias remains unclear for low-frequency genetic variants. In this study, we analyze the effect of the sequencing center on predicted loss of function (LoF) alleles, the number of singletons, and the patterns of archaic introgression in the 1000G. Our results support previous studies showing that the sequencing center is associated with the number of LoF alleles and singletons, independent of the population considered. Furthermore, we observed that patterns of archaic introgression were distorted for some populations depending on the sequencing center. When analyzing the frequency of SNPs showing extreme patterns of genotype differentiation among centers for CEU, YRI, CHB, and JPT, we observed that the magnitude of the sequencing batch effect was stronger at MAF < 0.2 and showed different profiles between CHB and the other populations. Together, these results suggest that data from 1000G must be interpreted with caution when considering statistics based on low-frequency variants.
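One way to picture "genotype differentiation among centers" at a single SNP is a contingency test on genotype counts per sequencing center within one population. The sketch below is an illustration under assumptions, not the method used in the study: the function names, the use of a Pearson chi-square statistic, and the center labels are all hypothetical.

```python
from collections import defaultdict

def genotype_table(genotypes, centers):
    """Build a centers x genotypes count table for one SNP.

    genotypes: list of alternate-allele counts (0, 1 or 2), one per sample.
    centers:   list of sequencing-center labels, same order as genotypes
               (labels are hypothetical placeholders).
    Returns (sorted center labels, list of [n_0, n_1, n_2] rows).
    """
    counts = defaultdict(lambda: [0, 0, 0])
    for g, c in zip(genotypes, centers):
        counts[c][g] += 1
    labels = sorted(counts)
    return labels, [counts[c] for c in labels]

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip genotype classes absent from all centers
                stat += (observed - expected) ** 2 / expected
    return stat
```

A large statistic for samples drawn from the same population would flag a SNP whose genotype calls differ by center; in practice the statistic would be compared against a null distribution, which this sketch does not cover.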

Highlights

  • The 1000 Genomes Project (1000G) [1] corresponds to the first attempt to characterize the worldwide genetic variation in humans

  • To understand the putative impact of batch effects on statistics that focus on rare events, we studied the number of loss of function (LoF) alleles in each of the 1000G individuals

  • In this study we examined to what extent the sequencing center, as reported in the 1000G spreadsheet, could influence population genomics statistics that quantify low-frequency variants in the human genome
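A per-individual count of rare events, such as singletons, can be summarized by population and sequencing center to look for a batch effect. The following is a minimal sketch assuming genotypes are already available as alternate-allele counts per sample; the data layout, function names, and center labels are hypothetical, and a singleton is taken here to mean an alternate allele observed exactly once across all samples.

```python
from collections import defaultdict
from statistics import mean

def singleton_counts(genotypes):
    """Count singletons carried by each sample.

    genotypes: dict mapping sample ID -> list of alternate-allele counts
               (0, 1 or 2), one entry per variant, same variant order for all.
    """
    samples = list(genotypes)
    n_variants = len(genotypes[samples[0]])
    counts = {s: 0 for s in samples}
    for j in range(n_variants):
        total = sum(genotypes[s][j] for s in samples)
        if total == 1:  # allele seen exactly once in the whole panel
            carrier = next(s for s in samples if genotypes[s][j] == 1)
            counts[carrier] += 1
    return counts

def mean_by_group(counts, metadata):
    """Average per-sample counts within each (population, center) group.

    metadata: dict mapping sample ID -> (population, center).
    """
    groups = defaultdict(list)
    for sample, value in counts.items():
        groups[metadata[sample]].append(value)
    return {group: mean(values) for group, values in groups.items()}
```

Comparing group means within the same population, rather than across populations, keeps genuine population differences from being mistaken for center effects.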

Introduction

The 1000 Genomes Project (1000G) [1] corresponds to the first attempt to characterize the worldwide genetic variation in humans. The project was created to generate accurate haplotype information across different human populations. It started with 15 populations in the pilot phase and had a total of 26 populations by the end of Phase 3, when the project was concluded. The project initially characterized a total of 1092 samples (not evenly distributed across the different populations) and, by the end of Phase 3, it included a total of 2504 samples (with a close to even distribution across all populations). A complication is that populations were not divided evenly across the sequencing centers.
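The uneven split of populations across sequencing centers can be checked with a simple cross-tabulation of sample metadata. This is a sketch under assumptions: it takes (population, center) pairs already extracted from the sample spreadsheet, and the function names and center labels are hypothetical.

```python
from collections import Counter

def center_by_population(records):
    """Count samples per (population, center) pair.

    records: iterable of (population, center) tuples, one per sample.
    """
    return Counter(records)

def center_shares(table, population):
    """Fraction of a population's samples sequenced at each center."""
    total = sum(n for (pop, _), n in table.items() if pop == population)
    return {
        center: n / total
        for (pop, center), n in table.items()
        if pop == population
    }
```

If a population is sequenced almost entirely at one center, center and population effects become hard to separate for that population.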
