De novo inference of stratification and local admixture in sequencing studies

Yu Zhang

doi:10.1186/1471-2105-14-s5-s17

Abstract

Analysis of population structure and local genomic ancestry has become increasingly important in population and disease genetics. With advances in next-generation sequencing technologies, complete sets of genetic variants in individuals' genomes can be generated quickly, providing unprecedented opportunities for learning population evolutionary histories and identifying local genetic signatures at SNP resolution. The success of such studies relies critically on accurate and powerful computational tools that can fully utilize the sequencing information. Although many algorithms have been developed for population structure inference and admixture mapping, many of them work only for independent SNPs in genotype or haplotype format and require a large panel of reference individuals. In this paper, we propose a novel probabilistic method for detecting population structure and local admixture. The method accepts sequencing data, genotype data, and haplotype data as input. It characterizes the dependence among genetic variants via haplotype segmentation, so that all variants detected in a sequencing study can be used for inference. It further uses an infinite-state Bayesian Markov model to perform de novo stratification and admixture inference. Using simulated datasets from HapMapII and 1000Genomes, we show that our method outperforms several existing algorithms, particularly when limited or no reference individuals are available. Our method is applicable not only to human studies but also to studies of other species of interest for which little reference information is available. Software Availability: http://stat.psu.edu/~yuzhang/software/dbm.tar

Highlights

Recent advances in high-throughput sequencing technologies [1,2,3] have enabled genome-wide identification of genetic variants at the individual level.
The complete genetic landscape provides us with unprecedented opportunities to learn the evolutionary history of individuals and identify functional regions with phenotypic consequences at single nucleotide polymorphism (SNP) resolution.
We introduce a new method for identifying population stratification and local admixture in sequencing studies.

Summary

Introduction

Recent advances in high-throughput sequencing technologies [1,2,3] have enabled genome-wide identification of genetic variants at the individual level. Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation and the easiest to detect by sequencing. SNPs contain rich information about the evolution of individuals, and can be used as markers to pinpoint phenotype-causative loci in phenotype-ascertained samples. Sequencing technologies can detect mutations across the entire genome. The resulting complete genetic landscape provides us with unprecedented opportunities to learn the evolutionary history of individuals and identify functional regions with phenotypic consequences at SNP resolution. The complexity and scale of sequencing data, however, impose new computational and statistical challenges that require the development of new methodologies.

Methods

Results

Conclusion