One Size Doesn't Fit All - RefEditor: Building Personalized Diploid Reference Genome to Improve Read Mapping and Genotype Calling in Next Generation Sequencing Studies

Shuai Yuan, Yi-Juan Hu, Zhaohui S Qin, Yun Li, H Richard Johnston, Guosheng Zhang, Paul P Gardner

doi:10.1371/journal.pcbi.1004448


Open Access



Abstract

With the rapid decline in sequencing cost, researchers are embracing whole genome sequencing (WGS) and whole exome sequencing (WES) as the next powerful tools for relating genetic variants to human diseases and phenotypes. A fundamental step in analyzing WGS and WES data is mapping short sequencing reads back to the reference genome. This step matters because incorrectly mapped reads affect downstream variant discovery, genotype calling and association analysis. Although many read mapping algorithms have been developed, the majority of them use the universal reference genome and do not take sequence variants into consideration. Given that genetic variants are ubiquitous, it is highly desirable to factor them into the read mapping procedure. In this work, we developed a novel strategy that utilizes genotypes obtained a priori to customize the universal haploid reference genome into a personalized diploid reference genome. The new strategy is implemented in a program named RefEditor. When applying RefEditor to real data, we achieved encouraging improvements in read mapping, variant discovery and genotype calling. Compared to standard approaches, RefEditor can significantly increase genotype calling consistency (from 43% to 61% at 4X coverage; from 82% to 92% at 20X coverage) and reduce Mendelian inconsistency across various sequencing depths. Because many WGS and WES studies are conducted on cohorts that have been genotyped using array-based genotyping platforms previously or concurrently, we believe the proposed strategy will be of high value in practice. It can also be applied to the scenario where multiple NGS experiments are conducted on the same cohort. The RefEditor sources are available at https://github.com/superyuan/refeditor.

Highlights

Mapping short reads onto the reference genome is a fundamental step in analyzing next generation sequencing (NGS) data and has been an area of intensive research in recent years
Despite the vast differences in algorithms and indexing methods, almost all of the existing read-mapping programs rely on the universal haploid reference genome—the National Center for Biotechnology Information (NCBI) human reference genome [20], which was derived from a small number of anonymous donors
The sequencing read (ID: SRR005197.10106228) containing the alternative allele G at that locus can be successfully mapped to the personalized diploid reference genome with two mismatches


Introduction

Mapping short reads onto the reference genome is a fundamental step in analyzing next generation sequencing (NGS) data and has been an area of intensive research in recent years. A wealth of successful software programs for mapping short reads, such as MAQ [1], SOAP [2], SOAP2 [3], BOWTIE [4], BOWTIE2 [5], BWA [6], BFAST [7], mrFAST [8], mrsFAST [9], NovoAlign (http://novocraft.com), SHRiMP [10], and STAR [11], have been developed and have enjoyed widespread use in many different NGS applications (e.g., whole genome sequencing (WGS) [12], whole exome sequencing (WES) [13], Chromatin Immunoprecipitation sequencing (ChIP-seq) [14,15,16] and transcriptome sequencing or RNA-seq [17]). The human genome is diploid, and each individual possesses a unique set of genetic variants at millions of loci that distinguish him or her from others. Such widespread genetic variants, compounded with non-ignorable sequencing errors and short read length, cause a large proportion of reads to be unmapped or mapped to incorrect genomic locations. These mapping artifacts sometimes lead to misinterpretation of NGS experimental results, such as overstating the incidence of allele-specific expression [21,22,23,24,25] and distorting regulatory element identification at heterozygous variants [22,26,27].
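To make the effect of variants on mapping concrete, the following minimal Python sketch counts mismatches for one read against a universal haploid reference and against a personalized diploid reference built from known genotypes. It is an illustration only, not RefEditor's implementation: the function names (`personalize`, `mismatches`, `best_mismatches`), the toy sequences, and the genotypes are all hypothetical, the read is compared at a fixed position instead of being aligned, and alleles are assigned to the two haplotypes arbitrarily because unphased genotypes carry no phase information.

```python
from typing import Dict, Iterable, Tuple


def personalize(reference: str,
                genotypes: Dict[int, Tuple[str, str]]) -> Tuple[str, str]:
    """Return two haplotype sequences with known SNP alleles substituted in.

    `genotypes` maps a 0-based position to an (allele_1, allele_2) pair.
    Assumption: allele order within the pair decides the haplotype, which
    is arbitrary for unphased genotypes.
    """
    hap1, hap2 = list(reference), list(reference)
    for pos, (allele_1, allele_2) in genotypes.items():
        hap1[pos] = allele_1
        hap2[pos] = allele_2
    return "".join(hap1), "".join(hap2)


def mismatches(read: str, target: str, start: int) -> int:
    """Count mismatching bases when `read` is placed at `start` on `target`."""
    window = target[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the target")
    return sum(r != t for r, t in zip(read, window))


def best_mismatches(read: str, haplotypes: Iterable[str], start: int) -> int:
    """Smallest mismatch count over the haplotypes of a diploid reference."""
    return min(mismatches(read, hap, start) for hap in haplotypes)


# Toy universal reference (hypothetical sequence).
reference = "ACGTACGTACGTACGT"

# Hypothetical array genotypes: heterozygous C/G at position 5,
# homozygous alternative T/T at position 10 (reference base is G).
genotypes = {5: ("C", "G"), 10: ("T", "T")}

hap1, hap2 = personalize(reference, genotypes)

# A read sampled from the second haplotype at position 3, carrying one
# sequencing error at its last base.
read = "TAGGTACTTC"

against_universal = mismatches(read, reference, 3)            # 3
against_personal = best_mismatches(read, (hap1, hap2), 3)     # 1
print(against_universal, against_personal)
```

In this toy case the read shows three mismatches against the universal reference but only one, the sequencing error itself, against the personalized diploid reference. A mapper with a fixed mismatch budget would therefore be more likely to place the read correctly in the second case.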
