Allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes

Milan Radovich, Ibrahim Numanagić, Todd C Skaar, Xiang Qin, Lorraine Toji, Steve Scherer, Bonnie Berger, Michael Ford, S Cenk Sahinalp, Salem Malikić, Victoria M Pratt

doi:10.1038/s41467-018-03273-1

Abstract

High-throughput sequencing provides the means to determine the allelic decomposition for any gene of interest, that is, the number of copies and the exact sequence content of each copy of a gene. Although many clinically and functionally important genes are highly polymorphic and have undergone structural alterations, no high-throughput sequencing data analysis tool has yet been designed to effectively solve the full allelic decomposition problem. Here we introduce a combinatorial optimization framework that successfully resolves this challenging problem, including for genes with structural alterations. We provide an associated computational tool, Aldy, that performs allelic decomposition of highly polymorphic, multi-copy genes using whole-genome or targeted sequencing data. For a large, diverse sequencing data set, Aldy identifies multiple rare and novel alleles for several important pharmacogenes, significantly improving upon the accuracy and utility of current genotyping assays. As more data sets become available, we expect Aldy to become an essential component of genotyping toolkits.

Highlights

High-throughput sequencing provides the means to determine the allelic decomposition for any gene of interest—the number of copies and the exact sequence content of each copy of a gene
Available copy number alteration detection/copy number phasing tools aim to identify the number of copies of a particular gene under the implicit assumption that gene duplications or deletions always affect the entire gene of interest, but do not reconstruct the exact sequence content of the gene
On a large data set comprising 96 cell lines (32 family trios) sequenced with the PGRNseq v.2 protocol, 137 cell lines sequenced with the PGRNseq v.1 protocol, and 25 Illumina whole-genome sequencing (WGS) samples, we show that Aldy is able to reconstruct the sequence content of each copy of some of the most challenging genes in the human genome and identify many novel alleles, significantly improving the accuracy and utility of currently used genotyping assays

Summary

Introduction

High-throughput sequencing provides the means to determine the allelic decomposition for any gene of interest, that is, the number of copies and the exact sequence content of each copy of a gene. No existing tool aims to resolve structural alterations that affect genes with multiple copies or genes with highly homologous pseudogenes. Such genes are algorithmically difficult to resolve because reads that originate from them have high mapping ambiguity. To reconstruct the sequence content of a structurally altered gene, one needs to (i) find out how many copies of the gene there are and which read belongs to which copy (i.e., mapping ambiguity resolution), and (ii) implicitly or explicitly assemble each copy of the gene from the read set (a step inherently intertwined with mapping ambiguity resolution) and find out its origins in the reference genome. This requires one to (a) identify all structural alteration breakpoints and carefully reconstruct the sequence content of each breakpoint region, while taking into account all micro-structural alterations, indels, and single nucleotide variants (SNVs) that each copy of the gene has been subject to, and (b) identify fusions/hybridizations between the gene and its highly homologous pseudogenes. Even the tools that aim to genotype a particular gene such as CYP2D6, namely Cypiripi[13] and Astrolabe (formerly Constellation)[14], are limited: Cypiripi works only on uniform-coverage sequencing data, and Astrolabe can determine the gene's sequence content only if it differs from the reference by SNVs but not by structural variation.
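To make the notion of allelic decomposition concrete, the following is a minimal toy sketch, not Aldy's actual method. It assumes that the gene copy number is already known, that a small database of candidate alleles is available, and that the data have been summarized as an estimated number of gene copies carrying each variant. It then enumerates every multiset of alleles of the given size and keeps those whose expected variant counts are closest to the observations. The allele names (REF, A1, A2, A3), the variant identifiers (v1, v2, v3), and the function `decompose` are hypothetical and chosen only for illustration; structural alterations, gene/pseudogene fusions, and read-level mapping ambiguity are deliberately left out.

```python
from itertools import combinations_with_replacement

# Hypothetical allele database: allele name -> variants carried by that allele.
# These definitions are invented for illustration and match no real gene.
ALLELES = {
    "REF": frozenset(),
    "A1": frozenset({"v1", "v2"}),
    "A2": frozenset({"v2", "v3"}),
    "A3": frozenset({"v1"}),
}


def decompose(observed, copy_number, alleles=ALLELES):
    """Brute-force toy decomposition.

    observed    -- dict: variant -> estimated number of gene copies carrying it
    copy_number -- total number of gene copies assumed to be present
    Returns (error, combos): the smallest L1 error and every allele multiset
    of size `copy_number` that attains it.
    """
    variants = set(observed) | {v for carried in alleles.values() for v in carried}
    best_err, best = None, []
    for combo in combinations_with_replacement(sorted(alleles), copy_number):
        # Expected copies carrying each variant under this candidate multiset,
        # compared against the observed estimate.
        err = sum(
            abs(sum(v in alleles[a] for a in combo) - observed.get(v, 0.0))
            for v in variants
        )
        if best_err is None or err < best_err - 1e-9:
            best_err, best = err, [combo]
        elif abs(err - best_err) <= 1e-9:
            best.append(combo)  # tie: keep all equally good decompositions
    return best_err, best


if __name__ == "__main__":
    # Three gene copies; noisy per-variant copy estimates.
    err, combos = decompose({"v1": 1.1, "v2": 1.9, "v3": 0.9}, copy_number=3)
    print(err, combos)
```

Exhaustive enumeration is workable only for a handful of alleles and copies, which is one reason a combinatorial optimization formulation is attractive for realistic allele databases. The sketch also returns all tied solutions, since different allele combinations can explain the same variant counts equally well.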
