Abstract

High-throughput sequencing provides the means to determine the allelic decomposition for any gene of interest, that is, the number of copies and the exact sequence content of each copy of a gene. Although many clinically and functionally important genes are highly polymorphic and have undergone structural alterations, no high-throughput sequencing data analysis tool has yet been designed to effectively solve the full allelic decomposition problem. Here we introduce a combinatorial optimization framework that successfully resolves this challenging problem, including for genes with structural alterations. We provide an associated computational tool, Aldy, that performs allelic decomposition of highly polymorphic, multi-copy genes using whole-genome or targeted sequencing data. For a large, diverse sequencing data set, Aldy identifies multiple rare and novel alleles for several important pharmacogenes, significantly improving upon the accuracy and utility of current genotyping assays. As more data sets become available, we expect Aldy to become an essential component of genotyping toolkits.

Highlights

  • High-throughput sequencing provides the means to determine the allelic decomposition for any gene of interest—the number of copies and the exact sequence content of each copy of a gene

  • Available copy number alteration detection/copy number phasing tools aim to identify the number of copies of a particular gene under the implicit assumption that gene duplications or deletions always affect the entire gene of interest, but do not reconstruct the exact sequence content of the gene

  • On a large data set comprising 96 cell lines (32 family trios) sequenced with the PGRNseq v.2 protocol, 137 cell lines sequenced with the PGRNseq v.1 protocol, and 25 whole-genome sequencing (WGS) Illumina samples, we show that Aldy is able to reconstruct the sequence content of each copy of some of the most challenging genes in the human genome and identify many novel alleles, significantly improving the accuracy and utility of currently used genotyping assays

Summary

Introduction

High-throughput sequencing provides the means to determine the allelic decomposition for any gene of interest, that is, the number of copies and the exact sequence content of each copy of a gene. No existing tool addresses structural alterations that affect genes with multiple copies or genes with highly homologous pseudogenes. Such genes are algorithmically difficult to resolve because reads that originate from them have high mapping ambiguity. To reconstruct the sequence content of a structurally altered gene, one needs to (i) determine how many copies of the gene there are and which read belongs to which copy (i.e., mapping ambiguity resolution), and (ii) implicitly or explicitly assemble each copy of the gene from the read set (a task inherently intermingled with mapping ambiguity resolution) and determine its origin in the reference genome. This requires one to (a) identify all structural alteration breakpoints and carefully reconstruct the sequence content of each breakpoint region, while taking into account all micro-structural alterations, indels, and single nucleotide variants (SNVs) each copy of the gene has been subject to, and (b) identify fusions/hybridizations between the gene and its highly homologous pseudogenes. Even tools that aim to genotype a particular gene such as CYP2D6, namely Cypiripi[13] and Astrolabe (formerly Constellation)[14], are limited: Cypiripi works only on uniform-coverage sequencing data, and Astrolabe can determine the gene's sequence content only if it differs from the reference by SNVs, not by structural variation.
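
To make the combinatorial nature of the problem concrete, the following is a minimal toy sketch, not Aldy's actual algorithm. It assumes the copy number is already known and that each variant's coverage has been normalized to an estimated number of gene copies carrying it; the allele names (`REF`, `A`, `B`, `C`), the variant labels, and the `decompose` function are all hypothetical. It simply enumerates every multiset of candidate alleles and keeps the one whose summed variant counts lie closest to the observations.

```python
from itertools import combinations_with_replacement

# Hypothetical allele database: allele name -> set of variants it carries.
ALLELES = {
    "REF": frozenset(),
    "A": frozenset({"v1", "v2"}),
    "B": frozenset({"v1", "v3"}),
    "C": frozenset({"v3"}),
}

def decompose(observed, copy_number, alleles=ALLELES):
    """Brute-force search for the multiset of `copy_number` alleles whose
    per-variant copy counts best match `observed` (L1 distance).

    observed: variant -> estimated number of gene copies carrying it
              (assumed to be derived from normalized read coverage).
    """
    variants = sorted(set(observed) | {v for vs in alleles.values() for v in vs})
    best, best_err = None, float("inf")
    for combo in combinations_with_replacement(sorted(alleles), copy_number):
        err = 0.0
        for v in variants:
            expected = sum(1 for name in combo if v in alleles[name])
            err += abs(expected - observed.get(v, 0.0))
        if err < best_err:
            best, best_err = combo, err
    return best, best_err

# Toy input: three gene copies, noisy per-variant copy estimates.
observed = {"v1": 2.1, "v2": 0.9, "v3": 1.2}
result, error = decompose(observed, copy_number=3)
print(result, round(error, 2))
```

Exhaustive enumeration is only workable for a handful of alleles and copies. It also sidesteps the harder parts described above, namely estimating the copy number itself, resolving read-to-copy mapping ambiguity, and detecting fusions with pseudogenes.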

