XHap: haplotype assembly using long-distance read correlations learned by transformers

Shorya Consul, Haris Vikalo, Ziqi Ke

doi:10.1093/bioadv/vbad169

Open Access

Journal: Bioinformatics Advances
Publication Date: Jan 5, 2023
Citations: 1
License type: CC BY 4.0

Affiliation: The University of Texas at Austin

Abstract

Summary

Reconstructing haplotypes of an organism from a set of sequencing reads is a computationally challenging (NP-hard) problem. In reference-guided settings, at the core of haplotype assembly is the task of clustering reads according to their origin, i.e. grouping together reads that sample the same haplotype. Read length limitations and sequencing errors render this problem difficult even for diploids; the complexity of the problem grows with the ploidy of the organism. We present XHap, a novel method for haplotype assembly that aims to learn correlations between pairs of sequencing reads, including those that do not overlap but may be separated by large genomic distances, and utilize the learned correlations to assemble the haplotypes. This is accomplished by leveraging transformers, a powerful deep-learning technique that relies on the attention mechanism to discover dependencies between non-overlapping reads. Experiments on semi-experimental and real data demonstrate that the proposed method significantly outperforms state-of-the-art techniques in diploid and polyploid haplotype assembly tasks on both short and long sequencing reads.

Availability and implementation

The code for XHap and the included experiments is available at https://github.com/shoryaconsul/XHap.
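
To make the read-clustering view of haplotype assembly concrete, the following is a minimal sketch and not XHap's implementation. It assumes a toy diploid example in which each read is a list of alleles over six SNP positions (`None` where the read does not cover a site), and it assumes the cluster labels are already given; in XHap those groupings would come from the learned read correlations. The data and the names `reads`, `consensus`, and `mismatches` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Toy reads over 6 SNP positions: 0/1 are alleles, None means the read
# does not cover that position. All values are made up for illustration.
reads = [
    [0, 1, 0, None, None, None],
    [None, 1, 0, 0, None, None],
    [1, 0, 1, None, None, None],
    [None, None, 1, 1, 0, None],
    [None, None, None, 0, 1, 1],
    [None, None, None, 1, 0, 0],
]


def consensus(reads, labels, k):
    """Majority-vote haplotype per cluster; None where no read covers a site."""
    n_sites = len(reads[0])
    haps = []
    for c in range(k):
        hap = []
        for j in range(n_sites):
            votes = Counter(
                r[j] for r, lab in zip(reads, labels)
                if lab == c and r[j] is not None
            )
            hap.append(votes.most_common(1)[0][0] if votes else None)
        haps.append(hap)
    return haps


def mismatches(reads, labels, haps):
    """Count covered read alleles that disagree with their cluster's consensus."""
    return sum(
        1
        for r, lab in zip(reads, labels)
        for allele, h in zip(r, haps[lab])
        if allele is not None and allele != h
    )


# A grouping in which reads 0, 1, 4 share one haplotype and 2, 3, 5 the other.
good_labels = [0, 0, 1, 1, 0, 1]
good_haps = consensus(reads, good_labels, k=2)

# A deliberately poor grouping that mixes reads from both haplotypes.
bad_labels = [0, 1, 0, 1, 0, 1]
bad_haps = consensus(reads, bad_labels, k=2)

print(good_haps, mismatches(reads, good_labels, good_haps))
print(bad_haps, mismatches(reads, bad_labels, bad_haps))
```

Note that reads 0 and 4 in this toy example share no covered position, so a purely overlap-based comparison says nothing about whether they belong together; this is the kind of long-distance relationship the abstract says XHap aims to learn.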

Full Text