Abstract

Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of large-scale sequencing data. We present a novel graph definition, the Hamming-Shifting graph, to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines and aims to link all pairs of distinct reads that have a small Hamming distance, a small shifting offset, or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the lightest-weight edges, and we prove that these edges are detected with very high probability. The resulting graph creates a full mutual reference of the reads, which cascades a code-minimized transfer of every child read for optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph and achieved a 10 to 30% greater file size reduction than the best compression results of existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.

Highlights

  • We introduce a heuristic technique that detects the lightest-weight edges through multiple minimizers computed from each read, and we search the minimum spanning trees and forests of the Hamming-Shifting graph for high-performance compression of the reads

  • High-throughput next-generation short-reads sequencing machines have technological characteristics that can be translated into good graph definitions to understand the connectivity of genomic sequences [1, 2]

  • The sequencing machines are not perfect and sometimes make minor mistakes in nucleotide base-calling [4, 5]. As a result, some duplicate molecular inserts are read as different sequence strings, all of which are stored in the digital file. We translate this low rate of sequencing errors into an edge definition: two reads can be connected if they mismatch at only a few base positions, reflecting the fact that the two reads should be identical but contain small sequencing errors
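The minimizer-based edge search described in the highlights can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the k-mer length `k`, the number of minimal k-mers `m`, and the threshold `max_mismatches` are all assumptions chosen for the example, and the edge weight is taken here to be the plain Hamming distance.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x != y for x, y in zip(a, b))


def smallest_kmers(read, k, m):
    """Return the m lexicographically smallest distinct k-mers of a read."""
    kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    return sorted(kmers)[:m]


def candidate_edges(reads, k=4, m=2, max_mismatches=2):
    """Index reads by their minimal k-mers, then test only reads that share a bucket.

    Hypothetical parameters: k, m and max_mismatches are illustrative defaults.
    Returns a dict mapping (i, j) index pairs to their Hamming distance.
    """
    buckets = defaultdict(set)
    for idx, read in enumerate(reads):
        for kmer in smallest_kmers(read, k, m):
            buckets[kmer].add(idx)

    edges = {}
    for members in buckets.values():
        for i, j in combinations(sorted(members), 2):
            if len(reads[i]) != len(reads[j]):
                continue  # this sketch only compares equal-length reads
            d = hamming(reads[i], reads[j])
            if 0 < d <= max_mismatches:  # distinct reads with few mismatches
                edges[(i, j)] = d
    return edges
```

Bucketing by shared minimal k-mers avoids comparing every pair of reads: two reads that differ at only a few positions are likely, though not guaranteed, to share at least one of their smallest k-mers, which is why several minimizers per read are used rather than one.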


Summary

Introduction

High-throughput next-generation short-read sequencing machines have technological characteristics that can be translated into good graph definitions to understand the connectivity of genomic sequences [1, 2]. A primary characteristic is the multi-coverage in-depth sequencing of whole DNA or RNA molecules, including repetitive genome regions, which is prone to producing duplicate reads [3]. We translate this characteristic into a node definition for the graph of genomic reads: a read having w duplicates is defined as a node labeled with the number w. The sequencing machines are not perfect and sometimes make minor mistakes in nucleotide base-calling [4, 5]. As a result, some of these duplicate molecular inserts are read as different sequence strings, all of which are stored in the digital file. Minor mismatches in the overlaps are permitted because sequencing errors can happen randomly within and across reads.
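The node definition and the tolerance for mismatches in overlaps can be illustrated with a short Python sketch. It rests on assumptions not stated in the text above: the helper names, the convention that a shift of `offset` aligns the start of the second read with position `offset` of the first, and the parameters `max_offset` and `max_mismatches` are all hypothetical.

```python
from collections import Counter


def build_nodes(reads):
    """Collapse identical reads: each distinct read is a node labeled with its count w."""
    return dict(Counter(reads))


def shifted_mismatches(a, b, offset):
    """Count mismatches in the overlap when read b starts `offset` bases after read a."""
    if not 0 <= offset < len(a):
        raise ValueError("offset must fall inside read a")
    # zip stops at the shorter sequence, so only the overlapping bases are compared
    return sum(x != y for x, y in zip(a[offset:], b))


def best_shift(a, b, max_offset, max_mismatches):
    """Return (offset, mismatches) for the smallest acceptable shift, or None.

    Assumed convention: offsets 0..max_offset are tried in increasing order and
    the first one whose overlap has at most max_mismatches mismatches wins.
    """
    for offset in range(min(max_offset, len(a) - 1) + 1):
        d = shifted_mismatches(a, b, offset)
        if d <= max_mismatches:
            return offset, d
    return None
```

For example, `build_nodes(["ACGT", "ACGT", "ACGA"])` yields one node with w = 2 and one with w = 1, and the two distinct reads could then be linked because they differ at a single base.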

