Abstract

BackgroundSequencing technologies keep on turning cheaper and faster, thus putting a growing pressure for data structures designed to efficiently store raw data, and possibly perform analysis therein. In this view, there is a growing interest in alignment-free and reference-free variants calling methods that only make use of (suitably indexed) raw reads data.ResultsWe develop the positional clustering theory that (i) describes how the extended Burrows–Wheeler Transform (eBWT) of a collection of reads tends to cluster together bases that cover the same genome position (ii) predicts the size of such clusters, and (iii) exhibits an elegant and precise LCP array based procedure to locate such clusters in the eBWT. Based on this theory, we designed and implemented an alignment-free and reference-free SNPs calling method, and we devised a consequent SNPs calling pipeline. Experiments on both synthetic and real data show that SNPs can be detected with a simple scan of the eBWT and LCP arrays as, in accordance with our theoretical framework, they are within clusters in the eBWT of the reads. Finally, our tool intrinsically performs a reference-free evaluation of its accuracy by returning the coverage of each SNP.ConclusionsBased on the results of the experiments on synthetic and real data, we conclude that the positional clustering framework can be effectively used for the problem of identifying SNPs, and it appears to be a promising approach for calling other type of variants directly on raw sequencing data.AvailabilityThe software ebwt2snp is freely available for academic use at: https://github.com/nicolaprezza/ebwt2snp.

Highlights

  • Sequencing technologies keep on turning cheaper and faster, putting a growing pressure for data structures designed to efficiently store raw data, and possibly perform analysis therein

  • Availability: The software ebwt2snp is freely available for academic use at: https://github.com/nicolaprezza/ebwt2​ snp

  • Ebwt2snp takes as input few main parameters, among which the most important are the lengths of right and left SNPs contexts in the output (−L and −R), and (−v) the maximum number of non-isolated SNPs to seek in the left contexts

Read more

Summary

Introduction

Sequencing technologies keep on turning cheaper and faster, putting a growing pressure for data structures designed to efficiently store raw data, and possibly perform analysis therein. In this view, there is a growing interest in alignment-free and reference-free variants calling methods that only make use of (suitably indexed) raw reads data. Sequencing technologies keep on turning cheaper and faster, producing huge amounts of data that put a growing pressure on data structures designed to store raw sequencing information, as well as on efficient analysis algorithms: collections of billion of DNA fragments (reads) need to be efficiently indexed for downstream analysis. In some applications aligning reads on a reference genome presents limitations mainly due to the difficulty of mapping highly repetitive regions, especially in the event of a low-quality reference genome, not to mention the cases in which the reference genome is not even available

Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call