Fuzzysplit: demultiplexing and trimming sequenced DNA with a declarative language.

Daniel Liu

doi:10.7717/peerj.7170

Abstract

Next-generation sequencing technologies create large, multiplexed DNA sequences that require preprocessing before any further analysis. Part of this preprocessing includes demultiplexing and trimming sequences. Although many existing tools can handle these preprocessing steps, they cannot be easily extended to new sequence schematics when new pipelines are developed. We present Fuzzysplit, a tool that relies on a simple declarative language to describe the schematics of sequences, which makes it readily adaptable to different use cases. In this paper, we explain the matching algorithms behind Fuzzysplit and provide a preliminary comparison of its performance with other well-established tools. Overall, we find that its matching accuracy is comparable to that of previous tools.

Highlights

Advances in next-generation DNA sequencing technology allow large quantities of multiplexed DNA to be sequenced
Many methods, including the Genotyping by Sequencing (GBS) strategy (Elshire et al., 2011), require sequenced DNA to be preprocessed before further analysis
When demultiplexing, reads of DNA are split into different files according to the barcode matched in the DNA sequence
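
As a minimal illustration of demultiplexing, the sketch below groups reads by the barcode found at the start of each read and trims that barcode off. It is not Fuzzysplit's implementation: the function names, the `max_mismatches` parameter, the use of Hamming distance as the fuzzy-matching criterion, and the in-memory bins (standing in for output files) are all assumptions made for illustration.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, barcodes, max_mismatches=1):
    """Group reads by the barcode that best matches the start of each read.

    `barcodes` maps a sample name to its barcode sequence. Reads that match
    no barcode within `max_mismatches` are placed in the "unmatched" bin.
    Hamming distance is an illustrative assumption, not the tool's method.
    """
    bins = defaultdict(list)
    for read in reads:
        best_name, best_dist = "unmatched", max_mismatches + 1
        for name, barcode in barcodes.items():
            prefix = read[:len(barcode)]
            if len(prefix) < len(barcode):
                continue  # read is too short to contain this barcode
            dist = hamming(prefix, barcode)
            if dist < best_dist:
                best_name, best_dist = name, dist
        if best_name == "unmatched":
            bins[best_name].append(read)
        else:
            # Trim the matched barcode off the front of the read.
            bins[best_name].append(read[len(barcodes[best_name]):])
    return dict(bins)


barcodes = {"s1": "ACGT", "s2": "TTAA"}
reads = ["ACGTGGGG", "TTAACCCC", "ACGAGGCC", "GGGGGGGG"]
result = demultiplex(reads, barcodes)
```

In a real pipeline each bin would be written to its own output file rather than kept in a dictionary.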

Summary

INTRODUCTION

Advances in next-generation DNA sequencing technology allow large quantities of multiplexed DNA to be sequenced.

Fuzzysplit uses an overarching greedy algorithm that matches a list of arbitrary patterns $P_1 \ldots P_{|P|}$ from one line of the template file against the corresponding line of input text $T_1 \ldots T_{|T|}$. It partitions the list of patterns into contiguous chunks of either fixed-length patterns (both fuzzy patterns and fixed-length wildcard patterns) or interval-length patterns. If valid matches are found for both the interval-length chunk and the ...

Algorithm 1: Matching one line of input text with its corresponding patterns from the template file.

The main thread may block on a semaphore to wait for the worker threads, which limits the number of reads held in memory at one time.
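
The partitioning of a pattern list into contiguous fixed-length and interval-length chunks can be sketched as follows. This is a minimal illustration, not the tool's code: the `Pattern` class, its fields, and the `partition` function are hypothetical names, and a pattern is assumed to be fixed-length exactly when its minimum and maximum lengths coincide.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Pattern:
    """Hypothetical pattern record; the field names are assumptions."""
    kind: str     # e.g. "fuzzy" or "wildcard"
    min_len: int  # shortest text span the pattern may match
    max_len: int  # longest text span the pattern may match

    @property
    def is_fixed_length(self):
        return self.min_len == self.max_len


def partition(patterns):
    """Split patterns into maximal contiguous chunks of the same length type.

    Returns a list of (is_fixed_length, chunk) pairs in the original order.
    """
    return [
        (fixed, list(group))
        for fixed, group in groupby(patterns, key=lambda p: p.is_fixed_length)
    ]


patterns = [
    Pattern("fuzzy", 4, 4),      # fixed-length fuzzy pattern
    Pattern("wildcard", 2, 2),   # fixed-length wildcard
    Pattern("wildcard", 0, 10),  # interval-length wildcard
    Pattern("fuzzy", 5, 5),      # fixed-length fuzzy pattern
]
chunks = partition(patterns)
```

Here the four patterns form three chunks: a fixed-length chunk of two patterns, an interval-length chunk of one, and a final fixed-length chunk of one. `itertools.groupby` only merges adjacent items with equal keys, which is what keeps the chunks contiguous.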

RESULTS

LIMITATIONS

CONCLUSION