Large sets of edit-metric sequence identification tags to facilitate large-scale multiplexing of reads from massively parallel sequencing

Brant Faircloth, Travis Glenn

doi:10.1038/npre.2011.5672.1

Open Access

https://doi.org/10.1038/npre.2011.5672.1

Abstract

Background: Massively parallel DNA sequencing technologies provide exponential increases in the amount of data returned from the sequencing process, relative to traditional (Sanger-based) techniques. Attaching unique, synthetic oligonucleotides (identifying sequence tags) to each sample enables the deconvolution of samples pooled prior to massively parallel sequencing. To counteract oligonucleotide synthesis and sequencing errors, sequence tags should be drawn from combinations with sufficient differences and with an appropriate error-correcting code over the alphabet [A,C,G,T]. This approach ensures that errors within the tags do not cause sequences to be assigned to the wrong sample, while also enabling correction and recovery of incorrectly sequenced or incorrectly synthesized tags. The set of available tags should be large, allowing sample multiplexing to scale with rapid changes in sequencing platform output. The set of tags should also account for the errors possible during both oligonucleotide synthesis and sequencing. Most tags in current use have been designed to maintain a particular Hamming distance, a scheme that guarantees a minimum distance between tag sequences only in the presence of substitutions. Sequence identification tags conforming to the edit metric are more appropriate: edit-metric sequence tags are robust to insertion, deletion, and substitution errors.

Results: We present edittag, a Python package containing several tools to facilitate the design of edit-metric-based sequence identification tags, check existing sets of sequence identification tags for conformance to the edit metric, and apply sequence identification tags to primers and/or adapters. We use edittag to design several large sets of edit-metric sequence tags ranging from four to 10 nucleotides in length, with edit distances of three to nine. Finally, we test a set of fusion primers designed with the software developed here, demonstrating high levels of successful amplification.

Conclusions: Researchers using sequence identification tags should consider using edit-metric-based sequence tags in place of the more common, Hamming-distance-based alternatives. Edit-metric sequence tags are robust to insertion, deletion, and substitution errors, and are thus more robust to all forms of error present during the DNA sequencing process. Edittag facilitates the generation and application of these robust sequence tags.
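The difference between the two metrics can be illustrated with a minimal sketch. It assumes that the edit metric corresponds to the standard Levenshtein distance (the minimum number of insertions, deletions, and substitutions). The function names `hamming_distance`, `edit_distance`, and `conforms`, and the two example tags, are hypothetical and chosen for illustration; they are not the edittag API.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length tags."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def conforms(tags, min_distance, metric=edit_distance) -> bool:
    """Return True if every pair of tags is at least min_distance apart."""
    return all(metric(a, b) >= min_distance for a, b in combinations(tags, 2))


# Hypothetical pair of tags: the second is the first shifted by one base,
# as would happen after a single deletion followed by a single insertion.
tags = ["ACGTAC", "CGTACA"]

print(hamming_distance(*tags))                 # 6: every position differs
print(edit_distance(*tags))                    # 2: one deletion, one insertion
print(conforms(tags, 3, hamming_distance))     # True
print(conforms(tags, 3, edit_distance))        # False
```

In this example the pair looks maximally separated under the Hamming distance, yet only two indel errors convert one tag into the other. A set checked only against the Hamming distance can therefore fail the same threshold under the edit metric.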

Highlights

Figure panels: 1% error, 1 M reads, for sequence tags of 6, 7, 8, 9, and 10 nt.

Summary

Introduction

Results

Conclusion