Abstract

Massively parallel sequencing is an efficient, well-known method that can obtain sequence data from multiple individual samples. To ensure that sequencing, replication, and oligonucleotide synthesis errors do not leave tags (or barcodes) unrecoverable or confused with one another, the tag sequences should be abundant and sufficiently different from each other. Recently, many design methods based on error-correcting codes have been proposed for correcting such errors. Because existing tag sets are small, in this study we used a modified genetic algorithm to improve the lower bound on the size of the tag sets. Compared with previous methods, our algorithm is more effective at designing sets of DNA tags. Moreover, existing methods control the GC content only within an imprecise range. We therefore improved the GC content determination method to obtain tag sets whose GC content is controlled within a more precise range. Finally, previous studies have considered only perfect self-complementarity. We therefore also considered the crossover between different tags and introduced an improved constraint into the design of tag sets.
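The two per-tag constraints named above, GC content and perfect self-complementarity, can be illustrated with a minimal Python sketch. It is not the authors' method: the function names and the 40-60% GC window are illustrative assumptions, and the check covers only perfect self-complementarity (a tag equal to its own reverse complement), not the improved constraint on crossover between different tags.

```python
# Watson-Crick complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(tag: str) -> float:
    """Return the fraction of bases in the tag that are G or C."""
    return sum(base in "GC" for base in tag) / len(tag)


def reverse_complement(tag: str) -> str:
    """Return the reverse complement of the tag."""
    return tag.translate(COMPLEMENT)[::-1]


def is_self_complementary(tag: str) -> bool:
    """True if the tag equals its own reverse complement."""
    return tag == reverse_complement(tag)


def passes_filters(tag: str, gc_min: float = 0.4, gc_max: float = 0.6) -> bool:
    """Hypothetical filter: GC content inside [gc_min, gc_max] (assumed
    thresholds) and no perfect self-complementarity."""
    return gc_min <= gc_content(tag) <= gc_max and not is_self_complementary(tag)
```

For example, `passes_filters("AACG")` accepts the tag (GC content 0.5, reverse complement `CGTT`), whereas `ACGT` is rejected because it is its own reverse complement.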

Highlights

  • In a single run, hundreds of millions of short reads can be produced by next-generation sequencing instruments, and this output rate will soon increase to billions of reads

  • We propose the use of a modified genetic algorithm to improve the lower bound of tag sets based on the edit distance, which is more effective for designing sets of DNA tags compared with previous methods

  • Our novel method uses a modified genetic algorithm (GA) to design DNA tag sets based on combinatorial constraints
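To make the edit-distance constraint in the highlights concrete, here is a minimal Python sketch of the standard Levenshtein distance and of the minimum pairwise distance of a candidate tag set. It shows only how such a constraint might be evaluated; it is not the modified genetic algorithm itself, and the function names are illustrative assumptions.

```python
from itertools import combinations
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-base insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, base_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, base_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                        # delete base_a
                curr[j - 1] + 1,                    # insert base_b
                prev[j - 1] + (base_a != base_b),   # substitute or match
            ))
        prev = curr
    return prev[-1]


def min_pairwise_distance(tags: Sequence[str]) -> int:
    """Smallest edit distance between any two distinct tags in the set
    (requires at least two tags)."""
    return min(edit_distance(a, b) for a, b in combinations(tags, 2))
```

A candidate set would satisfy a distance threshold `d` (a hypothetical parameter) when `min_pairwise_distance(tags) >= d`. For instance, `min_pairwise_distance(["AAAA", "AATT", "TTTT"])` evaluates to 2.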

Introduction

Hundreds of millions of short reads can be produced by next-generation sequencing instruments, and this output rate will soon increase to billions of reads. Next-generation sequencing is a very powerful method when relatively small DNA fragments need to be sequenced from a large number of samples. This approach requires specific sequence tags that allow any sequence in a mixture to be detected, identified, and assigned back to its original sample [1,2,3,4,5,6,7,8,9]. As the number of multiplexed samples increases, it becomes more likely that sequencing errors in the barcodes will prevent the definitive assignment of a sequencing read to a sample, which may result in the loss of data or the transformation of one tag into another, both o
