Abstract

Short DNA oligonucleotides (~4 mer) have been used to index samples from different sources, such as in multiplex sequencing. Presently, longer oligonucleotides (8–12 mer) are being used as molecular barcodes to distinguish among raw DNA molecules in many high-tech sequence analyses, including low-frequency mutation detection, quantitative transcriptome analysis, and single-cell sequencing. Although molecular barcodes with random sequences offer some advantages, this approach makes it impossible to know the exact sequences used in an experiment and can lead to inaccurate interpretation when unexpected mutations in the barcodes cause them to be misclustered. The present study introduces a tool developed for selecting an optimal barcode subset for molecular barcoding. The program considers five barcode factors: GC content, homopolymers, simple sequence repeats with dinucleotide repeat units, Hamming distance, and complementarity between barcodes. To evaluate a selected barcode set, penalty scores for the factors are defined based on their distributions observed in random barcodes. The algorithm comprises two steps: i) random generation of an initial set and ii) optimal barcode selection via iterative replacement. Users can run the program by specifying the barcode length and the number of barcodes to be generated. For advanced use, the program also accepts user-supplied values for other parameters, including penalty scores, allowing it to be applied under various conditions. In many test runs generating 100,000 barcodes of 12 nucleotides each, the program was fast enough to produce optimal barcode sequences on a desktop PC.
We also showed that, in comparison with two other tools (DNABarcodes and FreeBarcodes), VFOS offers comparable performance, flexible program execution, consideration of simple sequence repeats, and fast computation. Owing to the versatility and speed of the program, we expect that many researchers will use it to select optimal barcode sets for their experiments, including next-generation sequencing.
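The five barcode factors and the two-step procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the program's actual implementation: the function names, the thresholds (GC content between 40% and 60%, runs of three or more, a minimum Hamming distance of 3, complementarity above half the barcode length), and the one-point-per-violation penalty are all assumptions made for illustration. In particular, the study derives its penalty scores from distributions observed in random barcodes, which this toy scoring does not attempt to reproduce, and complementarity is simplified here to an ungapped comparison against the reverse complement.

```python
import random
from itertools import groupby

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    return max(len(list(run)) for _, run in groupby(seq))


def longest_dinucleotide_repeat(seq):
    """Largest number of consecutive copies of a two-base unit (ACACAC gives 3)."""
    best = 0
    for start in range(len(seq) - 1):
        unit = seq[start:start + 2]
        if unit[0] == unit[1]:
            continue  # runs such as AAAA are handled by longest_homopolymer
        copies, pos = 1, start + 2
        while seq[pos:pos + 2] == unit:
            copies += 1
            pos += 2
        best = max(best, copies)
    return best


def hamming_distance(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))


def complementarity(a, b):
    """Positions where a matches the reverse complement of b (ungapped; a simplification)."""
    reverse_complement = b.translate(COMPLEMENT)[::-1]
    return sum(x == y for x, y in zip(a, reverse_complement))


def penalty(barcode, others):
    """Toy penalty: one point per violated rule. All thresholds are illustrative."""
    score = 0
    if not 0.4 <= gc_content(barcode) <= 0.6:
        score += 1
    if longest_homopolymer(barcode) >= 3:
        score += 1
    if longest_dinucleotide_repeat(barcode) >= 3:
        score += 1
    for other in others:
        if hamming_distance(barcode, other) < 3:
            score += 1
        if complementarity(barcode, other) > len(barcode) / 2:
            score += 1
    return score


def total_penalty(barcodes):
    """Penalty of a whole set, counting each pair of barcodes once."""
    return sum(penalty(b, barcodes[i + 1:]) for i, b in enumerate(barcodes))


def random_barcode(length, rng):
    return "".join(rng.choice(BASES) for _ in range(length))


def random_initial_set(length, count, rng):
    """Step i: draw distinct random barcodes."""
    chosen = set()
    while len(chosen) < count:
        chosen.add(random_barcode(length, rng))
    return sorted(chosen)


def refine(barcodes, rng, iterations=1000):
    """Step ii: try random replacements and keep those that lower the penalty."""
    current = list(barcodes)
    for _ in range(iterations):
        i = rng.randrange(len(current))
        others = current[:i] + current[i + 1:]
        candidate = random_barcode(len(current[i]), rng)
        if candidate in others:
            continue  # keep the set free of duplicates
        if penalty(candidate, others) < penalty(current[i], others):
            current[i] = candidate
    return current


if __name__ == "__main__":
    rng = random.Random(0)
    start = random_initial_set(12, 50, rng)
    result = refine(start, rng)
    print(total_penalty(start), "->", total_penalty(result))
```

Because a replacement is accepted only when it lowers the penalty of the replaced barcode against the rest of the set, the total penalty of the set never increases from one iteration to the next. A quadratic scan like this is only practical for small sets; it says nothing about how the actual program reaches 100,000 barcodes quickly.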

Highlights

  • DNA barcodes are oligonucleotide sequences tagged to target DNA molecules that allow researchers to identify specific molecules in an experiment, including sequencing experiments [1, 2]

  • When mode Hamming distance (HD) values were examined according to barcode length l, they appeared at approximately two-thirds of the bases for each l. The complementarity (CP) distribution followed a different pattern from the HD distribution (F), with mode values observed at approximately one-third of the bases for each l

  • When we examined percent decrease (PDEC) values of penalty scores for the individual barcode factors (PGCCt, PHPt, PSRt, PHDt, and PCPt), the median values were 89.76% for GC content (GCC), 89.99% for homopolymers (HP), 90.07% for simple sequence repeats (SR), 52.83% for Hamming distance (HD), and 66.91% for complementarity (CP)
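
As a reading aid for the PDEC figures above, the following Python sketch shows one plausible way a percent decrease and its median over several runs could be computed. The formula (initial minus final penalty, divided by the initial penalty, times 100) is an assumption, since the excerpt does not define PDEC, and the penalty values in the usage lines are made-up numbers, not results from the study.

```python
from statistics import median


def percent_decrease(initial_penalty, final_penalty):
    """Assumed definition: relative drop from the initial penalty, in percent."""
    if initial_penalty <= 0:
        raise ValueError("initial penalty must be positive")
    return 100.0 * (initial_penalty - final_penalty) / initial_penalty


def median_percent_decrease(runs):
    """Median PDEC over (initial, final) penalty pairs, one pair per run."""
    return median(percent_decrease(initial, final) for initial, final in runs)


if __name__ == "__main__":
    # Hypothetical (initial, final) penalty pairs for three runs.
    example_runs = [(200.0, 20.0), (180.0, 27.0), (220.0, 11.0)]
    print(median_percent_decrease(example_runs))
```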

Summary

Introduction

DNA barcodes are oligonucleotide sequences tagged to target DNA molecules that allow researchers to identify specific molecules in an experiment, including sequencing experiments [1, 2]. There are two general types of DNA barcodes [3]. The first type permits the identification of individual samples in a pooled mixture; short barcodes (~4 mer oligonucleotides) are frequently used for this purpose. The second type comprises molecular barcodes, also known as unique molecular identifiers, which allow consensus-based error correction by uniquely labeling individual molecules [4]. In many high-tech sequence analyses, longer barcodes (8–12 mer) of this second type are used to identify raw DNA molecules. DNA barcodes can also be characterized by their design (i.e., rationally designed or randomly produced) [5], and random barcodes are often used for molecular barcoding [2, 6, 7, 8]. Note that the barcodes discussed in this study are “in-line barcodes”, which are sequenced together with the target DNA sequences.

