Abstract

Background: Tandem repeat sequences are widespread in the human genome, and their expansions cause multiple repeat-mediated disorders. Genome-wide discovery approaches are needed to fully elucidate their roles in health and disease, but resolving tandem repeat variation accurately remains a challenging task. While traditional mapping-based approaches using short-read data have severe limitations in the size and type of tandem repeats they can resolve, recent third-generation sequencing technologies exhibit substantially higher sequencing error rates, which complicates repeat resolution.

Results: We developed TRiCoLOR, a freely available tool for tandem repeat profiling using error-prone long reads from third-generation sequencing technologies. The method can identify repetitive regions in sequencing data without prior knowledge of their motifs or locations and resolve repeat multiplicity and period size in a haplotype-specific manner. The tool includes methods to interactively visualize the identified repeats and to trace their Mendelian consistency in pedigrees.

Conclusions: TRiCoLOR demonstrates excellent performance and improved sensitivity and specificity compared with alternative tools on synthetic data. For real human whole-genome sequencing data, TRiCoLOR achieves high validation rates, suggesting its suitability to identify tandem repeat variation in personal genomes.

Highlights

  • We benchmarked TRiCoLOR using both synthetic data generated with VISOR [20] and real, publicly available data from the Human Genome Structural Variation Consortium (HGSVC) [8]

  • We simulated haplotype-resolved Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB) BAM files exhibiting variable error rates and depth of coverage, with each BAM file harboring a heterozygous contraction or expansion of a known, randomly chosen, tandem repeat (TR)

Summary

Introduction

Tandem repeat (TR) sequences are widespread in the human genome, and their expansions cause multiple repeat-mediated disorders. Most importantly, more than 40 diseases, primarily neurological, are known to be related to TR expansions [3]. Despite their clinical importance, accurately resolving TRs in sequencing data sets remains challenging, mainly because read lengths are often too short to span entire expanded repeats or because of technological limitations such as high sequencing error rates. Prior methods for TR profiling in short-read sequencing data sets can be broadly classified as reference-based [4, 5] or de novo [6, 7] approaches. The former investigate only reads spanning known TRs, whereas the latter can identify TRs regardless of whether their repeat motif is annotated in the reference.
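To make the idea of de novo TR identification concrete, the following minimal sketch scans a sequence for perfect tandem repeats without a motif catalogue and reports each repeat's start, motif (whose length is the period size) and multiplicity. It is an illustration only and not the TRiCoLOR algorithm: the function name `find_tandem_repeats` and the parameters `max_period` and `min_copies` are hypothetical, and the regular-expression approach assumes error-free sequence, so it would not cope with interrupted repeats or the error rates of raw long reads.

```python
import re


def find_tandem_repeats(seq, max_period=6, min_copies=3):
    """Return sorted (start, motif, copies) tuples for perfect tandem repeats.

    Hypothetical helper for illustration; assumes an error-free, uppercase
    DNA string and does not tolerate mismatches or indels.
    """
    hits = []
    for period in range(1, max_period + 1):
        # A motif of length `period` followed by at least
        # (min_copies - 1) further exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (period, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves built from a shorter unit
            # (for example "CAGCAG"), so each repeat is reported once
            # at its smallest period.
            if any(
                motif == motif[:p] * (period // p)
                for p in range(1, period)
                if period % p == 0
            ):
                continue
            copies = len(match.group(0)) // period
            hits.append((match.start(), motif, copies))
    return sorted(hits)


# Toy haplotypes: the second carries an expansion of the same CAG repeat.
hap1 = "TTGA" + "CAG" * 5 + "TTAC"
hap2 = "TTGA" + "CAG" * 8 + "TTAC"
repeats_hap1 = find_tandem_repeats(hap1)
repeats_hap2 = find_tandem_repeats(hap2)
```

Comparing `repeats_hap1` with `repeats_hap2` shows the same motif at the same position with a different copy number, which is the kind of haplotype-specific difference in multiplicity that a TR profiler aims to report.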

