Somatic mosaicism has been implicated in several developmental disorders, cancers, and other diseases. Short tandem repeats (STRs) consist of repeated sequences of 1-6bp and comprise >1 million loci in the human genome. Somatic mosaicism at STRs is known to play a key role in the pathogenicity of loci implicated in repeat expansion disorders and is highly prevalent in cancers exhibiting microsatellite instability. While a variety of tools have been developed to genotype germline variation at STRs, a method for systematically identifying mosaic STRs is lacking. We introduce prancSTR, a novel method for detecting mosaic STRs from individual high-throughput sequencing datasets. prancSTR is designed to detect loci characterized by a single high-frequency mosaic allele, but can also detect loci with multiple mosaic alleles. Unlike many existing mosaicism detection methods for other variant types, prancSTR does not require a matched control sample as input. We show that prancSTR accurately identifies mosaic STRs in simulated data, demonstrate its feasibility by identifying candidate mosaic STRs in Illumina whole genome sequencing data derived from lymphoblastoid cell lines for individuals sequenced by the 1000 Genomes Project, and evaluate the use of prancSTR on Element and PacBio data. In addition to prancSTR, we present simTR, a novel simulation framework which simulates raw sequencing reads with realistic error profiles at STRs. prancSTR and simTR are freely available at https://github.com/gymrek-lab/trtools. Detailed documentation is available at https://trtools.readthedocs.io/.
Read full abstract