Abstract
Motivation: Long expansions of short tandem repeats (STRs), i.e. DNA repeats of 2–6 nt, are associated with some genetic diseases. Cost-efficient high-throughput sequencing can quickly produce billions of short reads that would be useful for uncovering disease-associated STRs. However, enumerating STRs in short reads remains largely unexplored because of the difficulty in elucidating STRs much longer than 100 bp, the typical length of short reads.

Results: We propose ab initio procedures for sensing and locating long STRs promptly by using the frequency distribution of all STRs and paired-end read information. We validated the reproducibility of this method using biological replicates and used it to locate an STR associated with a brain disease (SCA31). Subsequently, we sequenced this STR site in 11 SCA31 samples using SMRT™ sequencing (Pacific Biosciences), determined 2.3–3.1 kb sequences at nucleotide resolution and revealed that (TGGAA)- and (TAAAATAGAA)-repeat expansions determined the instability of the repeat expansions associated with SCA31. Our method could also identify common STRs, (AAAG)- and (AAAAG)-repeat expansions, which are remarkably expanded at four positions in an SCA31 sample. This is the first proposed method for rapidly finding disease-associated long STRs in personal genomes using hybrid sequencing of short and long reads.

Availability and implementation: Our TRhist software is available at http://trhist.gi.k.u-tokyo.ac.jp/.

Contact: moris@cb.k.u-tokyo.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.
Highlights
Many genetic disorders are caused by or associated with short tandem repeats (STRs), repetitive elements of 2–6 nt.
We examined the frequency distributions of other well-characterized repeats, such as the (GGGTTA) repeat in telomeres (Supplementary Fig. S3C), the (CAG) repeat encoding polyglutamine stretches in protein-coding regions (La Spada et al., 1991; The Huntington’s Disease Collaborative Research Group, 1993; Walker, 2007 and Supplementary Fig. S4A), the (CCTG) repeat associated with myotonic dystrophy type 2 (DM2; Liquori et al., 2001 and Supplementary Fig. S4B) and the (ATTCT) repeat associated with spinocerebellar ataxia type 10 (SCA10; Matsuura et al., 2000 and Supplementary Fig. S4C).
We proposed a novel method for listing long approximate STRs with mutations in personal genomes using a massive number of short reads of length ≥100 bp.
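The idea of a frequency distribution of STRs over short reads can be illustrated with a minimal sketch. It is not the TRhist implementation: the function names (`canonical_unit`, `longest_str`, `str_histogram`) are hypothetical, and the sketch assumes exact repeats only, at least three copies of a 2–6 nt unit, and rotation-equivalent units merged into one key. The method described here also handles approximate STRs with mutations, which this sketch does not attempt.

```python
import re
from collections import Counter

# A 2-6 nt unit followed by at least two more exact copies of itself.
# The lazy quantifier makes the regex try the shortest unit first.
_STR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{2,}")


def canonical_unit(unit):
    """Return the lexicographically smallest rotation of a repeat unit."""
    return min(unit[i:] + unit[:i] for i in range(len(unit)))


def longest_str(read):
    """Return (canonical unit, run length in nt) for the longest exact STR.

    Returns None when the read has no qualifying repeat. Matches are
    non-overlapping, so this is an approximation, not an exhaustive search.
    """
    best = None
    for match in _STR_PATTERN.finditer(read):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs such as AAAAAA
        run = len(match.group(0))
        if best is None or run > best[1]:
            best = (canonical_unit(unit), run)
    return best


def str_histogram(reads):
    """Count reads by (canonical unit, run length) of their longest STR."""
    hist = Counter()
    for read in reads:
        hit = longest_str(read.upper())
        if hit is not None:
            hist[hit] += 1
    return hist
```

In such a histogram, a repeat unit with many reads whose run length approaches the full read length would be a candidate for an expansion longer than a single read, which is the situation where paired-end information and long reads become necessary.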
Summary
Many genetic disorders are caused by or associated with short tandem repeats (STRs), repetitive elements of 2–6 nt. STRs have been observed in a variety of genomic regions such as untranslated regions (UTRs), introns and promoters. Several expanded repeats in RNA, such as CUG, CCUG, CAG, CGG, AUUCU and UGGAA, are associated with hereditary diseases and are known to accumulate in nuclear RNA foci, in which several proteins are sequestered as the foci form (for a review, see Wojciechowska and Krzyzosiak, 2011). These RNA foci are thought to have a negative effect on host cells, leading to disorders in cellular pathways (Wojciechowska and Krzyzosiak, 2011).