SSRMMD: A Rapid and Accurate Algorithm for Mining SSR Feature Loci and Candidate Polymorphic SSRs Based on Assembled Sequences.

Xiangjian Gou,Zhiqiang Wang,Shihang Liu,Jian Ma,Yaxi Liu,Shifan Yu,Tao Liu,Caixia Li,Haoran Shi,Guangdeng Chen

doi:10.3389/fgene.2020.00706


Open Access

https://doi.org/10.3389/fgene.2020.00706


Journal: Frontiers in Genetics	Publication Date: Jul 27, 2020
Citations: 15	License type: CC BY 4.0

Affiliation: Sichuan Agricultural University

Abstract

Microsatellites or simple sequence repeats (SSRs) are short tandem repeats of DNA widespread in genomes and transcriptomes of diverse organisms and are used in various genetic studies. Few software programs that mine SSRs can be further used to mine polymorphic SSRs, and these programs have poor portability, have slow computational speed, are highly dependent on other programs, and have low marker development rates. In this study, we develop an algorithm named Simple Sequence Repeat Molecular Marker Developer (SSRMMD), which uses improved regular expressions to rapidly and exhaustively mine perfect SSR loci from any size of assembled sequence. To mine polymorphic SSRs, SSRMMD uses a novel three-stage method to assess the conservativeness of SSR flanking sequences and then uses the sliding window method to fragment each assembled sequence to assess its uniqueness. Furthermore, molecular biology assays support the polymorphic SSRs identified by SSRMMD. SSRMMD is implemented using the Perl programming language and can be downloaded from https://github.com/GouXiangJian/SSRMMD.

Highlights

To make full use of each thread, we proposed a novel optimal allocation algorithm that distributes assembled sequences evenly across threads according to sequence length (TOS): (a) sort the sequences by TOS; (b) assign the longest i sequences to the i threads; (c) sort the threads by their total TOS; (d) assign the next sequence to the thread with the smallest total TOS; (e) re-sort the threads after step (d) using the insertion sorting algorithm; and (f) repeat steps (d) and (e) until all sequences are allocated.
When Simple Sequence Repeat Molecular Marker Developer (SSRMMD) was used to mine polymorphic simple sequence repeats (SSRs), both time and space were linearly associated with the amount of data (Figures 2C,D).
These results suggest that the SSRMMD algorithm has linear time complexity [T(n) = O(n)] and linear space complexity [S(n) = O(n)].
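Steps (a) to (f) can be sketched in a few lines. The following is a minimal illustration in Python, not the SSRMMD source (which is written in Perl); the function name `allocate`, its parameters, and the use of `bisect.insort` for the single insertion-sort step are assumptions made for this example.

```python
from bisect import insort


def allocate(lengths, n_threads):
    """Greedy length-balanced allocation of sequences to threads.

    lengths:   dict mapping a sequence id to its length (hypothetical input).
    n_threads: number of worker threads.
    Returns one (total_length, [sequence ids]) tuple per thread.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    # (a) sort sequences from longest to shortest
    order = sorted(lengths, key=lengths.get, reverse=True)
    # (b) give each thread one of the longest sequences
    seeds, rest = order[:n_threads], order[n_threads:]
    # Each entry is (total, thread index, members); the unique index
    # breaks ties so the member lists are never compared.
    threads = [(lengths[s], i, [s]) for i, s in enumerate(seeds)]
    # (c) keep the threads sorted by ascending total length
    threads.sort()
    for seq in rest:
        # (d) the first entry is the thread with the smallest total
        total, idx, members = threads.pop(0)
        members.append(seq)
        # (e) put the updated thread back at its sorted position,
        #     which is a single insertion-sort step
        insort(threads, (total + lengths[seq], idx, members))
        # (f) the loop repeats (d) and (e) until nothing is left
    threads.sort(key=lambda t: t[1])  # report in thread-index order
    return [(total, members) for total, _, members in threads]


result = allocate({"a": 10, "b": 8, "c": 5, "d": 4, "e": 3}, 2)
print(result)
```

With the toy lengths above, the two threads end up with totals of 14 and 16, which is as balanced as this greedy rule gets for that input.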

Summary

Introduction

Owing to their codominant inheritance, multi-allelic nature, transferability, and ease of analysis via PCR (Varshney et al., 2005; Ramu et al., 2009; Kaur et al., 2015), simple sequence repeat (SSR) markers have been successfully adopted in various genetic studies such as quantitative trait loci mapping (Qin et al., 2015; Wang et al., 2017), genotyping (Gramazio et al., 2018), genetic diversity (Nachimuthu et al., 2015; Zhou R. et al., 2015), and DNA fingerprinting (Zhang et al., 2015). Numerous algorithms and software programs have been reported for mining perfect SSRs. For instance, SSRIT (Temnykh, 2001), MISA (Thiel et al., 2003), and GMATo (Wang et al., 2013) use regular expressions based on the greedy matching algorithm to mine SSRs. SA-SSR (Pickett et al., 2016) uses a suffix array-based algorithm to mine SSRs. Kmer-SSR (Pickett et al., 2017) uses Kmer decomposition to identify SSRs. PERF (Avvaru et al., 2017) matches each potential substring against a set of pre-computed repeat strings. Imperfect SSR detection algorithms have also been reported, such as IMEx (Mudunuri and Nagarajaram, 2007) and Krait (Du et al., 2017). These programs share several undesirable features. First, they rely on additional software or modules, often with complex software configuration; second, they have poor portability and can only be run on Linux or Windows platforms; third, they have slow computational speed; and most importantly, they cannot directly identify polymorphic SSRs.
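To make the regular-expression approach concrete, the sketch below mines perfect SSRs with greedy back-reference matching, one pattern per motif length. It is a generic illustration in Python and not the pattern set used by SSRIT, MISA, GMATo, or SSRMMD; the thresholds in `MIN_REPEATS` and the names `is_compound` and `find_perfect_ssrs` are hypothetical choices for this example.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of repeats.
MIN_REPEATS = {1: 10, 2: 7, 3: 6, 4: 5, 5: 4, 6: 4}


def is_compound(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. ATAT)."""
    n = len(motif)
    return any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeats) tuples, 1-based inclusive."""
    seq = seq.upper()
    hits = []
    for k, m in min_repeats.items():
        # ([ACGT]{k}) captures a candidate motif; \1{m-1,} greedily
        # consumes as many further tandem copies as possible.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, m - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_compound(motif):
                continue  # already reported under the shorter motif
            repeats = len(match.group(0)) // k
            hits.append((match.start() + 1, match.end(), motif, repeats))
    return sorted(hits)


example = "GGC" + "AT" * 8 + "CCG" + "A" * 12 + "TTG"
ssrs = find_perfect_ssrs(example)
print(ssrs)
```

Scanning the sequence once per motif length keeps the patterns simple; the compound-motif check prevents a run such as `AAAAAAAAAAAA` from being reported again as a dinucleotide or trinucleotide repeat.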

Methods

Results

Conclusion