Abstract
Background
As one of the most studied genome rearrangements, tandem repeats have a considerable impact on the genetic background of inherited diseases. Many methods designed for tandem repeat detection on reference sequences obtain high-quality results. However, in a de novo context, where no reference sequence is available, tandem repeat detection remains a difficult problem. The short reads obtained with second-generation sequencing methods are not long enough to span regions that contain long repeats. This length limitation was tackled by the long reads obtained with third-generation sequencing platforms such as Pacific Biosciences technologies. Nevertheless, the gain in read length came with a significant increase in the error rate. The main objective of current studies on long reads is to handle this high error rate, which reaches up to 16%.

Methods
In this paper we present MixTaR, the first de novo method for tandem repeat detection that combines the high quality of short reads with the large length of long reads. Our hybrid algorithm uses the set of short reads for tandem repeat pattern detection based on a de Bruijn graph. These patterns are then validated using the long reads, and the tandem repeat sequences are constructed using local greedy assemblies.

Results
MixTaR is tested with both simulated and real reads from complex organisms. For a complete analysis of its robustness to errors, we use short and long reads with different error rates. The results are then analysed in terms of the number of tandem repeats detected and the length of their patterns.

Conclusions
Our method shows high precision and sensitivity. With low false positive rates even for highly erroneous reads, MixTaR is able to detect accurate tandem repeats with pattern lengths varying within a significant interval.
Highlights
As one of the most studied genome rearrangements, tandem repeats have a considerable impact on genetic backgrounds of inherited diseases
Let c be a cycle in G_k(SR) such that c is formed by an exact tandem repeat (ETR) ε from the target DNA fragment D
In order to limit the number of short reads r used in the local assemblies, if |s| < g, we extend s by successive concatenations of p such that s is the shortest string for which |s| ≥ g
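The seed extension described in the last highlight can be illustrated with a minimal sketch. It assumes that s is a seed string, p is the tandem repeat pattern, and g is a length threshold; the function name `extend_seed` is hypothetical and is not taken from MixTaR.

```python
def extend_seed(s: str, p: str, g: int) -> str:
    """Append whole copies of the pattern p to the seed s until len(s) >= g.

    Illustrative sketch only; the name and signature are assumptions.
    """
    if not p:
        raise ValueError("pattern p must be non-empty")
    # Stopping at the first copy that reaches the threshold yields the
    # shortest extension of s by whole copies of p with len(s) >= g.
    while len(s) < g:
        s += p
    return s


if __name__ == "__main__":
    print(extend_seed("ACGACG", "ACG", 10))
```

A seed that already satisfies |s| ≥ g is returned unchanged, which matches the condition "if |s| < g" in the text.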
Summary
We present MixTaR, our solution to the DE NOVO HYBRID TANDEM REPEAT DETECTION problem. For a vertex v, we consider that we have the highest chance of finding an errorless occurrence in LR between the vertices of the detected cycles from v. This additional condition is used in the second step of MixTaR, to validate the obtained patterns using the set LR. In order to deduce a potential ETR pattern from c, the remaining frequency of the arcs of c has to respect Property 1. In this case, we construct the ETR ε of the pattern p spelled by the cycle c. TR sequence assembly: because of the high error rate of long reads, the partial TR detected in the second step of MixTaR contains a significant number of erroneous bases. We output the TR obtained on the seeds along with the TR from the contigs.
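To make the idea of a pattern "spelled by a cycle" concrete, here is a minimal sketch of how a tandem repeat in short reads closes a cycle in a de Bruijn graph. It assumes the common convention in which vertices are (k-1)-mers and each k-mer contributes one arc; the paper's exact definition of G_k(SR), its arc frequencies, and Property 1 are not reproduced here. The function names and the toy reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: vertices are (k-1)-mers, arcs come from k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def cycle_through(graph, start):
    """Return one simple cycle through `start` as a vertex list, or None."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for nxt in sorted(graph.get(node, ())):
            if nxt == start:
                return path  # the arc node -> start closes the cycle
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return None


def spell_pattern(cycle):
    """Spell the repeat pattern: the first base of each vertex on the cycle."""
    return "".join(vertex[0] for vertex in cycle)


if __name__ == "__main__":
    # Toy short reads covering a tandem repeat of the pattern ACG (assumed data).
    reads = ["TTACGACG", "CGACGACGGA"]
    graph = de_bruijn(reads, k=4)
    cycle = cycle_through(graph, "ACG")
    print(cycle, spell_pattern(cycle))
```

In this toy case the cycle has as many vertices as the pattern has bases. The sketch only finds a cycle; it does not validate the pattern against long reads or assemble the repeat sequence.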