SLDMS: A Tool for Calculating the Overlapping Regions of Sequences.

Yu Chen, Dongliang You, Tianjiao Zhang, Guohua Wang

doi:10.3389/fpls.2021.813036

Abstract

Contig assembly is one of the most important steps in genome assembly. It requires computing the overlapping regions among a large number of DNA sequences, a calculation that usually takes a long time, so running time is an important indicator for evaluating contig assembly algorithms. Existing methods for processing overlapping regions of sequences are too slow. We therefore propose SLDMS, a method for processing sequence overlapping regions based on a suffix array and a monotonic stack, which effectively improves the efficiency of this step. In computing sequence overlap intervals, SLDMS runs much faster than Canu and Flye; on some data sets in which most sequencing errors occur at the two ends of the reads, its running time is only about one-tenth that of the other two methods.

Highlights

Because of the limitations of existing gene sequencing technology, the entire gene sequence cannot be obtained directly. Instead, existing sequencing methods are used to sequence the genes of the species under study into sequence fragments, and genome assembly then restores the original genes.
For data error correction, we suggest that third-generation PacBio sequencing data can be self-corrected (Hon et al., 2020), or that second-generation Illumina data can be used to correct third-generation PacBio data (Mahmoud et al., 2017), for example with PBCR (Koren et al., 2012) in the well-known Celera Assembler software (Schatz, 2006; Denisov et al., 2008) or with the LoRDEC error correction tool (Leena and Eric, 2014).
When processing reads with mismatches, SLDMS collects data for a read several times while maintaining the monotonic stack, so stored result information may be updated by subsequent maintenance; the program is therefore designed to separate data processing from result output.
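The general idea of combining sorted suffixes with a monotonic stack can be illustrated with a small sketch. This is not the SLDMS implementation: it handles exact matches only, uses a naively sorted list of suffixes in place of a real suffix array, and the function names and the `min_len` parameter are hypothetical. While scanning the suffixes in lexicographic order, the stack holds the proper suffixes that are prefixes of the current entry, with lengths non-decreasing from bottom to top; whenever a whole read is reached, every entry on the stack is a suffix-prefix overlap with that read. Results are collected first and returned for output afterward.

```python
from typing import List, Tuple


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def suffix_prefix_overlaps(reads: List[str], min_len: int = 1) -> List[Tuple[int, int, int]]:
    """Return (source, target, length) triples where a proper suffix of
    reads[source] equals a prefix of reads[target] (exact matches only)."""
    # Naive stand-in for a suffix array: every suffix of every read, sorted.
    # The boolean sorts proper suffixes (False) before an identical whole read (True).
    entries = sorted(
        (read[start:], start == 0, rid)
        for rid, read in enumerate(reads)
        for start in range(len(read))
    )
    stack: List[Tuple[int, int]] = []  # (suffix length, read id), lengths non-decreasing
    overlaps: List[Tuple[int, int, int]] = []
    prev_text = ""
    for text, is_whole, rid in entries:
        # Suffixes longer than the shared prefix with the previous entry
        # cannot be prefixes of the current entry, so pop them.
        h = common_prefix_len(prev_text, text)
        while stack and stack[-1][0] > h:
            stack.pop()
        if is_whole:
            # Everything left on the stack is a prefix of this whole read.
            for length, src in stack:
                if src != rid and length >= min_len:
                    overlaps.append((src, rid, length))
        else:
            stack.append((len(text), rid))
        prev_text = text
    return sorted(overlaps)


if __name__ == "__main__":
    reads = ["ACGTAC", "TACGGA", "GGAACG"]
    for src, dst, length in suffix_prefix_overlaps(reads, min_len=3):
        print(f"read {src} -> read {dst}: overlap of {length} bases")
```

The naive sort makes this sketch quadratic or worse in the total read length, so it is suitable only for illustrating the stack maintenance, not for real sequencing data.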

Summary

INTRODUCTION

Because of the limitations of existing gene sequencing technology, the entire gene sequence cannot be obtained directly. Instead, existing sequencing methods are used to sequence the genes of the species under study into sequence fragments, and genome assembly then restores the original genes. The Flye software (Lin et al., 2016) uses the ABruijn algorithm (Lin et al., 2016) to combine the overlap-layout-consensus (OLC) and de Bruijn graph (DBG) algorithms, generates its own A-Bruijn graph (ABG), and obtains the overlapping regions of the nodes in the graph; some other assembly software can complete the same work. These tools usually spend considerable time on sequence alignment. SLDMS can be integrated into the genomic analysis process.

METHODS

RESULTS

DISCUSSION

DATA AVAILABILITY STATEMENT