A fast algorithm for constructing suffix arrays for DNA alphabets

Samir Elmougy,Magdi Zakaria,Zeinab Rabea,Sara El-Metwally

doi:10.1016/j.jksuci.2022.04.015

Abstract

The continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for sequencing data analysis and processing. Suffix array is one of data structures that are used to construct the Burrows-Wheeler transform (BWT) for long length genomes. Building a suffix array itself is an expensive-resource process since the computations are dominant by sorting suffixes in a lexical order. Most of the suffix array construction algorithms consider the general and integer alphabets without utilizing special cases for fixed-size ones such as DNA alphabets. In this paper, we exploit the nature of four-sized DNA alphabets and utilize their predefined lexicographical ordering in order to construct suffix arrays for genomic data correctly and efficiently. The suffix array construction algorithm for DNA alphabets is evaluated using three real data sets with different lengths ranging from small E-coli genome to long length Homo sapiens GRCh38.p13 chromosomes. For long length genomes, their corresponding sequence is divided into parts (i.e. reads) with a minimum overlap length, the suffix array is computed for each part separately, and finally all partially computed arrays are merged together into a single one. We studied the effects of varying the reads/overlap lengths on the running time of the proposed suffix array construction algorithm and conclude that the minimum overlap length should be equal to the average length of the longest common prefix between the adjacent parts.

Full Text