CalcGen Sequence Assembler Using a Spatio-temporally Efficient DNA Sequence Search Algorithm

Kyong Oh Yoon,Sung-Bae Cho

doi:10.1016/j.procs.2013.10.016

Abstract

The advent of ultra-high-throughput sequencing technology produces an enormous amount of bio-sequence information. Also, the current advances in the bio-industry bring forward the era of personalized medicine using individual genome information. However, the analysis of massive number of bio-sequences requires large storage, so that analysis sometimes needs supercomputer and novel software that can handle such volume of sequence information. For that type of analysis, several sequence match algorithms have been devised in terms of alignment and assembly, which are fundamental for analyzing bio- sequences. Those algorithms regard nucleotide sequences as strings and compare characters one-by-one during analysis of sequences. They use hash index tables, de Bruijn graph, Burrows-Wheeler transform method, and so on. In this paper, for time and space efficient DNA searching, we propose a simple algorithm that transforms base sequence into k-mer integer array and then we analyze the integer array transformed by unit search operator and non-unit search operator, resulting in a storage space reduction of about 0.28 fold. Furthermore, based on the proposed algorithm, we have developed a sequence analysis program called CalcGen assembler, and show the usefulness of the program with several experiments.

Full Text