Multiple Buffering for Parallel Approximate Sequence Matching using Disk-based Suffix Tree on Multi-core CPU

Keiichi Tamura, Yousuke Watanuki, Hajime Kitakami, Yoshifumi Takahashi

doi:10.7603/s40601-013-0022-0

Open Access

https://doi.org/10.7603/s40601-013-0022-0

Abstract

Suffix trees, which are trie structures that represent the suffixes of sequences (e.g., strings), are widely used for sequence search in application domains such as text data mining, bioinformatics, and computational biology. In particular, suffix trees are useful in bioinformatics applications, because they can search for similar subsequences and extract frequent sequence patterns efficiently. In recent years, efficient construction of a suffix tree that allows faster sequence searches has become one of the most important challenges, because the number and size of the data stored in sequence databases have been increasing exponentially. This paper proposes a novel parallelization model for approximate sequence matching on a multi-core CPU that uses disk-based suffix trees, which are built on hard disks rather than in memory. In the proposed parallelization model, we divide an entire sequence database into two or more sub-databases called partitions. For each partition, we build a disk-based suffix tree and define a task as an approximate sequence matching on one disk-based suffix tree. Moreover, the proposed parallelization model involves a multiple buffering management system to avoid conflicts among CPU cores. We evaluated the proposed parallelization model using an actual amino acid sequence database on a PC. The experimental results show a substantial improvement in computation performance.

Highlights

Suffix trees [1], [2], [3], which are trie structures that represent the suffixes of sequences, are widely used for sequence search in application domains such as text mining, pattern matching, bioinformatics, and computational biology.
We build a suffix tree on hard disks, and a task is defined as an approximate sequence matching on one disk-based suffix tree, which is built from a partition
An approximate sequence matching on the sequence database SD can be performed in parallel using these disk-based suffix trees, each of which is built on a partition, because approximate sequence matching for each partition can be performed independently.

Summary

INTRODUCTION

Suffix trees [1], [2], [3], which are trie structures that represent the suffixes of sequences (e.g., strings), are widely used for sequence search in application domains such as text mining, pattern matching, bioinformatics, and computational biology. Efficient construction of a suffix tree that allows high-speed sequence searches has become one of the most important challenges, because the number and size of the data stored in sequence databases have been increasing exponentially [4], [5], [6]. This study focuses on parallel approximate sequence matching using disk-based suffix trees on a multi-core CPU. Its goal is to develop an efficient parallelization model for approximate sequence matching over large-scale sequence databases. Such a model is needed because a multi-core CPU has characteristics that differ from those of a conventional CPU. To this end, a novel parallelization model for approximate sequence matching on disk-based suffix trees is proposed, based on data partition-based parallelism.
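The partition-and-task idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the index here is a naive in-memory suffix trie rather than a disk-based suffix tree, "approximate" is assumed to mean a bounded number of substitutions (Hamming distance), the partitions are formed round-robin, and the multiple buffering system is not modeled. All function names (`build_suffix_trie`, `approx_search`, `partition`, `parallel_search`) are hypothetical. Threads are used only to show the task structure; in CPython they do not give true CPU parallelism for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor


def build_suffix_trie(sequences):
    """Build a naive in-memory suffix trie (a stand-in for a disk-based suffix tree).

    Each node records every (seq_id, start) whose suffix passes through it.
    """
    root = {"children": {}, "hits": []}
    for seq_id, seq in sequences:
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "hits": []})
                node["hits"].append((seq_id, start))
    return root


def approx_search(trie, query, max_mismatches):
    """Return sorted (seq_id, start) pairs where the query matches with
    at most max_mismatches substitutions (assumed similarity measure)."""
    results = set()
    stack = [(trie, 0, 0)]  # (node, depth in query, mismatches so far)
    while stack:
        node, depth, mismatches = stack.pop()
        if depth == len(query):
            results.update(node["hits"])
            continue
        for ch, child in node["children"].items():
            cost = mismatches + (ch != query[depth])
            if cost <= max_mismatches:  # prune branches that exceed the budget
                stack.append((child, depth + 1, cost))
    return sorted(results)


def partition(sequences, n_partitions):
    """Split the sequence database round-robin into sub-databases (partitions)."""
    return [sequences[i::n_partitions] for i in range(n_partitions)]


def parallel_search(sequences, query, max_mismatches, n_partitions=2):
    """One index per partition; one task = one approximate search on one index."""
    tries = [build_suffix_trie(part) for part in partition(sequences, n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        per_partition = pool.map(
            lambda trie: approx_search(trie, query, max_mismatches), tries
        )
        # Partitions are disjoint, so merging is a simple concatenation.
        return sorted(hit for hits in per_partition for hit in hits)


if __name__ == "__main__":
    database = [("s1", "MKVLAA"), ("s2", "MKILGA"), ("s3", "AAKVLM")]
    print(parallel_search(database, "KVL", max_mismatches=1))
```

Because every sequence belongs to exactly one partition, the merged result is the same as searching a single index built over the whole database; only the scheduling of the per-partition tasks changes.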

RELATED WORK

Suffix Tree

Disk-based Suffix Tree

Approximate Sequence Matching

Data Parallelism

Task Model

Multiple Buffering

Algorithm

Experimental Setup

Experiment 1

Experiment 2

CONCLUSION

Journal: GSTF Journal on Computing (JoC)
Publication Date: Dec 1, 2013
Citations: 20
License type: CC BY 2.0

Similar Papers

Parallel processing of approximate sequence matching using disk-based suffix tree on multi-core CPU
Yosuke Watanuki ... Hajime Kitakami
01 Jul 2013

Genome-scale disk-based suffix tree indexing
Benjarath Phoophakdee ... Mohammed J Zaki
11 Jun 2007

Solving All-Pairs Suffix Prefix – Theory and Practice
Maan Haj Rachid ... Qutaibah Malluhi
01 Jan 2015

Efficient searches for similar subsequences of different lengths in sequence databases
S Park ... C Hsu
28 Feb 2000
