Abstract

Suffix trees, which are trie structures that represent the suffixes of sequences (e.g., strings), are widely used for sequence search in application domains such as text data mining, bioinformatics, and computational biology. In particular, suffix trees are useful in bioinformatics applications because they can search for similar subsequences and extract frequent sequence patterns efficiently. In recent years, efficient construction of a suffix tree that allows faster sequence searches has become one of the most important challenges, because the number and size of the data sets stored in sequence databases have been increasing exponentially. This paper proposes a novel parallelization model for approximate sequence matching on a multi-core CPU using disk-based suffix trees, which are built on hard disks rather than in memory. In the proposed parallelization model, we divide an entire sequence database into two or more sub-databases called partitions. For each partition, we build a disk-based suffix tree, and we define a task as an approximate sequence matching on one disk-based suffix tree. Moreover, the proposed parallelization model includes a multiple buffering management system to avoid conflicts among CPU cores. We evaluated the proposed parallelization model using an actual amino acid sequence database on a PC. The experimental results show a substantial improvement in computation performance.

Highlights

  • Suffix trees [1], [2], [3], which are trie structures that represent the suffixes of sequences, are widely used for sequence search in application domains such as text mining, pattern matching, bioinformatics, and computational biology

  • We build a suffix tree on hard disks, and a task is defined as an approximate sequence matching on one disk-based suffix tree, which is built from a partition

  • An approximate sequence matching on the sequence database SD can be performed in parallel using these disk-based suffix trees, each of which is built from one partition, because the approximate sequence matching for each partition can be performed independently


Summary

INTRODUCTION

Suffix trees [1], [2], [3], which are trie structures that represent the suffixes of sequences (e.g., strings), are widely used for sequence search in application domains such as text mining, pattern matching, bioinformatics, and computational biology. Efficient construction of a suffix tree that allows high-speed sequence searches has become one of the most important challenges, because the number and size of the data sets stored in sequence databases have been increasing exponentially [4], [5], [6]. This study focuses on parallel approximate sequence matching using disk-based suffix trees on a multi-core CPU. Its goal is to develop an efficient parallelization model for approximate sequence matching over large-scale sequence databases. Such a model is necessary because a multi-core CPU has characteristics that differ from those of a conventional CPU. We propose a novel parallelization model for approximate sequence matching on disk-based suffix trees that uses data partition-based parallelism.
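The partition-based idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes the whole database fits in memory as a list of strings, replaces the disk-based suffix tree of each partition with a plain dynamic-programming scan (Sellers' algorithm), and uses a thread pool in place of dedicated CPU cores. All function names (`partition`, `search_partition`, `parallel_search`) are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(sequences, num_partitions):
    """Split the database round-robin into lists of (sequence id, sequence) pairs."""
    indexed = list(enumerate(sequences))
    return [indexed[i::num_partitions] for i in range(num_partitions)]


def min_edit_distance_to_substring(pattern, text):
    """Sellers' algorithm: smallest edit distance between pattern and any substring of text."""
    # prev[i] is the best distance between pattern[:i] and a substring of text
    # that ends at the previous text position.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        cur = [0]  # a match may start anywhere in text, so the empty prefix costs 0
        for i, p in enumerate(pattern, start=1):
            cur.append(min(prev[i] + 1,                # extra character in text
                           cur[i - 1] + 1,             # pattern character left unmatched
                           prev[i - 1] + (p != ch)))   # match or substitution
        best = min(best, cur[-1])
        prev = cur
    return best


def search_partition(part, pattern, max_errors):
    """One task: approximate matching restricted to a single partition.

    Assumption: a linear scan stands in for the partition's disk-based suffix tree.
    """
    return [seq_id for seq_id, seq in part
            if min_edit_distance_to_substring(pattern, seq) <= max_errors]


def parallel_search(sequences, pattern, max_errors, num_partitions=2):
    """Run one task per partition and merge the matching sequence ids."""
    parts = partition(sequences, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        futures = [pool.submit(search_partition, p, pattern, max_errors)
                   for p in parts]
        hits = [seq_id for f in futures for seq_id in f.result()]
    return sorted(hits)


if __name__ == "__main__":
    database = ["MKTAYIAKQR", "GAVLIMKTAY", "PPPPPPPP", "MKSAYIAK"]
    print(parallel_search(database, "MKTAY", max_errors=1))
```

Because each task reads only its own partition, the tasks share no mutable state and their results can be merged in any order. In CPython, threads do not run CPU-bound work on several cores at once, so a process pool would be the closer analogue of one task per core; the thread pool is used here only to keep the sketch short.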

RELATED WORK
Suffix Tree
Disk-based Suffix Tree
Approximate Sequence Matching
Data Parallelism
Task Model
Multiple Buffering
Algorithm
Experimental Setup
Experiment 1
Experiment 2
CONCLUSION