Pblat: a multithread blat algorithm speeding up aligning sequences to genomes

Meng Wang, Lei Kong

doi:10.1186/s12859-019-2597-8

Abstract

Background: The blat program is a widely used sequence alignment tool. It is especially useful for aligning long sequences and for gapped mapping, which other fast sequence mappers designed for short reads cannot perform properly. However, blat is single threaded, and mapping whole-genome or whole-transcriptome sequences to reference genomes can take days, making it unsuitable for large-scale sequencing projects and iterative analysis. Here, we present pblat (parallel blat), a parallelized blat algorithm with multithread and cluster computing support, which rapidly fine-maps large-scale DNA/RNA sequences against genomes.

Results: The pblat algorithm takes advantage of modern multicore processors, and its run time decreases significantly as the number of threads increases. pblat uses almost the same amount of memory as blat. The results generated by pblat are identical to those generated by blat. The pblat tool is easy to install and runs on Linux and Mac OS systems. In addition, we provide a cluster version of pblat (pblat-cluster) that runs on computing clusters with MPI support.

Conclusion: pblat is open source and freely available to non-commercial users. It is easy to install and easy to use. pblat and pblat-cluster would facilitate high-throughput mapping of large-scale genomic and transcript sequences to reference genomes with both high speed and high precision.

Highlights

The blat program is a widely used sequence alignment tool
With the increasing quantity of sequences generated by high throughput sequencing projects, blat cannot meet the speed requirements needed for large-scale analysis and regularly updated annotations
When used to map the whole transcriptome sequences of a vertebrate to a reference genome, blat can take days to finish. This is because the blat algorithm is single threaded and does not take full advantage of modern multicore processors

Summary

Results

Performance evaluation of pblat

We evaluated the performance of pblat with different numbers of threads and compared it with the original blat. The speedup was consistent with the results of the preceding analysis. These results showed that pblat could significantly accelerate the alignment of long sequencing reads generated by the Oxford Nanopore and PacBio SMRT (Single Molecule Real-Time) sequencing platforms. For pblat-cluster, the run time decreased significantly as the number of computing nodes increased (Fig. 2). The blat program took 5.8 h with one thread on one node to align all the test sequences to the reference genome, and pblat with 12 threads on one node took 44 min. With 15 nodes, pblat-cluster reduced the run time to 6.8 min, a 6.47x speedup over pblat with 12 threads on one node and a 51.18x speedup over blat.
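The speedup figures above are ratios of run times, and the arithmetic can be checked with a short sketch. The function and variable names below are illustrative and not part of pblat. The last value is not reported in the text; it is derived here from the two reported speedups.

```python
def speedup(baseline_min: float, faster_min: float) -> float:
    """Return how many times faster the second run is than the baseline."""
    if faster_min <= 0:
        raise ValueError("run time must be positive")
    return baseline_min / faster_min


# Run times reported in the text, in minutes.
pblat_12_threads_min = 44.0        # pblat, 12 threads, one node
pblat_cluster_15_nodes_min = 6.8   # pblat-cluster, 15 nodes

# Speedup of pblat-cluster (15 nodes) over pblat (12 threads, one node).
cluster_vs_pblat = speedup(pblat_12_threads_min, pblat_cluster_15_nodes_min)
print(f"pblat-cluster vs pblat: {cluster_vs_pblat:.2f}x")

# Derived, not reported: the speedup of pblat (12 threads) over blat implied
# by the two reported speedups (51.18x and 6.47x over the same cluster run).
pblat_vs_blat = 51.18 / 6.47
print(f"pblat (12 threads) vs blat: {pblat_vs_blat:.2f}x")
```

By this calculation, 12 threads on one node give roughly an 8x speedup over single-threaded blat, and the 15-node cluster run accounts for the remaining factor of about 6.5.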

Background

Availability of data and materials

Not applicable