High throughput BLAST algorithm using Spark and Cassandra

Josep Lluis Lerida, Fernando Cores, Fernando Guirado

doi:10.1007/s11227-020-03338-3

Abstract

The rise of high-resolution and high-throughput sequencing technologies has driven the emergence of new fields of application such as precision medicine. However, it has also increased the storage and processing requirements of bioinformatics tools, which can only be met by high-performance, massive data processing infrastructures. Such infrastructures allow the development of scalable, efficient and reliable bioinformatics tools. In this paper, a new implementation of the Basic Local Alignment Search Tool (BLAST) algorithm is presented. Our proposal, named Sparky-Blast, uses the Cassandra database to store the different reference datasets and the Apache Spark processing framework to calculate the indexes and process the queries. This approach avoids the bottleneck of the original BLAST version, which is limited to the resources of a single machine. Sparky-Blast can use the distributed resources of a big data cluster to process queries in parallel, thus improving both the response time and the system throughput. At the same time, the use of a distributed architecture such as Hadoop provides unlimited scalability in terms of both hardware infrastructure and performance.

Full Text