Abstract

With the development of next-generation sequencing (NGS), DNA/RNA sequencing has become cheaper and more efficient. Today, a whole human genome can be sequenced for under $1,000, creating opportunities for large-scale bioinformatic analysis of big datasets. However, most existing bioinformatic analysis tools are designed for single-server computing platforms and are not suited to processing such big datasets. As Hadoop MapReduce and Spark gain popularity as cluster-computing platforms for big data processing, more and more bioinformatic applications are starting to explore cluster computing for large-scale data analysis. In this paper we present an in-depth experimental study on deploying Spark clusters for high-performance bioinformatic short sequence reconstruction. Our experimental results enable us to answer a number of challenging and frequently asked questions regarding the efficient management of bioinformatic data analysis services on Spark systems. Example questions include: How should a big dataset be split into multiple partitions, and how should data partitions and bioinformatic analysis tasks be distributed across a Spark cluster to carry out a high-performance distributed analysis job? What types of memory models are effective for bioinformatic data analysis services on a Spark cluster? Why do different bioinformatic data analysis operations exhibit different throughput on the same Spark cluster? We conjecture that this experimental study not only demonstrates the feasibility of high-performance bioinformatic data analysis on the Spark platform, but will also help bioinformatic application developers make more informed decisions on the design and configuration of Spark clusters and on managing and tuning the parameters of the Spark runtime system to enhance the performance of large-scale big data analytics.

