Abstract

Post-sequencing genomic data analysis has become a major challenge as next-generation sequencing technologies advance rapidly. Because genome analysis is both data-intensive and compute-intensive, cluster computing is an attractive basis for efficient solutions. This paper presents HiGene, a high-performance genome analysis platform that applies big data technology to accelerate genomic data processing. HiGene reconstructs the genome analysis pipeline on Apache Spark, exploiting both multi-core and multi-node parallelism, and employs two key techniques to further improve performance. First, a dynamic computing resource re-allocator enables flexible, on-demand resource allocation for operations inside tasks. Second, an efficient skew mitigation approach automatically identifies data skew and computation skew, resolving the former through task repartitioning and the latter through resource reallocation. HiGene has been evaluated with a whole human genome dataset on a 10-node Huawei 5885 cluster. Experimental results show that HiGene reduces the total running time on a whole genome sequence dataset from days to nearly one hour. Furthermore, it is two times faster than state-of-the-art cluster-based approaches.
