Abstract
Post-sequencing genomic data analysis has become a major challenge as next-generation sequencing technologies advance by leaps and bounds. The data-intensive and compute-intensive nature of genome analysis makes cluster computing an attractive choice for building efficient solutions. This paper presents HiGene, a high-performance genome analysis platform that exploits big data technology to greatly increase genomic data processing power. HiGene reconstructs the genome analysis pipeline by exploiting both multi-core and multi-node parallelization using Apache Spark, and employs two key techniques to further boost performance. First, a dynamic computing resource re-allocator is implemented, which allows flexible on-demand resource allocation for operations inside tasks. Second, an efficient skew mitigation approach is proposed, which automatically identifies data skew and computation skew and resolves them through task repartitioning and resource reallocation, respectively. HiGene has been evaluated with a whole human genome dataset on a 10-node Huawei 5885 cluster. Experimental results show that HiGene reduces the total running time on a whole genome sequence dataset from days to nearly one hour. Furthermore, it is two times faster than state-of-the-art cluster-based approaches.
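To make the idea of resolving data skew through repartitioning concrete, the following is a minimal sketch in plain Python, not HiGene's implementation. The abstract does not state how skew is detected, so the rule used here (a partition counts as skewed when it holds more than `factor` times the median number of records) is an assumption. The names `find_skewed` and `split_skewed` are likewise hypothetical.

```python
from statistics import median
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def find_skewed(partitions: Sequence[Sequence[T]], factor: float = 2.0) -> List[int]:
    """Return indices of partitions larger than `factor` x the median size.

    The median-based threshold is an assumed heuristic, not taken from the paper.
    """
    if not partitions:
        return []
    sizes = [len(p) for p in partitions]
    threshold = factor * median(sizes)
    return [i for i, n in enumerate(sizes) if n > threshold]


def split_skewed(partitions: Sequence[Sequence[T]], factor: float = 2.0) -> List[List[T]]:
    """Split each skewed partition into chunks of roughly the median size.

    Partitions that are not skewed are copied through unchanged.
    """
    if not partitions:
        return []
    target = max(1, int(median(len(p) for p in partitions)))
    skewed = set(find_skewed(partitions, factor))
    result: List[List[T]] = []
    for i, part in enumerate(partitions):
        if i in skewed:
            # Cut the oversized partition into consecutive chunks of `target` records.
            for start in range(0, len(part), target):
                result.append(list(part[start:start + target]))
        else:
            result.append(list(part))
    return result
```

In a Spark setting, the record counts per partition would come from the running job, and each resulting chunk would be scheduled as its own task. Computation skew, which the abstract says is handled by reallocating resources instead, is not covered by this sketch.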