Comparing MapReduce and Spark in Computing the PCC Matrix in Gene Co-expression Networks

Nagwan Abdel Samee,Nada Hassan Osman,Rania Ahmed Abdel Azeem Abul Seoud

doi:10.14569/ijacsa.2021.0120937


Open Access

https://doi.org/10.14569/ijacsa.2021.0120937


Abstract

Correlating gene expression profiles across multiple samples to identify inter-gene interactions is a critical technique in co-expression networking. Calculating the Pearson’s Correlation Coefficient (PCC) matrix is computationally intensive and often takes too much processing time. Therefore, in this work, Big Data techniques, namely MapReduce and Spark, have been employed in a cloud environment to calculate the PCC matrix and find the dependencies between genes measured in high-throughput microarrays. The running time of each phase has been compared between the MapReduce and Spark approaches. Both techniques can dramatically speed up the computation, allowing users to handle highly intensive processing. However, Spark has yielded better performance than MapReduce, as it performs the processing in the main memory of the worker nodes and avoids unnecessary disk I/O operations. Spark has yielded an 80-fold speedup for calculating the PCC of 22777 genes, whereas MapReduce attained barely an 8-fold speedup.

Highlights

Gene co-expression networks (GCN) [1] are gaining attention nowadays as useful representations of biologically interesting interactions among genes
Multithreading for each technique splits tasks into threads that execute in parallel. The running times obtained using the proposed MapReduce and Spark algorithms are shown in Fig. 3 and Fig. 4, respectively
MapReduce has a major drawback: it must operate on the entire data set in the Hadoop Distributed File System (HDFS) on the completion of each task, which increases the time and cost of processing data. We found that Spark is faster than MapReduce by 10.2613%

Summary

Introduction

Gene co-expression networks (GCN) [1] are gaining attention nowadays as useful representations of biologically interesting interactions among genes. The correlation between genes can be estimated based on their expression values and can be visualized via networks that reveal the interactions between co-expressed genes. Utilizing such gene expression values is currently effortless using the publicly accessible genomics data banks for RNA-seq and high-throughput microarrays. Although many paradigms and platforms from parallel computing technology have been intensively reviewed and compared in previous studies [3], [4], [5], [6], there is still an open question regarding the use of big data techniques in processing gene expression profiles and finding their relationships.
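
The pairwise correlation described above can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not the authors' MapReduce or Spark implementation; the function names (`pearson`, `pcc_matrix`) and the toy expression values are assumptions introduced here for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        return float("nan")  # constant profile: the PCC is undefined
    return cov / sqrt(var_x * var_y)


def pcc_matrix(profiles):
    """Build a symmetric gene-by-gene PCC matrix as a dict of dicts.

    `profiles` maps a gene name to its expression values across samples.
    The diagonal is set to 1.0, assuming no profile is constant.
    """
    genes = list(profiles)
    matrix = {g: {g: 1.0} for g in genes}
    # Each unordered gene pair is computed once and mirrored.
    for g1, g2 in combinations(genes, 2):
        r = pearson(profiles[g1], profiles[g2])
        matrix[g1][g2] = r
        matrix[g2][g1] = r
    return matrix


# Hypothetical toy data (3 genes x 4 samples), not taken from the paper.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
pcc = pcc_matrix(expression)
```

The loop over gene pairs is the part that grows quadratically: for the 22777 genes mentioned in the abstract it would cover roughly 2.6 × 10^8 pairs, which is why distributing the pair computations across worker nodes is attractive.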

Methods

Results

Conclusion