Abstract

Measuring the correlation between gene expression profiles across multiple samples, and thereby identifying inter-gene interactions, is a critical technique for building co-expression networks. Calculating the Pearson’s Correlation Coefficient (PCC) matrix is computationally intensive and often takes too much processing time. Therefore, in this work, Big Data techniques, namely MapReduce and Spark, are employed in a cloud environment to calculate the PCC matrix and find the dependencies between genes measured by high-throughput microarrays. The running time of each phase is compared for the MapReduce and Spark approaches. Both techniques can dramatically speed up the computation, allowing users to handle highly intensive processing. However, Spark performs better than MapReduce because it processes data in the main memory of the worker nodes and avoids unnecessary disk I/O. Spark achieved an 80-fold speedup for calculating the PCC of 22777 genes, whereas MapReduce attained barely an 8-fold speedup.

Highlights

  • Gene co-expression networks (GCN) [1] are gaining attention nowadays as useful representations of biologically interesting interactions among genes.

  • For each technique, multithreading splits tasks into threads that execute in parallel. The running times obtained with the proposed MapReduce and Spark algorithms are shown in Fig. 3 and Fig. 4, respectively.

  • MapReduce has a major drawback: it must work with the entire data set in the Hadoop Distributed File System (HDFS) on completion of each task, which increases the time and cost of processing data. Accordingly, we found Spark to be faster than MapReduce by 10.2613%.

Introduction

Gene co-expression networks (GCN) [1] are gaining attention nowadays as useful representations of biologically interesting interactions among genes. The correlation between genes can be estimated from their expression values and visualized as networks that reveal the interactions between co-expressed genes. Such gene expression values are now easy to obtain from publicly accessible genomics data banks for RNA-seq and high-throughput microarrays. Although many paradigms and platforms from parallel computing technology have been intensively reviewed and compared in previous studies [3], [4], [5], [6], how to utilize big data techniques for processing gene expression profiles and finding their relationships remains an open question.
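
To make the correlation step concrete, the following is a minimal single-machine sketch of a pairwise PCC matrix in plain Python. It is not the MapReduce or Spark implementation evaluated in this work; the function names (`pearson`, `pcc_matrix`), the gene labels, and the expression values are all hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles need the same length and at least 2 samples")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        return float("nan")  # PCC is undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


def pcc_matrix(profiles):
    """Symmetric PCC matrix as a nested dict: matrix[gene_i][gene_j].

    `profiles` maps a gene label to its list of expression values,
    one value per sample. The diagonal is set to 1.0, which assumes
    that no profile is constant.
    """
    genes = list(profiles)
    matrix = {g: {g: 1.0} for g in genes}
    # Each unordered gene pair is an independent unit of work.
    for g1, g2 in combinations(genes, 2):
        r = pearson(profiles[g1], profiles[g2])
        matrix[g1][g2] = r
        matrix[g2][g1] = r
    return matrix


# Toy, made-up expression values: 3 hypothetical genes, 4 samples each.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
result = pcc_matrix(toy)
```

The number of gene pairs grows quadratically: n genes give n(n-1)/2 pairs, so the 22777 genes mentioned in the abstract correspond to 259,384,476 pairwise correlations. Because each pair can be computed independently of the others, the work lends itself to being distributed across worker nodes.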
