MiCS-P: Parallel mutual-information computation of big categorical data on Spark

Junli Li, Chaowei Zhang, Jifu Zhang, Xiao Qin, Lihua Hu

doi:10.1016/j.jpdc.2021.12.002

Abstract

Mutual information can effectively measure the correlation between categorical attributes. However, computing it is computationally intensive and time-consuming for very large data sets and for data sets with differing distributions, because it involves computing marginal entropies, probability distributions, and so on. Spark is a fast, general-purpose parallel framework designed specifically for large-scale data processing. The main motivation of this paper is to provide an intelligent method for parallel mutual-information calculation in the Spark computing environment while maintaining synchronization between the computing nodes. The proposed method, named MiCS-P, runs on different numbers of computing nodes and achieves significant speedup on data sets of different dimensions and sizes. The MiCS-P algorithm adopts a column-wise transformation scheme, which is conducive to calculating mutual information between a large number of feature pairs. To alleviate the load imbalance that causes long execution times, we implement a two-phase virtual partitioning scheme running on Spark.
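To make the quantity in the abstract concrete, the following is a minimal single-machine sketch of mutual information between categorical columns, using the standard identity I(X;Y) = H(X) + H(Y) - H(X,Y). It is not the MiCS-P implementation: it does not use Spark and includes no partitioning scheme. The function names (`entropy`, `mutual_information`, `pairwise_mi`) and the row-tuple input format are assumptions chosen for illustration. The transposition in `pairwise_mi` only loosely mirrors the idea of a column-wise layout, in which each feature pair can be scored independently.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from observed frequencies."""
    joint = list(zip(x, y))  # each (x_i, y_i) pair is one joint category
    return entropy(x) + entropy(y) - entropy(joint)


def pairwise_mi(rows):
    """Mutual information for every pair of columns in row-oriented data.

    The rows are first transposed into columns, so each feature pair
    is an independent unit of work (hypothetical helper, illustration only).
    """
    columns = list(zip(*rows))
    return {
        (i, j): mutual_information(columns[i], columns[j])
        for i, j in combinations(range(len(columns)), 2)
    }


# Toy data set: column 1 is determined by column 0; column 2 is independent of both.
rows = [
    ("a", "x", "p"),
    ("a", "x", "q"),
    ("b", "y", "p"),
    ("b", "y", "q"),
]
result = pairwise_mi(rows)
```

In this toy data set, the pair (0, 1) shares one full bit of information, while the pairs involving column 2 share none. The number of feature pairs grows quadratically with the number of columns, which is one reason a parallel formulation becomes attractive for high-dimensional data.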
