Abstract

Mutual information can effectively measure the correlation between categorical attributes. However, computing it is computationally intensive and time-consuming for very large data sets and for data sets with differing distributions, because it involves steps such as estimating probability distributions and computing marginal entropies. Spark is a fast, general-purpose parallel framework designed for large-scale data processing. The main motivation of this paper is to provide an intelligent method for parallel mutual information calculation in the Spark computing environment while maintaining synchronization between computing nodes. The proposed method, named MiCS-P, runs on different numbers of computing nodes and gives significant speedup on data sets of different dimensions and sizes. The MiCS-P algorithm adopts a column-wise transformation scheme, which is conducive to calculating mutual information between a large number of feature pairs. To alleviate the load imbalance that causes long execution times, we implement a two-phase virtual partitioning scheme running on Spark.
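
For readers unfamiliar with the quantity being parallelized, the following is a minimal single-machine sketch of mutual information between two categorical columns, computed from the marginal and joint entropies as I(X;Y) = H(X) + H(Y) - H(X,Y). It is not the MiCS-P algorithm and does not use Spark; the function names `entropy`, `mutual_information`, and `pairwise_mutual_information` are hypothetical and chosen only for illustration. The pairwise helper operates on a column-wise layout (one list per feature), which is assumed here simply because it makes iterating over feature pairs straightforward.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(counts, n):
    """Shannon entropy (in bits) of an empirical distribution given as counts."""
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) in bits."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    h_x = entropy(Counter(xs), n)            # marginal entropy of X
    h_y = entropy(Counter(ys), n)            # marginal entropy of Y
    h_xy = entropy(Counter(zip(xs, ys)), n)  # joint entropy of (X, Y)
    return h_x + h_y - h_xy


def pairwise_mutual_information(columns):
    """Mutual information for every unordered pair of named columns.

    `columns` maps a feature name to its list of categorical values.
    """
    return {
        (a, b): mutual_information(columns[a], columns[b])
        for a, b in combinations(sorted(columns), 2)
    }
```

The number of feature pairs grows quadratically with the number of columns, and each pair requires a full pass over the rows, which is why the computation becomes expensive on large data sets and is a natural candidate for distribution across computing nodes.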
