Abstract

The maximal information coefficient (MIC) captures both linear and nonlinear correlations between variable pairs. In this paper, we propose the BackMIC algorithm for MIC estimation. BackMIC adds a searching back process on the equipartitioned axis to obtain a better grid partition than the original implementation algorithm, ApproxMaxMI. Like the ChiMIC algorithm, it terminates the grid search with the χ²-test instead of the maximum number of bins B(n, α). Results on simulated data show that BackMIC maintains the generality of MIC and gives more reasonable grid partitions and MIC values for both independent and dependent variable pairs, with comparable running times. Moreover, it is robust to different values of α in B(n, α). MIC calculated by BackMIC shows improved statistical power and equitability. We used (1-MIC) as the distance measure in the K-means algorithm to cluster cancer/normal samples. On four cancer datasets, the MIC values calculated by BackMIC gave better clustering results, indicating that the correlations between samples measured by BackMIC were more credible than those measured by the other algorithms.

Highlights

  • Correlation analysis has important applications in data mining, such as disease diagnosis [1,2], public management [3,4] and financial market analysis [5,6]

  • Starting from an equipartition of one axis into n_y bins, the BackMIC algorithm uses dynamic programming to locate an optimal partition of the x-axis that achieves the largest normalized mutual information under the restriction of the χ²-test, similar to the ChiMIC algorithm [13]

  • The BackMIC algorithm adds a searching back process to obtain an optimal partition for the equipartitioned axis, making it more likely to obtain the true maximal information coefficient (MIC) value
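The χ²-test stopping rule mentioned in the highlights can be illustrated with a small sketch. The text here does not specify which contingency table ChiMIC or BackMIC actually tests, so the setup below is an assumption: a candidate endpoint splits one x-interval into a left and a right part, each y-bin contributes a row of (left, right) counts, and the split is kept only if Pearson's χ² statistic exceeds the 5% critical value. The function names and the fixed 0.05 level are hypothetical choices for illustration, not the authors' implementation.

```python
# Upper 5% critical values of the chi-square distribution, keyed by degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi2_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def split_is_significant(left_counts, right_counts):
    """Decide whether a candidate split of one x-interval is worth keeping.

    left_counts[k] and right_counts[k] are the numbers of points in y-bin k
    that fall to the left and right of the candidate endpoint (assumed setup).
    """
    # Drop y-bins that are empty in this interval; they carry no information.
    rows = [(l, r) for l, r in zip(left_counts, right_counts) if l + r > 0]
    if len(rows) < 2 or sum(left_counts) == 0 or sum(right_counts) == 0:
        return False  # nothing to test: the split cannot add information
    df = len(rows) - 1  # (rows - 1) * (columns - 1) with two columns
    if df not in CHI2_CRIT_05:
        raise ValueError("critical value not tabulated for df=%d" % df)
    return chi2_statistic(rows) > CHI2_CRIT_05[df]
```

For example, `split_is_significant([20, 0], [0, 20])` accepts the split because the two halves have very different y-distributions, while `split_is_significant([10, 10], [10, 10])` rejects it because the halves look identical.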


Summary

Introduction

Correlation analysis has important applications in data mining, such as disease diagnosis [1,2], public management [3,4] and financial market analysis [5,6]. If B(n, α) is set low, the MIC can only capture simple correlation patterns; by contrast, a high B(n, α) will cause a non-zero score even for independent variables [7]. To solve this problem, Chen et al. [13] proposed the ChiMIC algorithm (available at https://github.com/chenyuan0510/Chi-MIC), in which one axis is equipartitioned and the partition of the other axis is terminated by the χ²-test. We propose an improved approximation algorithm for MIC estimation, called BackMIC. This algorithm adds a searching back process on the equipartitioned axis to remove the restriction of equipartition, and it controls the search process with the χ²-test on both the y- and x-axes. Results on simulated and real data demonstrate that the BackMIC algorithm performs better in measuring the correlations between independent and dependent variable pairs than the AppMIC and ChiMIC algorithms.
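To make the role of B(n, α) concrete, the following minimal sketch scores a variable pair using equipartitioned grids only. It relies on the standard MIC definition (mutual information on an n_x-by-n_y grid, normalized by log min(n_x, n_y), maximized over grids within the budget B(n, α) = n^α). It is not ApproxMaxMI, ChiMIC or BackMIC: it optimizes neither axis, and the function names, the default α = 0.6 and the `<=` boundary convention are assumptions made for illustration.

```python
import math
from collections import Counter


def equipartition(values, bins):
    """Assign each value to one of `bins` equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * bins // len(values)
    return labels


def normalized_mi(x_labels, y_labels, nx, ny):
    """Mutual information of the gridded data divided by log(min(nx, ny))."""
    n = len(x_labels)
    joint = Counter(zip(x_labels, y_labels))
    px = Counter(x_labels)
    py = Counter(y_labels)
    mi = 0.0
    for (a, b), count in joint.items():
        mi += (count / n) * math.log(count * n / (px[a] * py[b]))
    return mi / math.log(min(nx, ny))


def grid_budget(n, alpha=0.6):
    """Maximum number of grid cells, B(n, alpha) = n ** alpha."""
    return n ** alpha


def equipartition_score(x, y, alpha=0.6):
    """Best normalized MI over equipartitioned grids with nx * ny <= B(n, alpha)."""
    budget = grid_budget(len(x), alpha)
    best = 0.0
    for nx in range(2, int(budget) // 2 + 1):
        for ny in range(2, int(budget) // 2 + 1):
            if nx * ny > budget:
                break  # larger ny only makes the grid bigger
            score = normalized_mi(equipartition(x, nx), equipartition(y, ny), nx, ny)
            best = max(best, score)
    return best
```

Raising α enlarges the budget, which admits finer grids; this is the mechanism by which a high B(n, α) can produce non-zero scores on independent data, since small cells fit sampling noise.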

Comparison of grids and estimated MICs for independent variable pairs
Comparison of grids and estimated MICs for dependent variable pairs
Comparison of robustness
Comparison of statistical power
Comparison of equitability
Comparison of computational cost
Simulated data
Real datasets
Methods
BackMIC algorithm
K-means clustering algorithm
Conclusion