Abstract

Mutual information, a general measure of the relatedness between two random variables, has been actively used in the analysis of biomedical data. The mutual information between two discrete variables is conventionally calculated from their joint probabilities, which are estimated from the frequency of observed samples in each combination of variable categories. However, this conventional approach is no longer efficient for discrete variables with many categories, which are common in large-scale biomedical data such as diagnosis codes, drug compounds, and genotypes. Here, we propose a method that provides stable estimates of the mutual information between discrete variables with many categories. Simulation studies showed that the proposed method reduced the estimation errors by 45-fold and improved the correlation coefficients with true values by 99-fold, compared with the conventional calculation of mutual information. The proposed method was also demonstrated through a case study of diagnostic data in electronic health records. This method is expected to be useful in the analysis of various biomedical data with discrete variables.

Highlights

  • Mutual information is a statistic to measure the relatedness between two variables[1]

  • We propose a method to calculate the mutual information between two discrete variables with many categories by using recursive adaptive partitioning

Summary

Introduction

Mutual information is a statistic that measures the relatedness between two variables[1]. Although the calculation of mutual information is straightforward given the joint probability distribution of two variables, in many cases it must be estimated from random samples without knowledge of the underlying distribution. While the mutual information between discrete and continuous variables has been studied recently[15], the mutual information between discrete variables has not been studied much. This is mainly because the estimation of the joint probabilities between discrete variables has been considered straightforward: one simply counts the number of samples in each combination of categories of the two variables[15]. Examples of available data sets are listed in Supplementary Table S1. These data types commonly have many categories or discrete values, among which orders and distances are ill-defined.
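
To make the conventional counting approach concrete, the following is a minimal Python sketch of the frequency-based (plug-in) estimate, I(X;Y) = sum over (x, y) of p(x,y) log[p(x,y) / (p(x) p(y))], with each probability replaced by an observed frequency. It illustrates only the conventional calculation, not the method proposed in the paper. The function name `plugin_mutual_information` and the toy inputs are assumptions made for illustration.

```python
from collections import Counter
from math import log


def plugin_mutual_information(xs, ys):
    """Frequency-based (plug-in) estimate of mutual information, in nats.

    Hypothetical helper for illustration: joint and marginal probabilities
    are estimated by counting samples in each combination of categories.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("at least one sample is required")

    joint = Counter(zip(xs, ys))  # counts for each (x, y) combination
    count_x = Counter(xs)         # marginal counts for x
    count_y = Counter(ys)         # marginal counts for y

    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log(p(x,y) / (p(x) p(y))) with every p = count / n
        mi += (c / n) * log(c * n / (count_x[x] * count_y[y]))
    return mi


# Two binary variables whose counts factorize exactly: the estimate is 0.
independent = plugin_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])

# Identical binary variables: the estimate equals the entropy, log(2).
identical = plugin_mutual_information([0, 0, 1, 1], [0, 0, 1, 1])

# Eight samples spread over eight categories each, one sample per pair.
# Every observed pair is unique, so the estimate reaches log(8) even
# though eight samples carry little evidence of any dependence.
sparse = plugin_mutual_information(list(range(8)), [3, 7, 0, 5, 1, 6, 2, 4])
```

The last case shows one way the counting approach can mislead when the number of categories is large relative to the sample size: most category combinations are observed once or not at all, so the estimated joint frequencies are unreliable.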
