Abstract

With recent advances in biomedical data technology, factor data such as diagnosis codes and genomic features, which can take tens to hundreds of discrete, unordered categorical values, have emerged. Although it is a fundamental problem in statistical analysis, estimating the probability distribution of such factor variables has not been studied much, because previous studies have mainly focused on continuous variables and on discrete factor variables with only a few categories, such as sex and race. In this work, we propose a nonparametric Bayesian procedure to estimate the probability distribution of factors with many categories. The proposed method was evaluated in simulation studies under various conditions and showed significant reductions in estimation error relative to conventional methods. In addition, the method was applied to diagnosis data from intensive care unit patients and generated interesting medical hypotheses. Overall, the results indicate that the proposed method will be useful in the analysis of biomedical factor data.

Highlights

  • Factor variables are a common data type in statistical analysis of biomedical data

  • A diagnosis for a patient in electronic health records is represented as a factor variable taking one of thousands of diagnosis codes

  • We propose a Bayesian estimation procedure with optional Polya tree (OPT) priors for the joint probability distribution of multivariate factor variables with many categories, to which kernel approaches cannot be directly applied


Summary

Introduction

Factor variables are a common data type in the statistical analysis of biomedical data. The factor variables considered in traditional biomedical analyses, such as sex, race, and treatment options, usually have only a few categories, and the number of categorical values is often much smaller than the number of observed samples. With advances in data generation and accumulation, factor variables that can take many categorical values have emerged in the analysis of various biomedical data. For example, a diagnosis for a patient in electronic health records is represented as a factor variable taking one of thousands of diagnosis codes. Electronic health records at many clinical sites also include medical operations and prescribed drugs, which can be described by factor variables with thousands of categorical values.
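
To make the contrast between few-category and many-category factors concrete, here is a minimal Python sketch. It is not the procedure proposed in this work; it only compares the plain empirical-frequency estimate with a simple additive-smoothing estimate (the posterior mean under a symmetric Dirichlet prior) on simulated data. The Zipf-like "true" distribution, the sample sizes, the function names, and the choice alpha = 1 are all illustrative assumptions.

```python
import random
from collections import Counter


def empirical_estimate(samples, num_categories):
    """Relative frequency of each category; unseen categories get 0."""
    counts = Counter(samples)
    n = len(samples)
    return [counts.get(k, 0) / n for k in range(num_categories)]


def smoothed_estimate(samples, num_categories, alpha=1.0):
    """Additive smoothing: posterior mean under a symmetric Dirichlet(alpha) prior.

    This is a generic baseline chosen for illustration, not the OPT prior.
    """
    counts = Counter(samples)
    denom = len(samples) + alpha * num_categories
    return [(counts.get(k, 0) + alpha) / denom for k in range(num_categories)]


def simulate(num_categories, num_samples, seed=0):
    """Draw samples from a hypothetical skewed (Zipf-like) distribution."""
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) for k in range(num_categories)]
    total = sum(weights)
    truth = [w / total for w in weights]
    samples = rng.choices(range(num_categories), weights=truth, k=num_samples)
    return truth, samples


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Many-category regime: more categories (1000) than samples (500).
truth, samples = simulate(num_categories=1000, num_samples=500)
emp = empirical_estimate(samples, 1000)
smooth = smoothed_estimate(samples, 1000)

unseen = sum(1 for p in emp if p == 0.0)
print("categories with zero empirical probability:", unseen)
print("TV distance, empirical:", total_variation(truth, emp))
print("TV distance, smoothed: ", total_variation(truth, smooth))
```

With 500 samples spread over 1,000 categories, at least half of the categories are necessarily unobserved, so the empirical estimate assigns them probability zero, whereas the smoothed estimate keeps every category positive. A factor such as sex or race, with a handful of categories and the same sample size, does not run into this problem.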

