Clustering datasets with demographics and diagnosis codes.

Haodi Zhong, Grigorios Loukides, Robert Gwadera

doi:10.1016/j.jbi.2019.103360

Open Access

https://doi.org/10.1016/j.jbi.2019.103360

Journal: Journal of Biomedical Informatics	Publication Date: Jan 3, 2020
Citations: 9	License type: elsevier-specific: oa user license

Affiliation: King's College London, Cardiff University

Abstract

Clustering data derived from Electronic Health Record (EHR) systems is important both for discovering relationships between the clinical profiles of patients and as a preprocessing step for analysis tasks, such as classification. However, the heterogeneity of these data makes existing clustering methods difficult to apply and calls for new clustering approaches. In this paper, we propose the first approach for clustering a dataset in which each record contains a patient's values in demographic attributes and their set of diagnosis codes. Our approach represents the dataset in a binary form in which the features are selected demographic values, as well as combinations (patterns) of frequent and correlated diagnosis codes. This representation enables measuring similarity between records using cosine similarity, an effective measure for binary-represented data, and finding compact, well-separated clusters through hierarchical clustering. Our experiments using two publicly available EHR datasets, comprising over 26,000 and 52,000 records, respectively, demonstrate that our approach is able to construct clusters with correlated demographics and diagnosis codes, and that it is efficient and scalable.
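
The pipeline outlined in the abstract (binary features, cosine similarity, hierarchical clustering) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the toy records, the feature lists, the function names `encode`, `cosine`, and `agglomerate`, and the choice of average linkage are all assumptions made for illustration. In particular, the demographic values and diagnosis-code patterns are hand-picked here, whereas the paper selects them from the data (frequent and correlated patterns).

```python
from itertools import combinations
from math import sqrt

# Toy records (hypothetical): demographic values plus a set of diagnosis codes.
records = [
    {"demo": {"gender=F", "age=60-69"}, "codes": {"250.00", "401.9"}},
    {"demo": {"gender=F", "age=60-69"}, "codes": {"250.00", "401.9", "272.4"}},
    {"demo": {"gender=M", "age=20-29"}, "codes": {"493.90"}},
    {"demo": {"gender=M", "age=20-29"}, "codes": {"493.90", "477.9"}},
]

# Assumption: features are chosen by hand. The paper selects demographic
# values and frequent, correlated code patterns automatically.
demo_features = ["gender=F", "gender=M", "age=20-29", "age=60-69"]
patterns = [frozenset({"250.00", "401.9"}), frozenset({"493.90"})]


def encode(record):
    """Binary vector: 1 if the record has the demographic value or
    contains every code of the pattern, else 0."""
    demo_bits = [1 if v in record["demo"] else 0 for v in demo_features]
    pattern_bits = [1 if p <= record["codes"] else 0 for p in patterns]
    return demo_bits + pattern_bits


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def agglomerate(vectors, k):
    """Naive agglomerative clustering with average-linkage cosine
    similarity, merging until k clusters of record indices remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > k:
        # Pick the most similar pair of clusters (a < b by construction).
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


vectors = [encode(r) for r in records]
clusters = agglomerate(vectors, 2)
print(clusters)
```

In this toy input, records that share both demographic values and a diagnosis-code pattern receive identical binary vectors, so they are merged first. The naive merge loop recomputes all pairwise linkages at every step and is only suitable for small examples, not for datasets of the size reported in the abstract.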

Full Text