Computational subphenotyping, a data-driven approach to understanding disease subtypes, is a prominent topic in medical research. Numerous ongoing studies are dedicated to developing advanced computational subphenotyping methods for cross-sectional data. However, the potential of time-series data remains underexplored. Here, we propose a Multivariate Levenshtein Distance (MLD) that can account for correlation among multiple discrete features in time-series data. Our algorithm has two distinct components: the MLD itself, and an optimal threshold score that enhances sensitivity in discriminating between pairs of instances. We applied the proposed distance metric with the k-means clustering algorithm to derive temporal subphenotypes from time-series data of biomarkers and treatment administrations from 1039 critically ill patients with COVID-19, and compared its effectiveness with that of standard methods. In conclusion, the Multivariate Levenshtein Distance metric is a novel method for quantifying the distance between sequences of multiple discrete features in time-series data, and it demonstrated superior clustering performance compared with competing time-series distance metrics.
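To make the idea of an edit distance over several discrete features concrete, the following is a minimal illustrative sketch, not the MLD as defined in the paper. It assumes each time step is a tuple of discrete feature values, that insertions and deletions cost 1, and that substituting one time step for another costs the fraction of features that differ. The `threshold` parameter is a hypothetical stand-in for the threshold score mentioned above: here, per-step mismatches at or below it are simply treated as a match. The names `substitution_cost` and `multivariate_levenshtein` are likewise illustrative.

```python
from typing import Sequence, Tuple

# One time step: one discrete value per feature (assumed representation).
Step = Tuple[str, ...]


def substitution_cost(a: Step, b: Step, threshold: float = 0.0) -> float:
    """Fraction of features that differ between two time steps.

    Hypothetical rule: a mismatch fraction at or below `threshold`
    is treated as a perfect match (cost 0).
    """
    if len(a) != len(b):
        raise ValueError("time steps must have the same number of features")
    mismatch = sum(x != y for x, y in zip(a, b)) / len(a)
    return 0.0 if mismatch <= threshold else mismatch


def multivariate_levenshtein(
    s: Sequence[Step], t: Sequence[Step], threshold: float = 0.0
) -> float:
    """Levenshtein-style dynamic program over sequences of feature tuples."""
    # prev[j] holds the distance between the first i-1 steps of s
    # and the first j steps of t.
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(
                min(
                    prev[j] + 1.0,  # delete a step from s
                    curr[j - 1] + 1.0,  # insert a step from t
                    prev[j - 1] + substitution_cost(a, b, threshold),
                )
            )
        prev = curr
    return prev[-1]


# Two patients, two features per time step (biomarker level, treatment state).
patient_a = [("high", "on"), ("high", "off")]
patient_b = [("high", "on"), ("low", "off")]
print(multivariate_levenshtein(patient_a, patient_b))  # 0.5
```

Because the substitution cost is computed jointly over all features at a time step, two sequences that differ in one feature are closer than two that differ in every feature, which is one simple way a distance can reflect multiple discrete features at once. A pairwise distance of this kind could then be supplied to a clustering routine that accepts custom distances.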