Implementation and evaluation of a multivariate abstraction-based, interval-based dynamic time-warping method as a similarity measure for longitudinal medical records

Yuval Shahar,Matan Lion

doi:10.1016/j.jbi.2021.103919

Abstract

ObjectivesA common prerequisite for tasks such as classification, prediction, clustering and retrieval of longitudinal medical records is a clinically meaningful similarity measure that considers both [multiple] variable (concept) values and their time. Currently, most similarity measures focus on raw, time-stamped data as these are stored in a medical record. However, clinicians think in terms of clinically meaningful temporal abstractions, such as “decreasing renal functions”, enabling them to ignore minor time and value variations and focus on similarities among the clinical trajectories of different patients. Our objective was to define an abstraction- and interval-based methodology for matching longitudinal, multivariate medical records, and rigorously assess its value, versus the option of using just the raw, time-stamped data. MethodsWe have developed a new methodology for determination of the relative distance between a pair of longitudinal records, by extending the known dynamic time warping (DTW) method into an interval-based dynamic time warping (iDTW) methodology. The iDTW methodology includes (A): A three-steps interval-based representation (iRep) method: [1] abstracting the raw, time-stamped data of the longitudinal records into clinically meaningful interval-based abstractions, using a domain-specific knowledge base, [2] scoping the period of comparison of the records, [3] creating from the intervals a symbolic time series, by partitioning them into a predetermined temporal granularity; (B) An interval-based matching (iMatch) method to match each relevant pair of multivariate longitudinal records, each represented as multiple series of short symbolic intervals in the determined temporal granularity, using a modified DTW version. EvaluationThree classification or prediction tasks were defined: (1) classifying 161 records of oncology patients as having had autologous versus allogenic bone-marrow transplantation; (2) classifying the longitudinal records of 125 hepatitis patients as having B or C hepatitis; and (3) predicting micro- or macro-albuminuria in the second year, for 151 diabetes patients who were followed for five years. The raw, time-stamped, multivariate data within each medical record, for one, two, or three concepts out of four or five concepts judged as relevant in each medical domain, were abstracted into clinically meaningful intervals using the Knowledge-Based Temporal-Abstraction method, using previously acquired knowledge. We focused on two temporal-abstraction types: (1) State abstractions, which discretize a concept’s raw value into a predetermined range (e.g., LOW or HIGH Hemoglobin); and (2) Gradient abstractions, which indicate the trend of the concept’s value (e.g., INCREASING, DECREASING Hemoglobin value). We created all of the combinations of either uni-dimensional (State or Gradient) or multi-dimensional (State and Gradient) abstractions, of all of the concepts used. Classification of a record was determined by using a majority of the k-Nearest-Neighbors (KNN) of the given record, k ranging over the odd numbers (to break ties) from 1 to N, N being the size of the training set. We have experimented with all possible configurations of the parameters that our method uses. Overall, a total of 75,936 experiments were performed: 33,600 in the Oncology domain, 28,800 in the Hepatitis domain, and 13,536 in the Diabetes domain. Each experiment involved the performance of a 10-fold Cross Validation to compute the mean performance of a particular iDTW method-configuration set of settings, for a specific subset of one, two, or three concepts out of all of the domain-specific concepts relevant to the classification or prediction task on which the experiment focuses. We measured for each such experimental combination the Area Under the Curve (AUC) and the optimal Specificity/Sensitivity ratio using Youden's Index. We then aggregated the experiments by the types of unidimensional or multidimensional abstractions used in them (including the use of only raw concepts as a special case); for example, two state abstractions of different concepts, and one gradient abstraction of a third concept. We compared the mean AUC when using each such feature representation, or combination of abstractions, across all possible method-setting configurations, to the mean AUC when using as a feature representation, for the same task, only raw concepts, also across all possible method-setting configurations. Finally, we applied a paired t-test, to determine whether the mean difference between the accuracy of each temporal-abstraction representation, across all concept and configuration combinations, and the respective raw-concept combinations, across all concept subset and configuration combinations, is significant (P < 0.05). ResultsThe mean performance of the classification and prediction tasks when using, as a feature representation, the various temporal-abstraction combinations, was significantly higher than that performance when using only raw data. Furthermore, in each domain and task, there existed at least one representation using interval-based abstractions whose use led, on average (over all concept subset combinations and method configurations) to a significantly better performance than the use of only subsets of the raw time-stamped data. In seven of nine combinations of domain type (out of three) and number of concepts used (one, two, or three), the variance of the AUCs (for all representations and configurations) was considerably higher across all raw-concept subsets, compared to all abstract combinations. Increasing the number of features used by the matching task enhanced performance. Using multi-dimensional abstractions of the same concept further enhanced the performance. When using only raw data, increasing the number of neighbors monotonically increased the mean performance (over all concept combinations and method configurations) until reaching an optimal saddle-point aroundN; when using abstractions, however, optimal mean performance was often reached after matching only five nearest neighbors. ConclusionsUsing multivariate and multidimensional interval-based, abstraction-based similarity measures is feasible, and consistently and significantly improved the mean classification and prediction performance in time-oriented domains, using DTW-inspired methods, compared to the use of only raw, time-stamped data. It also made the KNN classification more effective. Nevertheless, although the mean performance for the abstract representations was higher than the mean performance when using only raw-data concepts, the actual optimal classification performance in each domain and task depends on the choice of the specific raw or abstract concepts used as features.

Full Text