A framework for expert-driven subpopulation discovery and evaluation using subspace clustering for epidemiological data

Tommy Hielscher, Uli Niemann, Bernhard Preim, Henry Völzke, Till Ittermann, Myra Spiliopoulou

doi:10.1016/j.eswa.2018.07.003

Abstract

Objective: We propose an intelligent system that assists epidemiology experts in analysing the data of a population-based epidemiological study, in identifying relevant variables for an outcome and subpopulations with increased disease prevalence, and in validating the findings concerning variables and subpopulations in a further, expert-specified cohort. At present, the study of an outcome on a population-based cohort is hypothesis-driven, i.e. the expert must specify the variables to be studied. Our approach instead operates in a data-driven, semi-automated way, enabling the expert to identify variables of relevance and generate hypotheses on them.

Methods: Our system DIVA supports the Discovery, Inspection and VAlidation of subpopulations with increased prevalence of an outcome, without requiring parameter tuning. DIVA takes as input the cohort of an epidemiological population-based study with all variables specified in the study’s protocol, as well as inputs from the expert on the similarity of a small number of cohort participants. DIVA uses semi-supervised subspace clustering and subspace construction to identify sets of variables (subspaces) that promote participant similarity with respect to the outcome and with respect to the expert inputs, and then discovers subpopulations with increased outcome prevalence in those subspaces (DIVA module “DRESS”). DIVA uses visual analytics techniques to assist the expert in juxtaposing, filtering and inspecting the characteristics of these subpopulations (web-based DIVA module “D-INSPECTOR”).
If the expert has access to a second cohort on a comparable population, DIVA aligns the cohort used for discovery to this second cohort, and then checks whether the subpopulations found in the original cohort are also present in the second one (DIVA module “VALIDATOR”).

Results: We applied DIVA to the third wave (SHIP-2) of the SHIP-CORE cohort of the Study of Health in Pomerania (Völzke et al., 2011) for the liver disorder “hepatic steatosis”, and to the first wave (TREND-0) of the SHIP-TREND cohort of the same study for the thyroid gland disorder “goitre”. We found that most of the subpopulations extracted automatically, and subsequently ranked and filtered by the modules of DIVA, had significantly higher disease prevalence than the general population. We varied the amount of input required from the expert to drive the subpopulation extraction process and found that a very small amount of information, namely the outcome of as few as 4 cohort participants, is sufficient for the identification of several relevant variables and subpopulations. We used a subset of TREND-0 for the validation on goitre and the complete TREND-0 for the validation on hepatic steatosis and found that the significant difference in prevalence for the identified subpopulation also holds in the validation data.

Conclusions: We have shown that DIVA discovers subpopulations and variables of importance with respect to an outcome, while requiring a very small amount of expert input. Each combination of variables and each subpopulation corresponds to a hypothesis, the validation of which would have required substantial human effort. Thus, DIVA allows for a more effective exploitation of population-based data, not fully automated but driven by the expert and without the need for technical parameter tuning.

A shortcoming of DIVA’s design is that it requires a specific type of expert input, namely “constraints” on the similarity of pairs of participants.
Currently, we generate the constraints with a naive utility that is based on random sampling, but we are working on an interactive algorithm that would allow the epidemiology expert to inspect a small selection of study participants and make statements about their similarity.

The present version of DIVA considers a single wave of the cohort data, ignoring the evolution of the population during the horizon of the study. Hence, subspace and subpopulation discovery do not take into account changes in the importance of variables. We are currently working on incorporating algorithms that derive additional variables from the longitudinal data and use them in the Discovery module.
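
To make the notion of similarity “constraints” concrete, the following is a minimal Python sketch of how a random-sampling utility of the kind mentioned above might turn the known outcomes of a few participants into pairwise constraints. The abstract does not describe the actual utility, so everything here is an assumption for illustration: the function name `sample_pairwise_constraints`, the input format, and the rule that two sampled participants with the same outcome form a “must-link” pair and two with different outcomes form a “cannot-link” pair.

```python
import random
from itertools import combinations


def sample_pairwise_constraints(outcomes, n_participants=4, seed=None):
    """Draw a random sample of participants and derive pairwise constraints.

    outcomes: dict mapping a participant id to its known outcome label.
    Returns (must_link, cannot_link), two lists of id pairs.
    Hypothetical helper; not the utility used in DIVA.
    """
    ids = sorted(outcomes)
    if n_participants > len(ids):
        raise ValueError("cannot sample more participants than are available")

    rng = random.Random(seed)
    chosen = rng.sample(ids, n_participants)

    must_link, cannot_link = [], []
    for a, b in combinations(chosen, 2):
        # Assumed rule: same outcome means "similar", different means "dissimilar".
        if outcomes[a] == outcomes[b]:
            must_link.append((a, b))
        else:
            cannot_link.append((a, b))
    return must_link, cannot_link


if __name__ == "__main__":
    # Toy data: 20 made-up participants with a binary outcome.
    toy_outcomes = {f"p{i:02d}": i % 2 for i in range(20)}
    ml, cl = sample_pairwise_constraints(toy_outcomes, n_participants=4, seed=1)
    print("must-link:", ml)
    print("cannot-link:", cl)
```

Under these assumptions, the outcomes of 4 participants yield 6 pairwise constraints in total, which illustrates how little labelled information such a scheme needs as input.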

Full Text