Fast Anomaly Detection in Dynamic Clinical Datasets Using Near-Optimal Hashing with Concentric Expansions

Zeeshan Syed, Ilan Rubinfeld

doi:10.1109/icdmw.2010.88

Abstract

While rare clinical events, by definition, occur infrequently in a population, the consequences of these events can often be drastic. Unfortunately, developing risk stratification algorithms for these conditions typically requires collecting large volumes of data to capture enough positive and negative cases for training. This process is slow, expensive, and often burdensome to both patients and caregivers. In this paper, we propose an unsupervised machine learning approach to address this challenge and risk-stratify patients for adverse outcomes without the use of a priori knowledge or labeled training data. The key idea of our approach is to identify high-risk patients as anomalies in a population (i.e., patients lying in sparse regions of the feature space). We identify these cases through a novel algorithm that finds an approximate solution to the k-nearest neighbor problem using locality-sensitive hashing (LSH) based on p-stable distributions. Our algorithm is optimized to use multiple LSH searches, each with a geometrically increasing radius, to find the k-nearest neighbors of patients in a dynamically changing dataset where patients are being added or removed over time. When evaluated on data from the National Surgical Quality Improvement Program (NSQIP), this approach successfully identified patients at an elevated risk of mortality and rare morbidities. The LSH-based algorithm provided a substantial improvement in runtime over an exact k-nearest neighbor algorithm, while achieving similar accuracy.
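
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class names `PStableLSH` and `ConcentricKNN`, the Gaussian (2-stable) projections, the bucket width, the number of tables and hashes, the starting radius, the growth factor, and the use of the distance to the k-th neighbor as the anomaly score are all assumptions chosen for illustration. One hash index is kept per radius, the radii grow geometrically, and a query stops at the first radius whose candidate set holds at least k points inside that radius.

```python
import numpy as np
from collections import defaultdict


class PStableLSH:
    """LSH index for one search radius, using Gaussian (2-stable) projections."""

    def __init__(self, dim, radius, n_tables=10, n_hashes=4, width=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.radius = radius
        self.w = width * radius  # assumed: bucket width scales with the radius
        self.A = rng.normal(size=(n_tables, n_hashes, dim))
        self.b = rng.uniform(0.0, self.w, size=(n_tables, n_hashes))
        self.tables = [defaultdict(set) for _ in range(n_tables)]

    def _keys(self, x):
        # h(x) = floor((a . x + b) / w); n_hashes values form one key per table
        codes = np.floor((self.A @ x + self.b) / self.w).astype(int)
        return [tuple(row.tolist()) for row in codes]

    def insert(self, pid, x):
        for table, key in zip(self.tables, self._keys(x)):
            table[key].add(pid)

    def remove(self, pid, x):
        for table, key in zip(self.tables, self._keys(x)):
            table[key].discard(pid)

    def candidates(self, x):
        found = set()
        for table, key in zip(self.tables, self._keys(x)):
            found |= table.get(key, set())
        return found


class ConcentricKNN:
    """Approximate k-NN via LSH searches over geometrically increasing radii."""

    def __init__(self, dim, r0=0.5, growth=2.0, n_levels=6):
        self.points = {}
        self.levels = [PStableLSH(dim, r0 * growth**i, seed=i) for i in range(n_levels)]

    def add(self, pid, x):
        x = np.asarray(x, dtype=float)
        self.points[pid] = x
        for lvl in self.levels:
            lvl.insert(pid, x)

    def remove(self, pid):
        x = self.points.pop(pid)
        for lvl in self.levels:
            lvl.remove(pid, x)

    def kth_neighbor_distance(self, pid, k):
        """Assumed anomaly score: approximate distance to the k-th nearest neighbor."""
        x = self.points[pid]
        for lvl in self.levels:  # expand the search radius level by level
            cand = lvl.candidates(x) - {pid}
            dists = (float(np.linalg.norm(self.points[c] - x)) for c in cand)
            within = sorted(d for d in dists if d <= lvl.radius)
            if len(within) >= k:
                return within[k - 1]
        return float("inf")  # fewer than k neighbors found at every radius


# Synthetic stand-in for patient feature vectors: one dense cluster plus one outlier.
rng = np.random.default_rng(42)
index = ConcentricKNN(dim=5)
for i, x in enumerate(rng.normal(scale=0.1, size=(200, 5))):
    index.add(i, x)
index.add("outlier", np.full(5, 5.0))
scores = {pid: index.kth_neighbor_distance(pid, k=3) for pid in index.points}
```

In this sketch a larger score marks a point in a sparser region of the feature space, so the synthetic outlier ranks above the clustered points. Because insertion and removal only touch the hash buckets of the affected point, the index can be updated as records are added or removed without rebuilding it.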

Full Text