Time series subsequence matching (or signal searching) has importance in a variety of areas in health care informatics. These areas include case-based diagnosis and treatment as well as the discovery of trends and correlations between data. Much of the traditional research in signal searching has focused on high dimensional R-NN matching. However, the results of R-NN are often small and yield minimal information gain; especially with higher dimensional data. This paper proposes a randomized Monte Carlo sampling method to broaden search criteria such that the query results are an accurate sampling of the complete result set. The proposed method is shown both theoretically and empirically to improve information gain. The number of query results are increased by several orders of magnitude over approximate exact matching schemes and fall within a Gaussian distribution. The proposed method also shows excellent performance as the majority of overhead added by sampling can be mitigated through parallelization. Experiments are run on both simulated and real-world biomedical datasets.
Read full abstract