Estimating Dependency, Monitoring and Knowledge Discovery in High-Dimensional Data Streams

Edouard Fouché

doi:10.5445/ir/1000127232

Abstract

Data Mining, the process of extracting knowledge from massive data sets, has a profound impact on our society and now affects nearly every aspect of our lives: from the layout of our local grocery store, to the ads and product recommendations we receive, the availability of treatments for common diseases, the prevention of crime, and the efficiency of industrial production processes. However, Data Mining remains difficult when (1) data is high-dimensional, i.e., has many attributes, and when (2) data comes as a stream. Extracting knowledge from high-dimensional data streams is impractical because one must cope with two orthogonal sets of challenges. On the one hand, the so-called curse of dimensionality degrades the performance of statistical methods and leads to increasingly complex Data Mining problems. On the other hand, the statistical properties of data streams may evolve in unexpected ways, a phenomenon known in the community as concept drift. Thus, one needs to update one's knowledge about the data over time, i.e., to monitor the stream. While previous work addresses high-dimensional data sets and data streams to some extent, the intersection of both has received much less attention. Yet extracting knowledge in this setting is valuable for many industrial applications: identifying patterns in high-dimensional data streams in real time may lead to larger production volumes or reduced operational costs. The goal of this dissertation is to bridge this gap. We first focus on dependency estimation, a fundamental task of Data Mining. Typically, one estimates dependency by quantifying the strength of statistical relationships. We identify the requirements for dependency estimation in high-dimensional data streams and propose a new estimation framework, Monte Carlo Dependency Estimation (MCDE), that fulfils them all. We show that MCDE leads to efficient dependency monitoring.
Then, we generalise the task of monitoring by introducing the Scaling Multi-Armed Bandit (S-MAB) algorithms, which extend the Multi-Armed Bandit (MAB) model. We show that our algorithms can efficiently monitor statistics by leveraging user-specific criteria. Finally, we describe applications of our contributions to Knowledge Discovery. We propose an algorithm, Streaming Greedy Maximum Random Deviation (SGMRD), which exploits our new methods to extract patterns, e.g., outliers, in high-dimensional data streams. We also present a new approach, which we name kj-Nearest Neighbours (kj-NN), to detect outlying documents within massive text corpora. We support our algorithmic contributions with theoretical guarantees, as well as extensive experiments on both synthetic and real-world data, and we demonstrate the benefits of our methods in real-world use cases. Overall, this dissertation establishes fundamental tools for Knowledge Discovery in high-dimensional data streams, which help with many industrial applications, e.g., anomaly detection or predictive maintenance. To facilitate the application of our results and future research, we publicly release our implementations, experiments, and benchmark data via open-source platforms.
