Abstract

Outlier detection in data streams is crucial to successful data mining. However, this task is made increasingly difficult by the enormous growth in the quantity of data generated by the expansion of the Internet of Things (IoT). Recent outlier detection methods based on the density-based local outlier factor (LOF) algorithm do not account for variations in the data distribution over time; for example, a new cluster of data points may emerge in the stream. We therefore present a novel algorithm for streaming data, referred to as time-aware density-based incremental local outlier detection (TADILOF), to overcome this issue. In addition, we have developed a means of estimating the LOF score, termed "approximate LOF," based on historical information retained after the removal of outdated data. Experimental results demonstrate that TADILOF outperforms current state-of-the-art methods in terms of AUC while achieving similar execution times. Moreover, we present an application of the proposed scheme to the development of an air-quality monitoring system.

Highlights

  • The expansion of the Internet of Things (IoT) is increasing the importance of outlier detection in streaming data

  • We evaluated MiLOF, DILOF, and the proposed time-aware density-based incremental local outlier detection (TADILOF) algorithm in terms of AUC and execution time on various datasets

  • We set K to 8, the value used in DILOF [6]


Introduction

The expansion of the Internet of Things (IoT) is increasing the importance of outlier detection in streaming data. The local outlier factor (LOF), proposed in [3], is a well-known density-based algorithm for the detection of local outliers in static data. LOF measures the local deviation of a data point with respect to its K nearest neighbors, where K is a user-defined parameter. Methods of this kind are useful in several applications, such as detecting fraudulent transactions, intrusion detection, direct marketing, and medical diagnostics. To handle data streams, recent algorithms impose a fixed window size to limit the number of data points held in memory, summarizing previous data points; a sketch of this windowed approach is given below. However, these studies base their summaries only on the distribution of previous data; i.e., they do not take the temporal sequence of the data into account. Related work has also addressed parameter reduction for density-based clustering and applied a variety of machine learning and outlier detection approaches to different preprocessing tasks.
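To make the windowed LOF idea concrete, the following is a minimal Python sketch that re-fits scikit-learn's standard LocalOutlierFactor on a fixed-size sliding window at each arrival. This is not the TADILOF algorithm itself (TADILOF updates scores incrementally and summarizes discarded points); the window size W, the synthetic stream, and the outlier threshold are illustrative assumptions.

    from collections import deque

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    K = 8    # number of nearest neighbors; the value used in DILOF [6]
    W = 200  # fixed sliding-window size (an illustrative assumption)

    rng = np.random.default_rng(0)
    window = deque(maxlen=W)  # the oldest point is dropped automatically

    def stream(n=1000):
        # Synthetic stream: one Gaussian cluster with ~2% uniform outliers.
        for _ in range(n):
            if rng.random() < 0.02:
                yield rng.uniform(-6.0, 6.0, size=2)
            else:
                yield rng.normal(0.0, 1.0, size=2)

    for point in stream():
        window.append(point)
        if len(window) <= K:
            continue  # LOF needs more than K points in the window
        lof = LocalOutlierFactor(n_neighbors=K)
        lof.fit(np.asarray(window))
        # negative_outlier_factor_ stores -LOF; the last entry is the newest point.
        score = -lof.negative_outlier_factor_[-1]
        if score > 1.5:  # illustrative threshold; LOF near 1 indicates an inlier
            print(f"possible outlier {point} (LOF = {score:.2f})")

Re-fitting LOF on every arrival is costly, and discarding the oldest points loses their density information outright; the incremental updates and density summaries in methods such as MiLOF, DILOF, and TADILOF are designed to avoid exactly these two problems.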
