Data Summarization Using Sampling Algorithms: Data Stream Case Study

Rayane El Sibai, Kablan Barbar, Raja Chiky, Jacques Bou Abdo, Yousra Chabchoub, Jacques Demerjian

doi:10.1007/978-3-030-43981-1_6

Abstract

Data streams pose a challenge for data processing operations such as query execution and information retrieval. They impose tight constraints on the memory space and execution time available for computation, mainly because of the huge volume of the data and its high arrival rate. For many applications, it is acceptable to generate approximate answers from a small proportion of the data stream, called a “summary.” Sampling algorithms are used to construct such a summary: their purpose is to provide information about a large data set from a representative sample extracted from it. An effective summary of a data stream must be able to answer, approximately, any query, whatever the time period under investigation. In this chapter, we present a survey of these algorithms. First, we introduce the basic concepts of data streams, windowing models, and data stream applications. Next, we review the state of the art of the sampling algorithms used in data stream environments. We classify these algorithms according to the following metrics: number of passes over the data, memory consumption, and skewing ability. Finally, we evaluate the performance of three sampling algorithms in terms of execution time and accuracy.

Full Text