Abstract

The success of Big Data relies fundamentally on the ability of a person (the data scientist) to make sense of and generate insights from this wealth of data. The process of generating actionable insights, called data exploration, is a difficult and time-consuming task. Exploring a big dataset usually requires first generating a small, representative data sample that can be easily plotted, viewed, managed, and interpreted to generate insights. However, the literature on the topic suggests that data scientists use only random sampling on regular-sized datasets, and it is unclear what they do with Big Data. In this work, we first present evidence from a survey that random sampling is the only technique commonly used by data scientists to quickly gain insights from a big dataset, despite theoretical and empirical evidence from the active learning community suggesting the benefits of other sampling techniques. Second, to evaluate and demonstrate the benefits of other sampling techniques, we conducted an online study with 34 data scientists. These scientists performed a data exploration task supporting a classification goal, using data samples drawn from more than 2 million records of editing data from Wikipedia articles and generated with different sampling techniques. The study results demonstrate that sampling techniques other than random sampling can generate insights that help focus on different characteristics of the data, without compromising the quality of the data exploration.
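To make the contrast the abstract draws concrete, here is a minimal, purely illustrative sketch comparing simple random sampling with stratified sampling, one well-known alternative. The abstract does not name the specific techniques the study evaluated, and the dataset below is hypothetical; the point is only that a non-random sample can surface rare record types that a small random sample from a skewed dataset may miss.

```python
import random
from collections import Counter

def random_sample(records, k, seed=0):
    """Simple random sample: every record is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(records, k)

def stratified_sample(records, key, k, seed=0):
    """Stratified sample: split records into groups by `key` and spread
    the sampling budget across groups, so rare groups are represented."""
    rng = random.Random(seed)
    groups = {}
    for r in records:
        groups.setdefault(key(r), []).append(r)
    per_group = max(1, k // len(groups))
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Hypothetical skewed edit log: 95% "minor" edits, 5% "revert" edits.
records = [{"type": "minor"}] * 950 + [{"type": "revert"}] * 50
rnd = random_sample(records, 20)
strat = stratified_sample(records, key=lambda r: r["type"], k=20)
# The stratified sample always contains 10 of each edit type, while a
# 20-record random sample typically contains only about one revert.
```

With two groups and a budget of 20, the stratified sample guarantees both edit types appear, which can steer exploration toward characteristics (here, reverts) that random sampling would under-represent.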
