Abstract

Visual exploration of high-dimensional real-valued datasets is a fundamental task in exploratory data analysis (EDA). Existing projection methods for data visualization use predefined criteria to choose the representation of data. There is a lack of methods that (i) use information on what the user has learned from the data and (ii) show patterns that she does not know yet. We construct a theoretical model where identified patterns can be input as knowledge to the system. The knowledge syntax here is intuitive, such as “this set of points forms a cluster”, and requires no knowledge of maths. This background knowledge is used to find a maximum entropy distribution of the data, after which the user is provided with data projections for which the data and the maximum entropy distribution differ the most, hence showing the user aspects of data that are maximally informative given the background knowledge. We study the computational performance of our model and present use cases on synthetic and real data. We find that the model allows the user to learn information efficiently from various data sources and works sufficiently fast in practice. In addition, we provide an open source EDA demonstrator system implementing our model with tailored interactive visualizations. We conclude that the information theoretic approach to EDA where patterns observed by a user are formalized as constraints provides a principled, intuitive, and efficient basis for constructing an EDA system.
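The maximum entropy step mentioned in the abstract can be written down in its generic textbook form. This is only a sketch: the symbols f_i, c_i, and λ_i are notation introduced here for illustration, under the assumption that each piece of user knowledge is encoded as an expectation constraint whose target value is computed from the observed data.

```latex
% Generic maximum entropy problem (illustrative notation, not taken from the paper).
% f_i : constraint functions encoding the user's knowledge (assumed form)
% c_i : values of those functions observed on the data
\begin{align}
  p^{*} &= \arg\max_{p} \; -\int p(X)\,\log p(X)\,\mathrm{d}X
  \quad \text{subject to} \quad
  \mathbb{E}_{p}\!\left[f_i(X)\right] = c_i, \quad i = 1,\dots,k, \\
  p^{*}(X) &\propto \exp\!\Big(\sum_{i=1}^{k} \lambda_i\, f_i(X)\Big),
\end{align}
% where the multipliers \lambda_i are chosen so that all k constraints hold.
```

When the constraint functions are at most quadratic in the data, this exponential form is Gaussian, which is what makes a whitening-based comparison between the data and the background distribution natural.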

Highlights

  • Ever since Tukey’s pioneering work on exploratory data analysis (EDA) (Tukey 1977), the task of effectively exploring data has remained an art as much as a science

  • We present a novel interactive framework for EDA based on solid theoretical principles and taking into account the updating knowledge of the user

  • The preprocess, whitening, sample, and pca steps always take less than 2 s each and are not reported in the table

Summary

Introduction

Ever since Tukey’s pioneering work on exploratory data analysis (EDA) (Tukey 1977), the task of effectively exploring data has remained an art as much as a science. Modern computational methods for dimensionality reduction, such as Projection Pursuit and manifold learning, allow one to spot complex relations in the data automatically and to present them visually. The intuitive idea is that the computed projection shows the maximal difference between the data and the background distribution (i.e., the belief state of the user). The new projection displayed is the one that is maximally insightful, considering the updated background distribution. We achieve this through the use of a whitening operation (Kessy et al. 2018), which is explained in detail in Sect. The quest to automate the composition of insightful visualizations is important in its own right, as is illustrated in the remainder of the paper.

[Figure: Interactive visual data exploration with subjective feedback. Panels: (a) background distribution, (c) the data in the projection, (e) observed pattern.]
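The whitening idea described in the introduction can be illustrated with a small sketch. It assumes a single Gaussian background distribution with known mean and covariance, and it ranks principal directions of the whitened data by the KL divergence between N(0, σ²) and N(0, 1). The function names `whiten` and `most_informative_directions`, the scoring rule, and the synthetic data are assumptions made for this example, not the paper's implementation.

```python
import numpy as np


def whiten(X, mean, cov):
    """Whiten the rows of X with respect to a Gaussian background N(mean, cov)."""
    # Symmetric inverse square root of the background covariance.
    eigval, eigvec = np.linalg.eigh(cov)
    inv_sqrt = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T
    return (X - mean) @ inv_sqrt


def most_informative_directions(Y, k=2):
    """Return the k principal directions of Y whose variance deviates most from 1."""
    eigval, eigvec = np.linalg.eigh(np.cov(Y, rowvar=False))
    eigval = np.maximum(eigval, 1e-12)  # guard against log(0)
    # Assumed score: KL divergence between N(0, v) and N(0, 1).
    score = 0.5 * (eigval - np.log(eigval) - 1.0)
    order = np.argsort(score)[::-1][:k]
    return eigvec[:, order], eigval[order]


# Synthetic data: variance 9 on axis 0, 0.09 on axis 1, 1 on axis 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 3)) * np.array([3.0, 0.3, 1.0])

# Hypothetical belief state: the user already knows axis 0 is wide,
# but still expects unit variance on axes 1 and 2.
background_mean = np.zeros(3)
background_cov = np.diag([9.0, 1.0, 1.0])

Y = whiten(X, background_mean, background_cov)
W, var = most_informative_directions(Y, k=2)
print("top direction:", np.round(W[:, 0], 2))
print("variance along it after whitening:", round(float(var[0]), 3))
```

In this setup the wide axis 0 is no longer surprising once the background accounts for it, so the top-ranked direction aligns with axis 1, where the data are much narrower than the background expects. Directions with variance either above or below one can therefore both be informative.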

Contributions and outline of the paper
Methods
Preliminaries
Constraints and background distribution
Updating the background distribution
Update rules
About convergence
Whitening operation for finding the most informative visualization
A summary of the proposed interactive framework for EDA
Experiments
Runtime experiment
British National Corpus data
UCI image segmentation data
Proof-of-concept system sideR
Related work
Conclusions