Abstract

Data scientists are tasked with obtaining insights from data. However, suitable data is often not immediately at hand, and there may be many potentially relevant datasets in a data lake or in open data repositories. As a result, data discovery and exploration are necessary, but often time-consuming, steps in a data analysis workflow. Data discovery is the process of identifying datasets that may meet an information need. Data exploration is the process of understanding the properties of candidate datasets and the relationships between them. Data discovery and data exploration often go hand in hand and benefit from tool support. This article surveys research areas that can contribute to data discovery and exploration, particularly considering dataset search, data navigation, data annotation and schema inference. For each of these areas, we identify key dimensions that can be used to characterize approaches and the values they can hold, and apply the dimensions to describe and compare prominent results. In addition, by surveying several adjacent areas that are often considered in isolation, we identify recurring techniques and alternative approaches to related challenges, thereby placing results within a wider context than is generally considered.

