Unit Testing Data with Deequ

Sebastian Schelter,Felix Biessmann,Andrey Taptunov,Tammo Rukat,Pierre Brunelle,Phillipp Schmidt,Stephan Seufert,Dustin Lange

doi:10.1145/3299869.3320210

Abstract

Modern companies and institutions rely on data to guide every single decision. Missing or incorrect information seriously compromises any decision process. We demonstrate "Deequ", an Apache Spark-based library for automating the verification of data quality at scale. This library provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables "unit tests for data". Deequ is available as open source, meets the requirements of production use cases at Amazon, and scales to datasets with billions of records if the constraints to evaluate are chosen carefully. Our demonstration walks attendees through a fictitious business use case of validating daily product reviews from a public dataset, and is executed in a proprietary interactive notebook environment. We show attendees how to define data unit tests from automatically suggested constraints and how to create customized tests. Additionally, we demonstrate how to apply Deequ to validate incrementally growing datasets, and give examples of how to configure anomaly detection algorithms on time series of data quality metrics to further automate the data validation.

Full Text