Automated Provenance-Based Screening of ML Data Preparation Pipelines

Sebastian Schelter, Shubha Guha, Stefan Grafberger

doi:10.1007/s13222-024-00483-4

Abstract

Software systems that learn from data via machine learning (ML) are being deployed in increasing numbers in real-world application scenarios. These ML applications contain complex data preparation pipelines, which take several raw inputs and integrate, filter, and encode them to produce the input data for model training. This is in stark contrast to academic studies and benchmarks, which typically work with static, already prepared datasets. It is a difficult and tedious task to ensure at development time that the data preparation pipelines for such ML applications adhere to sound experimentation practices and compliance requirements. Identifying potential correctness issues currently requires a high degree of discipline, knowledge, and time from data scientists, and they often only implement one-off solutions, based on specialised frameworks that are incompatible with the rest of the data science ecosystem. We discuss how to model data preparation pipelines as dataflow computations from relational inputs to matrix outputs, and propose techniques that use record-level provenance to automatically screen these pipelines for many common correctness issues (e.g., data leakage between train and test data). We design a prototypical system to screen such data preparation pipelines and furthermore enable the automatic computation of important metadata such as group fairness metrics. We discuss how to extract the semantics and the data provenance of common artifacts in supervised learning tasks and evaluate our system on several example pipelines with real-world data.
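To make the idea of record-level provenance screening concrete, the following is a minimal sketch in plain Python and not the authors' system. It assumes that every row carries the set of raw input records it was derived from, that relational operators propagate these sets, and that a leakage check is an intersection of the provenance of the train and test outputs. All names here (`Row`, `load`, `filter_rows`, `join`, `shared_tokens`) and the toy `customers` and `orders` tables are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

# Hypothetical provenance token: (source name, row index in that source).
Token = Tuple[str, int]


@dataclass(frozen=True)
class Row:
    values: Dict[str, object]
    provenance: FrozenSet[Token]


def load(source: str, records: List[dict]) -> List[Row]:
    # Each raw record starts with exactly one token pointing at itself.
    return [Row(dict(r), frozenset({(source, i)})) for i, r in enumerate(records)]


def filter_rows(rows: List[Row], predicate: Callable[[dict], bool]) -> List[Row]:
    # A filter keeps or drops rows; provenance of kept rows is unchanged.
    return [r for r in rows if predicate(r.values)]


def join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    # An equi-join output row depends on both inputs, so it carries the
    # union of their provenance sets.
    index: Dict[object, List[Row]] = {}
    for r in right:
        index.setdefault(r.values[key], []).append(r)
    return [
        Row({**l.values, **r.values}, l.provenance | r.provenance)
        for l in left
        for r in index.get(l.values[key], [])
    ]


def shared_tokens(train: List[Row], test: List[Row], source: str) -> Set[Token]:
    # Raw records of `source` that contributed to both train and test rows.
    train_ids = {t for row in train for t in row.provenance if t[0] == source}
    test_ids = {t for row in test for t in row.provenance if t[0] == source}
    return train_ids & test_ids


customers = load("customers", [{"cid": 1, "age": 34}, {"cid": 2, "age": 51}])
orders = load("orders", [
    {"cid": 1, "amount": 10.0},
    {"cid": 1, "amount": 25.0},
    {"cid": 2, "amount": 7.5},
    {"cid": 2, "amount": 0.0},
])

prepared = join(filter_rows(orders, lambda v: v["amount"] > 0), customers, "cid")

# A naive row-level split: every other prepared row goes to the training set.
train, test = prepared[0::2], prepared[1::2]

orders_overlap = shared_tokens(train, test, "orders")
customers_overlap = shared_tokens(train, test, "customers")
```

In this toy setup no order record reaches both splits, but the customer with `cid` 1 contributes to rows on both sides, which is the kind of overlap a provenance-based check can surface without inspecting the pipeline code itself. Whether such an overlap counts as leakage depends on the task, so a real screening tool would likely report it for review rather than reject the pipeline outright.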

Journal: Datenbank-Spektrum
Publication Date: Sep 30, 2024
License type: CC BY 4.0

