Abstract
Software systems that learn from data via machine learning (ML) are being deployed in increasing numbers in real-world application scenarios. These ML applications contain complex data preparation pipelines, which take several raw inputs and integrate, filter and encode them to produce the input data for model training. This is in stark contrast to academic studies and benchmarks, which typically work with static, already prepared datasets. It is a difficult and tedious task to ensure at development time that the data preparation pipelines for such ML applications adhere to sound experimentation practices and compliance requirements. Identifying potential correctness issues currently requires a high degree of discipline, knowledge, and time from data scientists, and they often only implement one-off solutions, based on specialised frameworks that are incompatible with the rest of the data science ecosystem.

We discuss how to model data preparation pipelines as dataflow computations from relational inputs to matrix outputs, and propose techniques that use record-level provenance to automatically screen these pipelines for many common correctness issues (e.g., data leakage between train and test data). We design a prototypical system to screen such data preparation pipelines and furthermore enable the automatic computation of important metadata such as group fairness metrics. We discuss how to extract the semantics and the data provenance of common artifacts in supervised learning tasks and evaluate our system on several example pipelines with real-world data.
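To make the idea of record-level provenance concrete, the following is a minimal sketch, not the system described above. It assumes that every row carries the set of source records it was derived from, and that leakage can be flagged when the train and test sets trace back to the same source record. All names (`load`, `join`, `leaked_record_ids`, the `reviews` and `products` inputs) are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Provenance of a row: the set of (source_name, source_row_index) pairs it derives from.
Provenance = FrozenSet[Tuple[str, int]]
Row = Tuple[Dict[str, object], Provenance]


def load(source_name: str, records: List[Dict[str, object]]) -> List[Row]:
    """Tag each raw input record with a provenance set containing only itself."""
    return [(dict(rec), frozenset({(source_name, i)})) for i, rec in enumerate(records)]


def join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Equi-join two inputs; an output row inherits the provenance of both sides."""
    out: List[Row] = []
    for lrec, lprov in left:
        for rrec, rprov in right:
            if lrec[key] == rrec[key]:
                out.append(({**lrec, **rrec}, lprov | rprov))
    return out


def leaked_record_ids(train: List[Row], test: List[Row], source_name: str) -> Set[int]:
    """Return indices of records from `source_name` that reach both train and test."""
    def ids(rows: List[Row]) -> Set[int]:
        return {rid for _, prov in rows for (src, rid) in prov if src == source_name}
    return ids(train) & ids(test)


# Hypothetical raw inputs (illustrative data only).
reviews = load("reviews", [
    {"product_id": 1, "rating": 5},
    {"product_id": 2, "rating": 1},
    {"product_id": 1, "rating": 4},
    {"product_id": 3, "rating": 2},
])
products = load("products", [
    {"product_id": 1, "category": "books"},
    {"product_id": 2, "category": "games"},
    {"product_id": 3, "category": "books"},
])

joined = join(reviews, products, "product_id")

# A deliberately faulty split: the row at position 2 ends up in both sets.
train, test = joined[:3], joined[2:]
leaks = leaked_record_ids(train, test, "reviews")
```

Because provenance is propagated through the join, the check works on the prepared data without re-running or instrumenting the split logic itself. A real pipeline would also have to track provenance through filters, encoders, and the final conversion to matrices, which this sketch leaves out.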