Efficiently Mitigating the Impact of Data Drift on Machine Learning Pipelines

Sijie Dong, Qitong Wang, Soror Sahri, Divesh Srivastava, Themis Palpanas

doi:10.14778/3681954.3681984

Abstract

Despite the increasing success of Machine Learning (ML) techniques in real-world applications, their maintenance over time remains challenging. In particular, the prediction accuracy of deployed ML models can suffer when the serving data changes significantly from the training data over time, a phenomenon known as data drift. Traditional data drift solutions primarily focus on detecting drift and then retraining the ML models, but do not discern whether the detected drift is harmful to model performance. In this paper, we observe that not all data drifts lead to degradation in prediction accuracy. We then introduce a novel approach for identifying portions of data distributions in serving data where drift can be potentially harmful to model performance, which we term Data Distributions with Low Accuracy (DDLA). Our approach, using decision trees, precisely pinpoints low-accuracy zones within ML models, especially black-box models. By focusing on these DDLAs, we effectively assess the impact of data drift on model performance and make informed decisions in the ML pipeline. In contrast to existing data drift techniques, we advocate for model retraining only in cases of harmful drifts that detrimentally affect model performance. Through extensive experimental evaluations on various datasets and models, we demonstrate that our approach significantly improves cost-efficiency over baselines while achieving comparable accuracy.
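The idea of locating low-accuracy regions with a decision tree can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a tiny hand-written tree fitted on 0/1 correctness flags of a black-box model over validation data, and a simple rule that calls a drift harmful when the share of serving rows falling into low-accuracy regions grows by more than a tolerance. All names (`low_accuracy_regions`, `drift_is_harmful`, `acc_threshold`, `tolerance`) and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]
# A region is a conjunction of (feature index, "<=" or ">", threshold) tests.
Region = List[Tuple[int, str, float]]


def _sse(flags: List[int]) -> float:
    """Sum of squared errors of 0/1 correctness flags around their mean."""
    if not flags:
        return 0.0
    mean = sum(flags) / len(flags)
    return sum((f - mean) ** 2 for f in flags)


def low_accuracy_regions(X: List[Row], correct: List[int], max_depth: int = 2,
                         min_leaf: int = 5, acc_threshold: float = 0.7) -> List[Region]:
    """Grow a small tree on correctness flags; return leaves with accuracy below acc_threshold."""
    regions: List[Region] = []

    def grow(idx: List[int], depth: int, path: Region) -> None:
        flags = [correct[i] for i in idx]
        best = None
        if depth < max_depth:
            for j in range(len(X[0])):
                # Candidate thresholds: every observed value except the largest.
                for t in sorted({X[i][j] for i in idx})[:-1]:
                    left = [i for i in idx if X[i][j] <= t]
                    right = [i for i in idx if X[i][j] > t]
                    if len(left) < min_leaf or len(right) < min_leaf:
                        continue
                    score = (_sse([correct[i] for i in left])
                             + _sse([correct[i] for i in right]))
                    if best is None or score < best[0]:
                        best = (score, j, t, left, right)
        # Stop when no valid split exists or the best split does not reduce the error.
        if best is None or best[0] >= _sse(flags):
            if sum(flags) / len(flags) < acc_threshold:
                regions.append(path)
            return
        _, j, t, left, right = best
        grow(left, depth + 1, path + [(j, "<=", t)])
        grow(right, depth + 1, path + [(j, ">", t)])

    grow(list(range(len(X))), 0, [])
    return regions


def in_region(x: Row, region: Region) -> bool:
    """True if row x satisfies every test of the region."""
    return all(x[j] <= t if op == "<=" else x[j] > t for j, op, t in region)


def low_accuracy_share(X: List[Row], regions: List[Region]) -> float:
    """Fraction of rows that fall into at least one low-accuracy region."""
    return sum(any(in_region(x, r) for r in regions) for x in X) / len(X)


def drift_is_harmful(X_val: List[Row], X_serving: List[Row],
                     regions: List[Region], tolerance: float = 0.1) -> bool:
    """Assumed rule: harmful if the low-accuracy share grows by more than tolerance."""
    growth = low_accuracy_share(X_serving, regions) - low_accuracy_share(X_val, regions)
    return growth > tolerance


# Toy validation set (hypothetical): the black-box model is wrong whenever x >= 0.7.
X_val = [[i / 20] for i in range(20)]
correct = [1 if x[0] < 0.7 else 0 for x in X_val]
regions = low_accuracy_regions(X_val, correct)

# Two drifted serving batches: one moves away from the weak region, one moves into it.
benign_batch = [[0.1 + i / 100] for i in range(20)]
harmful_batch = [[0.6 + i / 50] for i in range(20)]

print("low-accuracy regions:", regions)
print("benign drift harmful?", drift_is_harmful(X_val, benign_batch, regions))
print("harmful drift harmful?", drift_is_harmful(X_val, harmful_batch, regions))
```

Both serving batches differ from the validation data, but only the second one shifts mass into the region where the model is known to be inaccurate, which is the case where retraining would be worth its cost under this sketch's rule.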

Published in: Proceedings of the VLDB Endowment
