Selective data acquisition in the wild for model charging

Chengliang Chai, Yuyu Luo, Jiabin Liu, Guoliang Li, Nan Tang

doi:10.14778/3523210.3523223

Abstract

The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world supervised machine learning (ML) tasks. In this paper, we study a new problem, namely selective data acquisition in the wild for model charging: given a supervised ML task and data in the wild (e.g., enterprise data warehouses, online data repositories, and data markets), the problem is to select labeled data points from the data in the wild as additional training data that can help the ML task. It consists of two steps (Fig. 1). The first step is to discover relevant datasets (e.g., tables with a similar relational schema), which results in a set of candidate datasets. Because these candidate datasets come from different sources and may follow different distributions, not all the data points they contain can help. The second step is to select which data points from these candidate datasets should be used. We build an end-to-end solution. For step 1, we piggyback on off-the-shelf data discovery tools. Technically, our focus is on step 2, for which we propose a solution framework called AutoData. It first clusters all data points from the candidate datasets such that each cluster contains similar data points from different sources. It then iteratively picks which cluster to use, samples data points (i.e., a mini-batch) from the picked cluster, evaluates the mini-batch, and then revises the search criteria by learning from the feedback (i.e., the reward) based on the evaluation. We propose a multi-armed bandit-based solution and a Deep Q-Network-based reinforcement learning solution. Experiments using both relational and image datasets show the effectiveness of our solutions.
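
The iterative loop described in the abstract (pick a cluster, sample a mini-batch, evaluate it, learn from the reward) can be illustrated with a standard UCB1 bandit in which each cluster is an arm. This is only a minimal sketch under stated assumptions, not the AutoData algorithm itself: the abstract does not say which bandit policy is used, so UCB1 is swapped in here, and the names `select_with_ucb` and `evaluate_batch`, the rule of keeping a mini-batch only when its reward is positive, and the toy reward function are all hypothetical.

```python
import math
import random


def select_with_ucb(clusters, evaluate_batch, rounds=20, batch_size=2, c=1.0, seed=0):
    """Pick mini-batches from clusters with UCB1 (an assumed policy).

    clusters: list of lists of data points (one list per cluster).
    evaluate_batch: hypothetical callable, mini-batch -> float reward.
    Returns the kept data points and the number of pulls per cluster.
    """
    rng = random.Random(seed)
    pools = [list(cluster) for cluster in clusters]  # unsampled points per cluster
    for pool in pools:
        rng.shuffle(pool)
    pulls = [0] * len(pools)
    reward_sum = [0.0] * len(pools)
    selected = []

    for t in range(1, rounds + 1):
        live = [i for i, pool in enumerate(pools) if pool]
        if not live:
            break  # every cluster is exhausted

        def ucb(i):
            # Try each cluster once, then trade off mean reward and uncertainty.
            if pulls[i] == 0:
                return float("inf")
            mean = reward_sum[i] / pulls[i]
            return mean + c * math.sqrt(2.0 * math.log(t) / pulls[i])

        arm = max(live, key=ucb)
        take = min(batch_size, len(pools[arm]))
        batch = [pools[arm].pop() for _ in range(take)]

        reward = evaluate_batch(batch)
        pulls[arm] += 1
        reward_sum[arm] += reward
        if reward > 0:  # assumption: keep only batches that appear to help
            selected.extend(batch)
    return selected, pulls


# Toy data: cluster 0 holds helpful points, cluster 1 holds unhelpful ones.
toy_clusters = [
    [(i, "good") for i in range(40)],
    [(i, "bad") for i in range(40)],
]


def toy_reward(batch):
    # Hypothetical stand-in for the change in validation accuracy.
    return sum(1 for _, tag in batch if tag == "good") / len(batch)


chosen, pulls_per_cluster = select_with_ucb(toy_clusters, toy_reward)
print(len(chosen), pulls_per_cluster)
```

In a real setting `evaluate_batch` would retrain or fine-tune the model with the mini-batch added and measure the change on a validation set, so the reward would depend on the data already selected; the toy reward above ignores that dependence.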


Journal: Proceedings of the VLDB Endowment	Publication Date: Mar 1, 2022
Citations: 27

Similar Papers

Intelli-Eye: An UAV Tracking System with Optimized Machine Learning Tasks Offloading
Bo Yang ... Timothy Kroecker
29 Apr 2019

Measuring the Effect of Categorical Encoders in Machine Learning Tasks Using Synthetic Data
Eric Valdez-Valenzuela ... Angel Kuri-Morales
01 Jan 2020

Toward virtual data scientist with visual means
Boris Kovalerchuk ... Michael Kovalerchuk
01 May 2017

Modeling of Supervised Machine Learning using Mechanism of Quantum Computing.
Mukta Nivelkar ... S G Bhirud
Journal of Physics: Conference Series | VOL. 2161
01 Jan 2021
