Towards Benchmarking Feature Type Inference for AutoML Platforms

Vraj Shah, Jonathan Lacanlale, Kevin Yang, Arun Kumar, Premanand Kumar

doi:10.1145/3448016.3457274

Abstract

The paradigm of AutoML has created an opportunity to enable ML for the masses. Emerging industrial-scale cloud AutoML platforms aim to automate the end-to-end ML workflow. While many works have looked into automated feature engineering, model selection, or hyper-parameter search in AutoML, little work has studied a crucial step that serves as the entry point to this workflow: ML feature type inference. The semantic gap between attribute types (e.g., strings, numbers) in databases/files and ML feature types (e.g., Numeric, Categorical) necessitates type inference. In this work, we formalize and standardize this task by creating the first-ever labeled benchmark dataset, which we use to objectively evaluate existing AutoML tools. Our dataset has 9921 examples and a 9-class label vocabulary. Our labeled data also offers an alternative to existing rule-based or syntax-based approaches for automating this task: use ML itself to predict feature types. We collate a benchmark suite of 30 classification and regression tasks to assess the importance of type inference for downstream models. Empirical comparison on our labeled data shows that an ML-based approach improves accuracy in identifying feature types by 14% on average and by up to 38% compared to prominent industrial tools. Our downstream benchmark suite reveals that the ML-based approach outperforms existing industrial-strength tools for 47 out of 60 downstream models. We release our labeled dataset, models, and downstream benchmarks in a public repository with a leaderboard.
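The semantic gap described in the abstract can be illustrated with a small sketch. It is not the authors' method or any tool's actual rule set: the function names `syntactic_type` and `column_stats`, the two sample columns, and the choice of statistics are all hypothetical, chosen only to show why a purely syntax-based rule can mislabel a column and what kind of per-column signals an ML-based approach could use instead.

```python
def syntactic_type(values):
    """Hypothetical syntax-based rule: Numeric if every value parses as a float."""
    try:
        for v in values:
            float(v)
        return "Numeric"
    except ValueError:
        return "Categorical"


def column_stats(values):
    """Hypothetical descriptive statistics that a learned model could take as input."""
    n = len(values)
    distinct = len(set(values))
    return {
        "num_values": n,
        "num_distinct": distinct,
        "distinct_ratio": distinct / n,
    }


# Illustrative columns (made up for this sketch, not taken from the benchmark).
zip_codes = ["92093", "92122", "92093", "92037", "92122", "92093"]
salaries = ["52000.5", "61000", "48500.25", "75000", "58250.75", "69900"]

# Both columns look numeric to the syntactic rule, although zip codes are
# semantically categorical: arithmetic on them is meaningless.
print(syntactic_type(zip_codes), syntactic_type(salaries))

# The statistics differ between the two columns, which is the kind of signal
# a classifier trained on labeled columns might learn to exploit.
print(column_stats(zip_codes))
print(column_stats(salaries))
```

In this sketch the syntactic rule assigns the same type to both columns, while the repeated values in the zip code column show up in its lower distinct ratio. A real ML-based approach would need many labeled columns and richer features than these three counts.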
