Abstract
The paradigm of AutoML has created an opportunity to enable ML for the masses. Emerging industrial-scale cloud AutoML platforms aim to automate the end-to-end ML workflow. While many works have looked into automated feature engineering, model selection, or hyper-parameter search in AutoML, little work has studied a crucial step that serves as an entry point to this workflow: ML feature type inference. The semantic gap between attribute types (e.g., strings, numbers) in databases/files and ML feature types (e.g., Numeric, Categorical) necessitates type inference. In this work, we formalize and standardize this task by creating the first ever benchmark labeled dataset, which we use to objectively evaluate existing AutoML tools. Our dataset has 9921 examples and a 9-class label vocabulary. Our labeled data also offers an alternative approach to automate this task than existing rule-based or syntax-based approaches: use ML itself to predict feature types. We collate a benchmark suite of 30 classification and regression tasks to assess the importance of type inference for downstream models. Empirical comparison on our labeled data shows that an ML-based approach delivers a lift of an average 14% and up to 38% in accuracy for identifying feature types compared to prominent industrial tools. Our downstream benchmark suite reveals that the ML-based approach outperforms existing industrial-strength tools for 47 out of 60 downstream models. We release our labeled dataset, models, and downstream benchmarks in a public repository with a leaderboard.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.