FeatureHub: Towards Collaborative Data Science

Micah J Smith, Roy Wedge, Kalyan Veeramachaneni

doi:10.1109/dsaa.2017.66

Abstract

Feature engineering is a critical step in a successful data science pipeline. This step, in which raw variables are transformed into features ready for inclusion in a machine learning model, can be one of the most challenging aspects of a data science effort. We propose a new paradigm for feature engineering in a collaborative framework and instantiate this idea in a platform, FeatureHub. In our approach, independent data scientists collaborate on a feature engineering task, viewing and discussing each other's features in real time. Feature engineering source code created by independent data scientists is then integrated into a single predictive machine learning model. Our platform includes an automated machine learning backend that abstracts model training, selection, and tuning, allowing users to focus on feature engineering while still receiving immediate feedback on the performance of their features. We use a tightly integrated forum, native feature discovery APIs, and targeted compensation mechanisms to facilitate and incentivize collaboration among data scientists. This approach can reduce the redundant work that arises when data scientists work independently or competitively, while decreasing time to task completion. In experimental results, automatically generated models using crowdsourced features show performance within 0.03 or 0.05 points of winning submissions, with minimal human oversight.

Full Text