AVATAR - Machine Learning Pipeline Evaluation Using Surrogate Model

Katarzyna Musial, Marc-André Zöller, Tomasz Maszczyk, Bogdan Gabrys, Tien-Dung Nguyen

doi:10.1007/978-3-030-44584-3_28

Open Access

Publication Date: Jan 1, 2020
Citations: 10
License type: CC BY 4.0

Affiliation: University of Technology Sydney, Software (Germany)

Abstract

The evaluation of machine learning (ML) pipelines is essential during automatic ML pipeline composition and optimisation. Previous methods, such as the Bayesian-based and genetic-based optimisation implemented in Auto-Weka, Auto-sklearn and TPOT, evaluate pipelines by executing them. Pipeline composition and optimisation with these methods therefore requires a tremendous amount of time, which prevents them from exploring complex pipelines to find better predictive models. To further explore this research challenge, we have conducted experiments showing that many of the generated pipelines are invalid, and that it is unnecessary to execute them to find out whether they are good pipelines. To address this issue, we propose a novel method to evaluate the validity of ML pipelines using a surrogate model (AVATAR). The AVATAR accelerates automatic ML pipeline composition and optimisation by quickly ignoring invalid pipelines. Our experiments show that the AVATAR is more efficient at evaluating complex pipelines than the traditional evaluation approaches that require their execution.

Highlights

Automatic machine learning (AutoML) has been studied to automate the process of data analytics to collect and integrate data, compose and optimise ML pipelines, and deploy and maintain predictive models [1–3].
Because the evaluation of ML pipelines is expensive in certain cases in the context of AutoML, we propose the AVATAR to speed up the process by evaluating their surrogate pipelines.
The surrogate pipeline keeps the structure of the original pipeline, but replaces the datasets in the form of data matrices with matrices of transformed features, and the ML algorithms with transition functions that calculate the output from the input tokens.
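
The surrogate idea in the highlights above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the feature names ("missing_values", "numeric_attributes"), the SurrogateComponent type, its "forbidden" field and the evaluate_surrogate function are all hypothetical, introduced only to show how a pipeline might be checked by passing feature tokens through transition functions instead of executing the ML algorithms on real data.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence

# Hypothetical stand-in for a "matrix of transformed features":
# the dataset is summarised by a set of feature tokens, not by its rows.
Features = FrozenSet[str]


@dataclass(frozen=True)
class SurrogateComponent:
    """Hypothetical surrogate of one pipeline component."""
    name: str
    # Feature tokens the component cannot handle (an assumption of this sketch).
    forbidden: FrozenSet[str]
    # Transition function: maps input feature tokens to output feature tokens.
    transition: Callable[[Features], Features]


def evaluate_surrogate(pipeline: Sequence[SurrogateComponent],
                       features: Features) -> bool:
    """Return True if every component accepts the tokens it receives."""
    for component in pipeline:
        if component.forbidden & features:
            # The component would fail on this input, so the pipeline is invalid.
            return False
        features = component.transition(features)
    return True


# Illustrative components (names and behaviour are assumptions).
imputer = SurrogateComponent(
    name="imputer",
    forbidden=frozenset(),
    transition=lambda f: f - {"missing_values"},  # imputation removes the token
)
classifier = SurrogateComponent(
    name="classifier",
    forbidden=frozenset({"missing_values"}),
    transition=lambda f: f,  # simplified: predictions are not modelled
)

data = frozenset({"missing_values", "numeric_attributes"})

print(evaluate_surrogate([imputer, classifier], data))  # True
print(evaluate_surrogate([classifier], data))           # False
```

The check only manipulates small sets of tokens, so its cost does not depend on the size of the dataset, which is the property that would let invalid pipelines be discarded without executing them.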

Summary

Introduction

Automatic machine learning (AutoML) has been studied to automate the process of data analytics to collect and integrate data, compose and optimise ML pipelines, and deploy and maintain predictive models [1–3]. Although many existing studies have proposed methods to tackle the problem of pipeline composition and optimisation [2,4–9], these methods have two main drawbacks. Firstly, the pipelines' structures, which define the execution order of the pipeline components, use fixed templates [2,5]. Although fixed structures can reduce the number of invalid pipelines generated during composition and optimisation, they limit the exploration of promising pipelines, which may have a variety of structures. Secondly, while methods based on evolutionary algorithms [4] introduce randomness into the pipelines' structures using the concept of evolution, this randomness tends to construct more invalid pipelines than valid ones.
