Abstract

Machine learning is often confronted with the problem of learning prediction models from a set of observed data points. Given an expressive data set for the problem at hand, the effectiveness of powerful models and learning algorithms hinges on setting the right configuration for both. Unfortunately, the performance gap between well-chosen and poorly chosen configurations is large, which turns finding the right configuration into an additional problem that is typically mastered only by experienced practitioners. In this thesis, we address the problem of hyperparameter optimization for machine learning and present ways to solve it.

We first introduce the problem of supervised machine learning. We then discuss numerous examples of hyperparameter configurations that must be chosen before a model is learned. Afterwards, we introduce methods for finding the right configurations, in particular methods that follow the scheme of Bayesian optimization, a framework for optimizing black-box functions. A black box is a function whose output for a given input can only be observed by running a costly procedure. In black-box optimization, so-called surrogate models are usually learned to reconstruct the observations and to predict the performance of unobserved configurations. Fortunately, recent results show that transferring knowledge across problems, for example by learning surrogates across different data sets solved by the same model class, is a promising direction.

We tackle the problem of hyperparameter optimization in two main ways. First, we cast hyperparameter optimization as a recommendation problem, in which we learn data set features, as well as their interactions with hyperparameter configurations, as latent features in a factorization-based approach. We build a surrogate model that combines the expressiveness of neural networks with the ability of factorization machines to learn latent embeddings. Second, as the amount of meta-knowledge grows every day, surrogate models need to be scalable. We consider Gaussian processes, as they are themselves hyperparameter-free and perform very well in most hyperparameter optimization settings. Unfortunately, they do not scale well, since inference requires inverting a matrix whose size grows with the number of data points. We present several ways of simplifying a Gaussian process by using an ensemble of Gaussian process experts, which is much faster to learn thanks to its parallelization properties while remaining very competitive in performance.

We conclude the thesis by discussing the aspect of learning across problems in a broader sense than learning across different data sets. By learning hyperparameter performance across different models, we show that the proposed algorithms can also handle model choice. Additionally, we show that hyperparameter performance can even be transferred across different problem tasks, for example from classification to regression.
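To make the Bayesian optimization scheme described above concrete, the following Python sketch runs a minimal loop with a Gaussian process surrogate and the expected-improvement acquisition function. The black box evaluate_config is a stand-in for training a model and reporting its validation loss; it and all other names here are illustrative, not code from the thesis.

    # Minimal Bayesian optimization loop: fit a GP surrogate to observed
    # configurations, pick the candidate with the highest expected improvement,
    # evaluate it, and repeat.
    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor

    def evaluate_config(x):
        # Placeholder black box: in practice, train a model with the
        # hyperparameter configuration x and return its validation loss.
        return np.sin(3 * x[0]) + 0.1 * x[0] ** 2

    def expected_improvement(mu, sigma, best):
        # Expected improvement for minimization; the floor on sigma avoids
        # division by zero at already-observed configurations.
        sigma = np.maximum(sigma, 1e-9)
        z = (best - mu) / sigma
        return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    rng = np.random.default_rng(0)
    candidates = rng.uniform(-2, 2, size=(500, 1))   # candidate configurations
    X = rng.uniform(-2, 2, size=(3, 1))              # initial random design
    y = np.array([evaluate_config(x) for x in X])

    for _ in range(20):
        surrogate = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        mu, sigma = surrogate.predict(candidates, return_std=True)
        ei = expected_improvement(mu, sigma, y.min())
        x_next = candidates[np.argmax(ei)]           # most promising configuration
        X = np.vstack([X, x_next])
        y = np.append(y, evaluate_config(x_next))

    print("best configuration:", X[np.argmin(y)], "loss:", y.min())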
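The factorization-based surrogate can be pictured as follows. Since the abstract only states that latent data set embeddings are combined with a neural network, the architecture below, a learned embedding per data set concatenated with the hyperparameter vector and fed through a small MLP, is an assumed illustration rather than the model proposed in the thesis.

    # Hypothetical factorization-style surrogate: a latent embedding per data
    # set (as in a factorization machine) feeds a neural network together with
    # the hyperparameter configuration.
    import torch
    import torch.nn as nn

    class FactorizedSurrogate(nn.Module):
        def __init__(self, n_datasets, n_hyperparams, latent_dim=8, hidden=32):
            super().__init__()
            # One latent vector per data set, learned jointly with the network.
            self.dataset_embedding = nn.Embedding(n_datasets, latent_dim)
            self.net = nn.Sequential(
                nn.Linear(latent_dim + n_hyperparams, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, dataset_id, hyperparams):
            z = self.dataset_embedding(dataset_id)   # (batch, latent_dim)
            return self.net(torch.cat([z, hyperparams], dim=1)).squeeze(1)

    # Toy usage: predict the performance of one configuration on data set 3.
    model = FactorizedSurrogate(n_datasets=10, n_hyperparams=4)
    score = model(torch.tensor([3]), torch.randn(1, 4))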
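The scalability argument for Gaussian process experts can also be sketched. Exact GP inference costs O(n^3) because of the kernel matrix inversion, whereas fitting k experts on disjoint subsets costs roughly k * O((n/k)^3) and parallelizes trivially. The precision-weighted product-of-experts aggregation below is one standard combination rule, not necessarily the exact scheme analyzed in the thesis.

    # Product-of-experts GP ensemble: each expert sees only a subset of the
    # data, so it inverts a much smaller kernel matrix, and the experts are
    # independent and can be fitted in parallel.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def fit_experts(X, y, n_experts):
        # Disjoint random partition of the data, one GP per block.
        idx = np.array_split(np.random.permutation(len(X)), n_experts)
        return [GaussianProcessRegressor(normalize_y=True).fit(X[i], y[i]) for i in idx]

    def predict_poe(experts, X_test):
        # Combine expert predictions with precision-weighted averaging.
        mus, sigmas = zip(*(e.predict(X_test, return_std=True) for e in experts))
        precisions = [1.0 / np.maximum(s, 1e-9) ** 2 for s in sigmas]
        total_precision = np.sum(precisions, axis=0)
        mu = np.sum([p * m for p, m in zip(precisions, mus)], axis=0) / total_precision
        return mu, np.sqrt(1.0 / total_precision)

    X = np.random.uniform(-3, 3, size=(600, 2))
    y = np.sin(X[:, 0]) * np.cos(X[:, 1])
    experts = fit_experts(X, y, n_experts=6)   # 6 x O((n/6)^3) instead of O(n^3)
    mu, sigma = predict_poe(experts, np.zeros((1, 2)))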
