Abstract

Encoding domain knowledge into the prior over the high-dimensional weight space of a neural network is challenging but essential in applications with limited data and weak signals. Two types of domain knowledge are commonly available in scientific applications: (1) feature sparsity, the fraction of features deemed relevant, and (2) the signal-to-noise ratio, quantified, for instance, as the proportion of variance explained. We show how to encode both types of domain knowledge into the widely used Gaussian scale mixture priors with Automatic Relevance Determination. Specifically, we propose a new joint prior over the local (i.e., feature-specific) scale parameters that encodes knowledge about feature sparsity, and a Stein gradient optimization to tune the hyperparameters so that the distribution induced on the model's proportion of variance explained matches the prior distribution. We show empirically that the new prior improves prediction accuracy compared to existing neural network priors on publicly available datasets and in a genetics application where signals are weak and sparse, often outperforming even computationally intensive cross-validation for hyperparameter tuning.
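To make the induced distribution over the proportion of variance explained (PVE) concrete, the following minimal sketch (Python/NumPy) estimates it by Monte Carlo under an ARD-style Gaussian scale-mixture prior: sample weights from the prior, push the inputs through the network, and compare the signal variance to the noise variance. The one-hidden-layer architecture, half-Cauchy local scales, and all function names are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_pve(X, sigma_noise=1.0, n_hidden=50, n_draws=200):
    """Monte Carlo estimate of the prior distribution over the PVE induced by an
    ARD Gaussian scale-mixture prior on a one-hidden-layer network (illustrative
    setup, not the paper's exact model)."""
    n, p = X.shape
    pves = []
    for _ in range(n_draws):
        # Local (feature-specific) scales: a half-Cauchy scale mixture, one per input feature.
        local_scale = np.abs(rng.standard_cauchy(size=p))
        # First-layer weights drawn with per-feature (ARD) scales.
        W1 = rng.normal(0.0, 1.0, size=(p, n_hidden)) * local_scale[:, None]
        b1 = rng.normal(0.0, 1.0, size=n_hidden)
        w2 = rng.normal(0.0, 1.0 / np.sqrt(n_hidden), size=n_hidden)
        f = np.tanh(X @ W1 + b1) @ w2          # noiseless network output
        var_f = f.var()
        pves.append(var_f / (var_f + sigma_noise**2))
    return np.array(pves)

X = rng.normal(size=(100, 20))
print(np.quantile(sample_prior_pve(X), [0.1, 0.5, 0.9]))
```

Under heavy-tailed local scales, the sampled PVE values tend to pile up near 1, which is the kind of mismatch with prior knowledge that tuning the hyperparameters is meant to correct.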

Highlights

  • Neural networks (NNs) have achieved state-of-the-art performance on a wide range of supervised learning tasks with a high signal-to-noise ratio (S/N), such as computer vision (Krizhevsky et al., 2012) and natural language processing (Devlin et al., 2018).

  • In the Supplementary, we develop a novel Monte Carlo approach to model the log-linear relationship between the global scale of the mean-field Gaussian prior and the prediction variance of the Bayesian neural network (BNN), avoiding a computationally expensive grid search; we use this to set the variance according to a point estimate of the proportion of variance explained (PVE), but find that the resulting non-hierarchical Gaussian prior is not flexible enough (a minimal illustration is sketched after this list).

  • When the true PVE is unavailable as prior knowledge, a less informative prior over the PVE (e.g., U[0, 1] in HMF+PVE) still places sufficient probability density on the true PVE, in contrast with HMF, whose induced prior PVE is highly concentrated at 1 and places almost zero probability density on the true PVE.
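The second highlight's link between the global prior scale and the prediction variance can be illustrated as follows. This is a minimal sketch assuming a one-hidden-layer BNN with a single shared global scale; it fits the log-linear relationship from a handful of Monte Carlo evaluations (a small grid plus a log-log regression, so it does not reproduce the Supplementary's grid-search-free procedure) and then inverts the fit to choose the scale matching a point estimate of the PVE. All names and the architecture are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def prediction_variance(X, global_scale, n_hidden=50, n_draws=200):
    """Monte Carlo estimate of the prior prediction variance of a one-hidden-layer
    BNN whose weights all share a single global scale (mean-field Gaussian prior)."""
    n, p = X.shape
    outs = []
    for _ in range(n_draws):
        W1 = rng.normal(0.0, global_scale, size=(p, n_hidden))
        w2 = rng.normal(0.0, global_scale / np.sqrt(n_hidden), size=n_hidden)
        outs.append(np.tanh(X @ W1) @ w2)
    return np.var(np.stack(outs))

def scale_for_target_pve(X, target_pve, sigma_noise=1.0, grid=np.logspace(-2, 1, 8)):
    """Fit log(prediction variance) ~ a + b * log(scale) on a few scales and invert it
    to find the global scale whose prior prediction variance matches the variance
    implied by a point estimate of the PVE."""
    log_var = np.log([prediction_variance(X, s) for s in grid])
    b, a = np.polyfit(np.log(grid), log_var, 1)     # slope b, intercept a
    target_var = target_pve / (1.0 - target_pve) * sigma_noise**2
    return np.exp((np.log(target_var) - a) / b)

X = rng.normal(size=(100, 20))
print(scale_for_target_pve(X, target_pve=0.3))
```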


Summary

Introduction

Neural networks (NNs) have achieved state-of-the-art performance on a wide range of supervised learning tasks with a high signal-to-noise ratio (S/N), such as computer vision (Krizhevsky et al., 2012) and natural language processing (Devlin et al., 2018). We ask how to encode domain knowledge into the prior over Bayesian neural network (BNN) weights, which are high-dimensional and uninterpretable. We propose determining the hyper-priors according to two types of domain knowledge often available in scientific applications: ballpark figures on feature sparsity and the signal-to-noise ratio. We propose a novel informative hyper-prior over the feature inclusion indicators τ_i^(l), called the informative spike-and-slab, which can directly model any distribution on the number of relevant features (Figure 1a). The distribution of PVE assumed by a BNN is induced by the prior on the model's weights, which in turn is affected by all the hyper-parameters. Hyper-parameters that do not affect feature sparsity, e.g. λ_i^(l), can be used to encode domain knowledge about the PVE.
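As a concrete illustration of the informative spike-and-slab idea, the sketch below (Python/NumPy) first draws the number of relevant features from a user-specified distribution and then allocates the inclusion indicators uniformly at random, so a ballpark figure on feature sparsity can be encoded directly. The two-stage construction, function names, and the binomial choice are assumptions made for illustration, not the paper's exact prior.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_inclusion_indicators(p, prior_on_num_relevant, n_draws=1000):
    """Informative spike-and-slab sketch: draw the number of relevant features k
    from a user-specified prior, then pick which k features are included uniformly
    at random (illustrative, not the paper's exact construction)."""
    draws = np.zeros((n_draws, p), dtype=bool)
    for d in range(n_draws):
        k = prior_on_num_relevant(rng)              # e.g. "roughly 5 of 100 features"
        idx = rng.choice(p, size=k, replace=False)  # which features get the slab
        draws[d, idx] = True
    return draws

# Encode the ballpark figure "roughly 5% of 100 features are relevant".
prior_k = lambda rng: rng.binomial(n=100, p=0.05)
taus = sample_inclusion_indicators(p=100, prior_on_num_relevant=prior_k)
print(taus.sum(axis=1).mean())   # average number of included features, close to 5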

Proportion of Variance Explained
Bayesian neural networks
Stein Gradient Estimator
Prior knowledge about sparsity
Prior on the number of relevant features
Feature allocation
Prior knowledge on the PVE
PVE for Bayesian neural networks
Optimizing hyper-parameters according to prior PVE
Learning BNNs with variational inference
Related literature
Experiments
Synthetic data
Results
Public real-world UCI datasets
Web traffic time series prediction
Metabolite prediction using genetic data
Conclusion