An introduction to Bayesian methods in ML

A two-day workshop introducing Bayesian modeling, using practical examples and probabilistic programming.

$$ P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)} $$

About the workshop

There are two major philosophical interpretations of probability: frequentist and Bayesian. The frequentist interpretation is based on the idea that probability represents the long-run frequency of events, while the Bayesian interpretation holds that probability represents a degree of belief in an event. The Bayesian interpretation is often more intuitive and lends itself more naturally to certain problems.

The Bayesian approach is particularly well suited for modeling uncertainty about a model's parameters: it allows prior knowledge about the parameters to be combined with observed data into a posterior probability distribution, which is not directly possible in the frequentist approach.
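As a standard illustration of this updating step (the conjugate coin example revisited in Part 1, with hypothetical counts): a Beta prior over a coin's heads probability $\theta$, combined with $k$ heads in $n$ flips, yields a Beta posterior,

$$ \theta \sim \mathrm{Beta}(\alpha, \beta), \qquad P(\theta | D) = \mathrm{Beta}(\alpha + k,\ \beta + n - k). $$

The prior parameters $\alpha$ and $\beta$ encode the prior belief about the coin, and the data simply shift them by the observed counts.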

In this course we introduce the participants to Bayesian methods in machine learning. We start with a brief introduction to Bayesian probability and inference, then review the major approaches to the latter, discussing their advantages and disadvantages. We introduce the model-based machine learning approach and discuss how to build probabilistic models from domain knowledge. We do this with the probabilistic programming framework pyro, working through a number of examples from scratch. Finally, we discuss how to criticize and iteratively improve a model.

Learning outcomes

Parameter learning in the Bayesian setup
  • Get to know the Bayesian methodology and how Bayesian inference can be used as a general purpose tool in machine learning.
  • Understand computational challenges in Bayesian inference and how to overcome them.
  • Get acquainted with the foundations of approximate Bayesian inference.
  • Take first steps in probabilistic programming with pyro.
  • Get to know the model-based machine learning approach.
  • Learn how to build probabilistic models from domain knowledge.
  • Learn how to explore and iteratively improve your model.
  • See many prototypical probabilistic model types, e.g. Gaussian mixtures, Gaussian processes, latent Dirichlet allocation, variational auto-encoders, and Bayesian neural networks.
  • Learn how to combine probabilistic modeling with deep learning.

Structure of the workshop

Part 1: Introduction to Bayesian probability and inference

We review the basic concepts of probability theory under the Bayesian interpretation, highlighting the differences from the frequentist approach.

  • Recap on Bayesian probability.
  • Example: determine the fairness of a coin.
  • Introduction to pyro (a minimal coin-model sketch follows this list).

Source: XKCD 1132
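As a taste of how such a model looks in pyro, here is a minimal sketch of the coin-fairness example (the prior parameters and data below are illustrative, not the workshop's actual notebook):

```python
import torch
import pyro
import pyro.distributions as dist

def coin_model(flips):
    # Prior belief about the probability of heads.
    theta = pyro.sample("theta", dist.Beta(2.0, 2.0))
    # Each observed flip is a Bernoulli draw with that probability.
    with pyro.plate("flips", len(flips)):
        pyro.sample("obs", dist.Bernoulli(theta), obs=flips)

# Hypothetical data: 7 heads out of 10 flips.
flips = torch.tensor([1., 1., 1., 1., 1., 1., 1., 0., 0., 0.])

# For this conjugate model the posterior is available in closed form,
# Beta(2 + 7, 2 + 3), so no approximate inference is needed yet;
# the same kind of model is revisited with MCMC and SVI in Parts 3 and 4.
posterior = dist.Beta(2.0 + flips.sum(), 2.0 + (1.0 - flips).sum())
print(posterior.mean)  # posterior mean of theta, roughly 0.64
```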

Part 2: Exact inference with Gaussian processes

Gaussian processes are a powerful tool for modeling uncertainty in a wide range of problems. We give a brief introduction to inference with Gaussian distributions, followed by the Gaussian process model. We use the latter to model the CO2 data from the Mauna Loa observatory. Finally, we explain how sparse Gaussian processes allow working with larger data sets. A minimal sketch of the exact posterior computation follows the topic list below.

Sparse Gaussian process posterior with $4$ pseudo-observations

  • Inference with Gaussian variables.
  • The Gaussian process model.
  • Example: Mauna Loa CO2 data.
  • Sparse Gaussian processes.
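As a rough sketch of exact GP inference with an RBF kernel (the toy one-dimensional data and hyperparameters here are illustrative):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential (RBF) covariance between two sets of 1-D inputs.
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise=0.1):
    # Exact GP posterior mean and covariance at the test inputs.
    K = rbf_kernel(x_train, x_train) + noise**2 * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# Toy data: noisy observations of a sine curve.
x_train = np.linspace(0, 5, 10)
y_train = np.sin(x_train) + 0.1 * np.random.randn(10)
x_test = np.linspace(0, 5, 100)
mean, cov = gp_posterior(x_train, y_train, x_test)
std = np.sqrt(np.diag(cov))  # pointwise predictive uncertainty
```

The cubic cost of the linear solves in the number of training points is exactly what sparse Gaussian processes address with pseudo-observations.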

Part 3: Approximate inference with Markov chain Monte Carlo (MCMC)

Because computing the posterior distribution is often intractable, approximate techniques are required. One of the oldest and most popular approaches is Markov chain Monte Carlo (MCMC): through a clever procedure it is possible to sample from the posterior rather than computing its analytic form. We start with a brief introduction to the general class of methods, followed by the Metropolis-Hastings algorithm. We conclude with Hamiltonian MCMC and apply these algorithms to the problem of causal inference. A minimal Metropolis-Hastings sketch follows the topic list below.

Causal models for age (A), marriages (M), divorces (D)
  • Introduction to MCMC.
  • Example: King Markov and his island kingdom.
  • Metropolis-Hastings, Hamiltonian MCMC, and NUTS.
  • Example: Marriages, divorces and causality.
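As a minimal sketch of the core idea, here is a random-walk Metropolis-Hastings sampler in plain NumPy (the target density and step size are illustrative):

```python
import numpy as np

def log_posterior(theta):
    # Illustrative unnormalized log posterior: a standard normal target.
    return -0.5 * theta**2

def metropolis_hastings(log_target, n_samples=5000, step=0.5):
    # Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.
    samples = np.empty(n_samples)
    theta = 0.0
    for i in range(n_samples):
        proposal = theta + step * np.random.randn()
        # Accept with probability min(1, p(proposal) / p(theta)).
        if np.log(np.random.rand()) < log_target(proposal) - log_target(theta):
            theta = proposal
        samples[i] = theta
    return samples

samples = metropolis_hastings(log_posterior)
print(samples.mean(), samples.std())  # should be close to 0 and 1
```

Hamiltonian MCMC and NUTS replace the random-walk proposal with gradient-informed trajectories, which makes them practical in higher dimensions; pyro provides both as ready-made kernels.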

Part 4: Approximate inference with stochastic variational inference (SVI)

MCMC methods suffer from the curse of dimensionality. To overcome this problem, stochastic variational inference (SVI) can be used: a powerful tool that approximates the posterior distribution with a member of a parametric family. After explaining the basics of SVI, we discuss common approaches to designing such families, such as the mean field assumption. We demonstrate the power of SVI by using variational auto-encoders to model semantic properties of yearbook images taken over the span of a full century. A minimal SVI sketch in pyro follows the topic list below.

Variational auto-encoder

Stochastic variational inference schema
  • Introduction to SVI.
  • Posterior approximations: the mean field assumption.
  • Amortized SVI and variational auto-encoders.
  • Example: Modeling yearbook faces through the ages with variational auto-encoders.
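As a minimal pyro SVI sketch, here the coin model from Part 1 is fitted with a hand-written mean-field guide (the learning rate and step count are illustrative; the fitted Beta should end up close to the exact Beta(9, 5) posterior):

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.distributions import constraints
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

def model(flips):
    theta = pyro.sample("theta", dist.Beta(2.0, 2.0))
    with pyro.plate("flips", len(flips)):
        pyro.sample("obs", dist.Bernoulli(theta), obs=flips)

def guide(flips):
    # Mean-field variational family: a Beta with two learnable parameters.
    alpha = pyro.param("alpha", torch.tensor(1.0), constraint=constraints.positive)
    beta = pyro.param("beta", torch.tensor(1.0), constraint=constraints.positive)
    pyro.sample("theta", dist.Beta(alpha, beta))

flips = torch.tensor([1., 1., 1., 1., 1., 1., 1., 0., 0., 0.])
pyro.clear_param_store()
svi = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
for step in range(2000):
    svi.step(flips)

print(pyro.param("alpha").item(), pyro.param("beta").item())
```

Amortized SVI, as used in variational auto-encoders, replaces the per-datapoint variational parameters with the output of a neural network (the encoder).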

Part 5: Laplace approximation and Bayesian neural networks

In this part we introduce the Laplace approximation and discuss how it can be used to approximate the posterior of the parameters of a neural network given the training data. The method fits a Gaussian posterior approximation centered at the maximum a posteriori (MAP) estimate, with covariance given by the inverse Hessian of the negative log posterior at that point. We then comment on how Bayesian neural networks can be used to obtain well-calibrated classifiers. A one-parameter sketch of the approximation follows the topic list below.

Laplace Approximation

  • Introduction to Laplace approximation.
  • Bayesian neural networks and the calibration of classifiers.
  • Example: Bayesian neural networks for wine connoisseurs.
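As a minimal one-parameter sketch of the two steps (the log posterior below is purely illustrative; for a Bayesian neural network the same idea is applied to the weight vector, usually with an additional diagonal or Kronecker-factored approximation of the Hessian):

```python
import torch

def negative_log_posterior(theta):
    # Illustrative unnormalized negative log posterior in one parameter.
    return 0.5 * (theta - 2.0) ** 2 + 0.1 * theta ** 4

# Step 1: find the MAP estimate by gradient descent.
theta = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.Adam([theta], lr=0.05)
for _ in range(1000):
    optimizer.zero_grad()
    loss = negative_log_posterior(theta)
    loss.backward()
    optimizer.step()

# Step 2: the curvature at the MAP gives the Gaussian approximation
# q(theta) = N(theta_MAP, H^{-1}), with H the Hessian of the negative log posterior.
theta_map = theta.detach().requires_grad_(True)
grad = torch.autograd.grad(negative_log_posterior(theta_map), theta_map, create_graph=True)[0]
hessian = torch.autograd.grad(grad, theta_map)[0]
laplace_variance = 1.0 / hessian.detach()
print(theta_map.item(), laplace_variance.item())
```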

Additional material

The course contains a number of additional topics not included in the schedule. These are made available to the participants after the workshop and, time allowing, we cover one or two of them, chosen by the audience.

  • Example: Understanding the factors influencing the development of asthma through Markov chains.
    Graphical model for the development of allergies

  • Example: Satellite image clustering with mixture models and latent Dirichlet allocation-like models.

Local marginal distributions of the color channels in a satellite image

  • Example: Assessing skills from questionnaire data with latent skill models.

Prerequisites

  • Probabilistic modeling demands a good understanding of probability theory and mathematical modeling; participants should be familiar with the basic concepts of both.
  • We assume prior exposure to machine learning and deep learning. We will not cover these in the course.
  • Knowledge of Python is required to complete the exercises. Experience with PyTorch is not required but is helpful.

Companion seminar

Accompanying the course, we offered the seminar Uncertainty quantification for neural networks to cover neighboring topics that did not make it into the course due to time constraints. It was held online over several weeks and consisted of talks reviewing papers in the field.

Our seminar is informal and open to everyone: we welcome participation, whether in the discussions or by presenting papers.
