Benchmarking Simulation-Based Inference

A much-needed benchmark for methods of simulation-based inference.

The field of simulation-based inference (SBI) has recently seen a number of new algorithms appear, in particular ones based on neural density estimators. However, a public, standard benchmark covering different kinds of tasks has been missing. Lueckmann et al. [Lue21B] introduce a benchmarking framework that allows comparing algorithms across tasks of varying dimensionality, posterior characteristics and number of relevant variables. The benchmark can moreover be extended with new algorithms, tasks and metrics via pull requests on GitHub.
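The companion code is published as an open-source package. As a rough, minimal sketch, assuming the sbibm package and the interface shown in its documentation (task, algorithm and metric names below are taken from that documentation and may change between versions), a typical evaluation loop looks like this:

```python
import sbibm

# Pick one of the benchmark tasks; each bundles a prior, a simulator,
# observations and reference posterior samples.
task = sbibm.get_task("two_moons")
prior = task.get_prior()
simulator = task.get_simulator()
observation = task.get_observation(num_observation=1)

# Run one of the bundled baseline algorithms, e.g. rejection ABC.
from sbibm.algorithms import rej_abc
posterior_samples, _, _ = rej_abc(
    task=task, num_samples=10_000, num_observation=1, num_simulations=100_000
)

# Compare the approximate posterior to the reference posterior with a metric.
from sbibm.metrics import c2st
reference_samples = task.get_reference_posterior_samples(num_observation=1)
accuracy = c2st(reference_samples, posterior_samples)
```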

As baselines, the authors provide a total of eight algorithms: classic Monte Carlo Approximate Bayesian Computation (ABC) as well as neural posterior, likelihood and likelihood-ratio estimation, each with its sequential counterpart. The algorithms are classified according to [Cra20F] and depicted in Figure 1. The selection was intentionally kept small in order to focus on implementation details and hyperparameter tuning.

[Lue21B] thus cover the four types of algorithms and their sequential counterparts classified by [Cra20F], comparing neural with ABC-based, sequential with non-sequential, and amortized with non-amortized algorithms.
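As a reminder of the simplest of these baselines, here is a minimal sketch of rejection ABC on a toy simulator (the simulator, prior and tolerance are illustrative choices, not part of the benchmark): parameters are drawn from the prior, data is simulated, and only those parameters whose simulations land within a tolerance of the observation are kept.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta: np.ndarray) -> np.ndarray:
    """Toy simulator: the data is the parameter plus Gaussian noise."""
    return theta + 0.1 * rng.standard_normal(theta.shape)

def rejection_abc(x_obs: np.ndarray, num_simulations: int = 100_000, eps: float = 0.05) -> np.ndarray:
    """Keep prior draws whose simulated data lies within eps of the observation."""
    theta = rng.uniform(-1.0, 1.0, size=(num_simulations, x_obs.shape[-1]))  # prior samples
    x = simulator(theta)
    distances = np.linalg.norm(x - x_obs, axis=-1)
    return theta[distances < eps]  # accepted samples approximate the posterior

x_obs = np.array([0.3, -0.2])
posterior_samples = rejection_abc(x_obs)
print(posterior_samples.shape, posterior_samples.mean(axis=0))
```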

The authors use ten benchmark tasks which, in contrast to most real problems, allow sampling from the true posterior. This is motivated by the shortcomings of single-sample metrics: an algorithm that obtains a good MAP point estimate can pass a posterior-predictive check even if the rest of the estimated posterior is a poor fit. This observation led to the inclusion of further metrics such as Maximum Mean Discrepancy (MMD) and the Classifier 2-Sample Test (C2ST). Each task targets specific traits of an algorithm, e.g. the behaviour with increasing dimensionality or spurious variables, or how it deals with multimodal distributions.
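To illustrate how a distribution-level metric like C2ST works, the sketch below trains a small classifier to distinguish reference posterior samples from the samples returned by an algorithm: held-out accuracy near 0.5 means the two sample sets are indistinguishable (a good fit), while accuracy near 1.0 signals a poor fit. The classifier architecture and cross-validation setup here are illustrative choices, not necessarily those used in [Lue21B].

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def c2st_accuracy(reference_samples: np.ndarray, algorithm_samples: np.ndarray) -> float:
    """Classifier 2-Sample Test: mean cross-validated accuracy of a classifier
    trained to tell reference posterior samples apart from algorithm samples."""
    X = np.concatenate([reference_samples, algorithm_samples], axis=0)
    y = np.concatenate([np.zeros(len(reference_samples)), np.ones(len(algorithm_samples))])
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

# Example: samples from identical distributions should give accuracy close to 0.5.
rng = np.random.default_rng(0)
ref = rng.standard_normal((1000, 2))
approx = rng.standard_normal((1000, 2))
print(c2st_accuracy(ref, approx))
```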

The authors obtained the following key results:

  1. Algorithms using neural networks for density estimation perform better than ABC-based methods.
  2. The choice of comparison metric affects the ranking of algorithms.
  3. Sequential methods are more sample efficient than their non-sequential counterparts (see the sketch after this list).
  4. Some tasks show substantial room for improvement.
  5. There is no algorithm to rule them all, i.e. “no free lunch”.
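To make the distinction in the third point concrete, below is a minimal sketch of a multi-round, SNPE-style loop written against the interface of the sbi package; the package and call names are assumptions based on its documentation rather than the exact setup of [Lue21B], and the toy simulator is purely illustrative. Each round draws parameters from the current proposal, simulates, refines the density estimator, and conditions on the observation to obtain the next proposal; a single round with the prior as proposal corresponds to the amortized, non-sequential variant.

```python
import torch
from sbi.inference import SNPE
from sbi.utils import BoxUniform

def simulator(theta: torch.Tensor) -> torch.Tensor:
    # Toy simulator: parameters plus Gaussian noise; a benchmark task would replace this.
    return theta + 0.1 * torch.randn_like(theta)

prior = BoxUniform(low=-2 * torch.ones(2), high=2 * torch.ones(2))
x_o = torch.tensor([[0.3, -0.2]])  # the observation we condition on

inference = SNPE(prior=prior)
proposal = prior
for _ in range(3):  # number of rounds; a single round corresponds to amortized NPE
    theta = proposal.sample((500,))         # draw parameters from the current proposal
    x = simulator(theta)                    # simulate data for each parameter
    estimator = inference.append_simulations(theta, x, proposal=proposal).train()
    posterior = inference.build_posterior(estimator)
    proposal = posterior.set_default_x(x_o)  # focus the next round on the observation

posterior_samples = posterior.sample((1000,), x=x_o)
```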

Performance of different algorithms on the two-moons task for different simulation budgets and different metrics: across all metrics the neural density estimators perform better than ABC, and sequential estimators tend to perform at least as well as non-sequential ones, although the ranking changes with the metric. The fourth key result, however, is more clearly visible in other plots in [Lue21B].

As the authors note, this benchmarking framework will help to compare methods and to identify problems and strengths. Its value will further increase as researchers and practitioners add more algorithms, tasks and metrics.

References

[Cra20F] K. Cranmer, J. Brehmer, G. Louppe, The frontier of simulation-based inference, PNAS 117(48), 2020.

[Lue21B] J.-M. Lueckmann, J. Boelts, D. S. Greenberg, P. J. Gonçalves, J. H. Macke, Benchmarking Simulation-Based Inference, AISTATS 2021.
