A Trust Crisis In Simulation-Based Inference? Your Posterior Approximations Can Be Unfaithful

[Her23C] provides empirical evidence that several popular algorithms for Simulation-based Inference provide overconfident approximations by excluding unlikely but plausible parameters. Therefore, one should estimate reliability alongside performance, but the authors argue that popular metrics for the quality of posterior approximations are insufficient to detect overconfidence.

As an example, a classifier is trained to solve a two-sample test [Lop17R], i.e. to distinguish between samples from the true posterior and samples from the approximated posterior. The discriminative power of the classifier is measured by the Area Under the Receiver Operating Characteristic Curve (AUROC). In Figure 1, the same value of $AUROC=0.7$ was obtained for two approximations. While the metric rates both as equally good, the overconfident one is more likely to produce unfaithful results by wrongly excluding plausible parameters.

[Her23C] showcases how popular metrics fail to detect overconfidence by fitting a classifier for the Classifier Two-Sample Test. The test cannot distinguish a conservative approximation from an overconfident one.
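The effect can be reproduced in a toy setting. The sketch below is not the authors' experiment: it assumes a standard normal "true posterior" and two hypothetical approximations with half and double the standard deviation. Instead of training a classifier, it scores each sample by $|\theta|$ (for two zero-mean Gaussians this score is monotone in the likelihood ratio) and computes the AUROC from ranks.

```python
import bisect
import random


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    neg_sorted = sorted(neg_scores)
    wins = 0.0
    for s in pos_scores:
        lo = bisect.bisect_left(neg_sorted, s)   # negatives strictly below s
        hi = bisect.bisect_right(neg_sorted, s)  # negatives below or equal to s
        wins += lo + 0.5 * (hi - lo)
    return wins / (len(pos_scores) * len(neg_sorted))


def discriminative_power(scale, n=5000, seed=0):
    """AUROC for telling N(0, 1) samples apart from N(0, scale^2) samples."""
    rng = random.Random(seed)
    true_samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
    approx_samples = [rng.gauss(0.0, scale) for _ in range(n)]
    # Fixed score |theta| stands in for a trained classifier (toy assumption).
    auc = auroc([abs(t) for t in true_samples], [abs(t) for t in approx_samples])
    # A classifier could flip its labels, so measure the distance from chance symmetrically.
    return max(auc, 1.0 - auc)


overconfident = discriminative_power(scale=0.5)  # approximation too narrow
conservative = discriminative_power(scale=2.0)   # approximation too wide
print(f"overconfident approximation, AUROC: {overconfident:.3f}")
print(f"conservative approximation,  AUROC: {conservative:.3f}")
```

In this setup both values should land near $\frac{2}{\pi}\arctan 2 \approx 0.705$, so the AUROC alone gives no hint as to which of the two approximations is the narrow one.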

The authors ran extensive experiments on seven datasets to investigate posterior estimation with neural density estimators and ABC algorithms. Instead of using a hypothesis test to estimate how well an approximation matches a target distribution, they assessed each estimator's degree of conservativeness using the expected coverage of credible regions, defined as

$$ \mathbb{E}_{p(\mathbf{x})}\mathbb{E}_{p(\theta \mid \mathbf{x})}\left[\mathbb{1}(\theta \in \Theta_{\hat{p}(\theta \mid \mathbf{x})} (1-\alpha)) \right] \approx \frac{1}{n}\sum^n_{i=1}\mathbb{1}(\theta_i^{\ast} \in \Theta_{\hat{p}(\theta \mid \mathbf{x}_i)} (1-\alpha)) $$

where $(\theta^{\ast}_i, \mathbf{x}_i) \sim p(\theta, \mathbf{x})$ are i.i.d. samples from the joint distribution and $\Theta_{\hat{p}(\theta \mid \mathbf{x})}(1-\alpha)$ denotes the $1-\alpha$ highest density credible region of the estimated posterior.
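As an illustration, the Monte Carlo estimate on the right-hand side can be evaluated in closed form for a conjugate Gaussian toy model. Everything in the sketch below is an assumption made for illustration: the prior $\theta \sim \mathcal{N}(0, 1)$, the simulator $x \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$, and the hypothetical `scale` parameter that shrinks or inflates the standard deviation of the exact posterior to mimic an overconfident or conservative estimator.

```python
import random
from statistics import NormalDist

SIGMA = 1.0  # assumed simulator noise: x | theta ~ N(theta, SIGMA^2), prior theta ~ N(0, 1)


def expected_coverage(level, scale, n=20000, seed=0):
    """Fraction of joint samples whose theta* lies in the `level` credible region.

    The approximate posterior is N(post_mean, (scale * post_std)^2); scale=1 is exact.
    """
    rng = random.Random(seed)
    post_std = (SIGMA**2 / (1.0 + SIGMA**2)) ** 0.5
    # For a Gaussian, the highest density region is the central interval mean +/- z * std.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * scale * post_std
    hits = 0
    for _ in range(n):
        theta_star = rng.gauss(0.0, 1.0)   # draw from the prior
        x = rng.gauss(theta_star, SIGMA)   # run the toy simulator
        post_mean = x / (1.0 + SIGMA**2)   # exact posterior mean of the conjugate model
        hits += abs(theta_star - post_mean) <= half_width
    return hits / n


for name, scale in [("overconfident", 0.5), ("calibrated", 1.0), ("conservative", 2.0)]:
    print(f"{name:>13}: coverage at 0.9 = {expected_coverage(0.9, scale):.3f}")
```

With the exact posterior the estimate should sit close to the credibility level, while the narrowed posterior falls clearly below it and the widened one clearly above.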

The authors define a conservative posterior estimator as one whose expected coverage is at least the credibility level. Thus, the goal is to obtain an expected coverage probability equal to or larger than the credibility level. However, the expected coverage alone says little about the information gain of a posterior over its prior: the prior itself is conservative but uninformative. Therefore, the authors propose to use the expected information gain as a second metric:

$$ \mathbb{E}_{p(\theta, \mathbf{x})}\left[ \log \hat{p}(\theta \mid \mathbf{x}) - \log p(\theta) \right] $$
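A minimal sketch of a Monte Carlo estimate of this quantity, evaluated with an approximate posterior density in place of the exact one. It assumes a conjugate Gaussian toy model (prior $\mathcal{N}(0, 1)$, simulator noise $\sigma = 1$) and a hypothetical `scale` factor applied to the exact posterior standard deviation; none of these choices come from [Her23C].

```python
import math
import random

SIGMA = 1.0  # assumed simulator noise: x | theta ~ N(theta, SIGMA^2), prior theta ~ N(0, 1)


def log_normal_pdf(value, mean, std):
    """Log density of N(mean, std^2) at `value`."""
    return -0.5 * math.log(2.0 * math.pi * std**2) - (value - mean) ** 2 / (2.0 * std**2)


def expected_information_gain(scale=1.0, n=20000, seed=0):
    """Average of log q(theta | x) - log p(theta) over samples from the joint.

    q is the exact Gaussian posterior with its standard deviation multiplied by `scale`.
    """
    rng = random.Random(seed)
    post_std = (SIGMA**2 / (1.0 + SIGMA**2)) ** 0.5
    total = 0.0
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        x = rng.gauss(theta, SIGMA)
        post_mean = x / (1.0 + SIGMA**2)
        total += log_normal_pdf(theta, post_mean, scale * post_std)
        total -= log_normal_pdf(theta, 0.0, 1.0)
    return total / n


# With the exact posterior, the quantity equals the mutual information of the Gaussian model.
exact = 0.5 * math.log(1.0 + 1.0 / SIGMA**2)
eig_exact_posterior = expected_information_gain(scale=1.0)
eig_overconfident = expected_information_gain(scale=0.5)
eig_conservative = expected_information_gain(scale=2.0)
print(f"analytic: {exact:.3f}, exact posterior: {eig_exact_posterior:.3f}")
print(f"overconfident: {eig_overconfident:.3f}, conservative: {eig_conservative:.3f}")
```

In this toy model both the too-narrow and the too-wide posterior score lower than the exact one, which is why the metric complements coverage: a very wide posterior can be conservative while carrying little information.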

To investigate the conservativeness of the algorithms, the expected coverage probabilities are computed on a set of unseen samples from $p(\theta, \mathbf{x})$ for all credibility levels under consideration. Well-calibrated estimators have an expected coverage probability of $1-\alpha$ and produce a diagonal line when the coverage probability is plotted against the credibility level. Conservative estimators produce a curve above the diagonal, overconfident estimators a curve below it.
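When the credible regions are not available in closed form, the whole coverage curve can be estimated from posterior samples: a parameter $\theta^{\ast}$ lies in the $1-\alpha$ highest density region exactly when the posterior mass of points with higher density than $\theta^{\ast}$ is at most $1-\alpha$. The sketch below applies this idea to an assumed Gaussian toy model with a hypothetical `scale` factor on the posterior standard deviation; it illustrates the procedure and is not the authors' implementation.

```python
import math
import random

SIGMA = 1.0  # assumed simulator noise: x | theta ~ N(theta, SIGMA^2), prior theta ~ N(0, 1)


def log_normal_pdf(value, mean, std):
    """Log density of N(mean, std^2) at `value`."""
    return -0.5 * math.log(2.0 * math.pi * std**2) - (value - mean) ** 2 / (2.0 * std**2)


def hpd_ranks(scale, n_pairs=1000, n_post=200, seed=0):
    """For each joint sample, estimate the posterior mass with higher density than theta*."""
    rng = random.Random(seed)
    post_std = scale * (SIGMA**2 / (1.0 + SIGMA**2)) ** 0.5
    ranks = []
    for _ in range(n_pairs):
        theta_star = rng.gauss(0.0, 1.0)
        x = rng.gauss(theta_star, SIGMA)
        post_mean = x / (1.0 + SIGMA**2)
        ref = log_normal_pdf(theta_star, post_mean, post_std)
        # Count approximate-posterior samples that are more probable than theta*.
        higher = sum(
            log_normal_pdf(rng.gauss(post_mean, post_std), post_mean, post_std) > ref
            for _ in range(n_post)
        )
        ranks.append(higher / n_post)
    return ranks


def coverage_curve(ranks, levels):
    """Expected coverage at each credibility level."""
    return [sum(r <= level for r in ranks) / len(ranks) for level in levels]


levels = [0.1, 0.3, 0.5, 0.7, 0.9]
curves = {
    name: coverage_curve(hpd_ranks(scale), levels)
    for name, scale in [("overconfident", 0.5), ("calibrated", 1.0), ("conservative", 2.0)]
}
for name, curve in curves.items():
    print(f"{name:>13}: " + "  ".join(f"{c:.2f}" for c in curve))
```

Read against the diagonal given by `levels`, the three printed rows should show the three regimes described above: below, on, and above the diagonal.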

The expected coverage for the different algorithms and datasets shows that no method produces conservative results on all tasks. The sequential versions tend to be overconfident while requiring fewer simulations, i.e. a smaller simulation budget, as seen in Figure 2. The authors also investigate ensembles, which turned out to be more conservative.

Plotting expected coverage against credibility level shows whether an algorithm produces conservative or overconfident approximations. In all cases, small simulation budgets lead to overconfident estimators, as the brighter lines are below the diagonal. Sequential models tend to have curves below the diagonal for several budgets.

The authors obtained the following key findings:

All considered algorithms produced non-conservative posterior approximations for at least one task. Small simulation budgets amplify this behavior. However, a large budget is no guarantee for a conservative posterior approximation.
Amortized approaches are more conservative than non-amortized ones. This might be due to the latter strengthening their overconfidence each round.
The expected coverage probability of an ensemble is larger than that of the average individual model. Increasing the ensemble size further increases the expected coverage probability.
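The ensemble finding can be made plausible with a toy calculation. The sketch below assumes a conjugate Gaussian model and five hypothetical ensemble members, each an overconfident Gaussian whose mean is shifted by a fixed offset; the ensemble is the equal-weight mixture of the members. These numbers are invented for illustration and do not come from the experiments in [Her23C].

```python
import math
import random
from statistics import NormalDist

SIGMA = 1.0                                   # assumed simulator noise, prior theta ~ N(0, 1)
MEMBER_SCALE = 0.7                            # hypothetical: each member is too narrow
MEMBER_OFFSETS = [-0.6, -0.3, 0.0, 0.3, 0.6]  # hypothetical mean errors, in posterior stds


def normal_pdf(value, mean, std):
    return math.exp(-0.5 * ((value - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def coverage_individual_vs_ensemble(level=0.9, n_pairs=1000, n_post=200, seed=0):
    """Return (average member coverage, ensemble coverage) at one credibility level."""
    rng = random.Random(seed)
    true_std = (SIGMA**2 / (1.0 + SIGMA**2)) ** 0.5
    member_std = MEMBER_SCALE * true_std
    half_width = NormalDist().inv_cdf(0.5 + level / 2.0) * member_std
    member_hits = 0
    ensemble_hits = 0
    for _ in range(n_pairs):
        theta_star = rng.gauss(0.0, 1.0)
        x = rng.gauss(theta_star, SIGMA)
        means = [x / (1.0 + SIGMA**2) + off * true_std for off in MEMBER_OFFSETS]
        # Members: Gaussian highest density region is a central interval.
        member_hits += sum(abs(theta_star - m) <= half_width for m in means)

        # Ensemble: equal-weight mixture; region membership via the mass of
        # mixture samples that have higher mixture density than theta*.
        def mixture_pdf(value):
            return sum(normal_pdf(value, m, member_std) for m in means) / len(means)

        ref = mixture_pdf(theta_star)
        higher = sum(
            mixture_pdf(rng.gauss(rng.choice(means), member_std)) > ref
            for _ in range(n_post)
        )
        ensemble_hits += higher / n_post <= level
    return member_hits / (n_pairs * len(MEMBER_OFFSETS)), ensemble_hits / n_pairs


individual, ensemble = coverage_individual_vs_ensemble()
print(f"average member coverage at 0.9: {individual:.3f}")
print(f"ensemble coverage at 0.9:       {ensemble:.3f}")
```

Averaging the member densities spreads mass over the regions where the members disagree, so under these assumptions the mixture covers more often than its average member. It can still fall short of the credibility level, in line with the finding that no approach is guaranteed to be conservative.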

References

[Her23C] We present extensive empirical evidence showing that current Bayesian simulation-based inference algorithms can produce computationally unfaithful posterior approximations. Our results show that all benchmarked algorithms -- (S)NPE, (S)NRE, SNL and variants of ABC -- can yield overconfident posterior approximations, which makes them unreliable for scientific use cases and falsificationist inquiry. …

[Lop17R] The goal of two-sample tests is to assess whether two samples, $S_P \sim P^n$ and $S_Q \sim Q^m$, are drawn from the same distribution. Perhaps intriguingly, one relatively unexplored method to build two-sample tests is the use of binary classifiers. In particular, construct a dataset by pairing the $n$ examples in $S_P$ with a positive label, and by pairing the $m$ examples in $S_Q$ with a …
