The calibrated classifier

Informed decision-making based on classifiers requires that the confidence in their predictions reflect the actual error rates. A model with this property is said to be calibrated. Recent work has shown that expressive neural networks can overfit the cross-entropy loss without losing accuracy, which yields overconfident (i.e. miscalibrated) models. We analyse several definitions of calibration and the relationships between them, examine related empirical measures and their usefulness, and explore several algorithms for improving calibration.
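As an illustration of what an empirical measure of calibration can look like, here is a minimal sketch of a binned expected calibration error (ECE) for the top-label confidence. It assumes NumPy, equal-width bins, and predicted probabilities that already sum to one per row; the function name and the choice of 10 bins are illustrative assumptions, not something prescribed by the text above.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for top-label confidence (illustrative sketch).

    probs:  array of shape (n_samples, n_classes), rows are probabilities.
    labels: array of shape (n_samples,), integer class labels.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)

    confidences = probs.max(axis=1)        # confidence of the predicted class
    predictions = probs.argmax(axis=1)     # predicted class
    correct = (predictions == labels).astype(float)

    # Equal-width bins on [0, 1]; sample i falls in bin b if
    # edges[b] < confidence_i <= edges[b + 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between accuracy and mean confidence within the bin,
            # weighted by the fraction of samples in that bin.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece


# Toy data: a model that is always fully confident but right half the time.
overconfident = expected_calibration_error(
    [[1.0, 0.0]] * 4, [0, 0, 1, 1]
)
```

A value near zero means that, within each bin, the average confidence matches the observed accuracy. Binned estimators like this one depend on the number and placement of bins, so the result should be read as a rough diagnostic rather than an exact quantity.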

References

[Ben19C]

Calibration for Anomaly Detection, Adrian Benton.

Aug 2019

Recent work on model calibration found that a simple variant of Platt scaling, temperature scaling, is effective at calibrating modern neural networks across an array of classification tasks. However, when negative examples overwhelm the dataset, classifiers will often be biased to producing well-calibrated predictions for negative examples, but have trouble producing well-calibrated predictions …

[Daw82W]

The Well-Calibrated Bayesian, A. P. Dawid.

Sep 1982

Suppose that a forecaster sequentially assigns probabilities to events. He is well calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent. We prove a theorem to the effect that a coherent Bayesian expects to be well calibrated, and consider its destructive implications for the theory of …

[Fer19S]

Setting decision thresholds when operating conditions are uncertain, Cèsar Ferri, José Hernández-Orallo, Peter Flach.

Jul 2019

The quality of the decisions made by a machine learning model depends on the data and the operating conditions during deployment. Often, operating conditions such as class distribution and misclassification costs have changed during the time since the model was trained and evaluated. When deploying a binary classifier that outputs scores, once we know the new class distribution and the new cost …

[Guo17C]

On Calibration of Modern Neural Networks, Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger.

Jul 2017

Confidence calibration – the problem of predicting probability estimates representative of the true correctness likelihood – is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors …

[Kul19T]

Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration, Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, Peter Flach.

2019

Class probabilities predicted by most multiclass classifiers are uncalibrated, often tending towards over-confidence. With neural networks, calibration can be improved by temperature scaling, a method to learn a single corrective multiplicative factor for inputs to the last softmax layer. On non-neural models the existing methods apply binary calibration in a pairwise or one-vs-rest fashion. We …

[Kum18T]

Trainable Calibration Measures for Neural Networks from Kernel Mean Embeddings, Aviral Kumar, Sunita Sarawagi, Ujjwal Jain.

Jul 2018

Modern neural networks have recently been found to be poorly calibrated, primarily in the direction of over-confidence. Methods like entropy penalty and temperature smoothing improve calibration by...

[Lak17S]

Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles, Balaji Lakshminarayanan, Alexander Pritzel, Charles Blundell.

2017

Deep neural networks (NNs) are powerful black box predictors that have recently achieved impressive performance on a wide spectrum of tasks. Quantifying predictive uncertainty in NNs is a challenging and yet unsolved problem. Bayesian NNs, which learn a distribution over weights, are currently the state-of-the-art for estimating predictive uncertainty; however these require significant …

[Lin17F]

Focal Loss for Dense Object Detection, Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár.

Aug 2017

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In …

[Muk20C]

Calibrating deep neural networks using focal loss, Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip Torr, Puneet Dokania.

2020

Miscalibration -- a mismatch between a model's confidence and its correctness -- of Deep Neural Networks (DNNs) makes their predictions hard to rely on. Ideally, we want networks to be accurate, calibrated and confident. We show that, as opposed to the standard cross-entropy loss, focal loss (Lin et al., 2017) allows us to learn models that are already very well calibrated. When combined with …

[Nic05P]

Predicting good probabilities with supervised learning, Alexandru Niculescu-Mizil, Rich Caruana.

2005

We examine the relationship between the predictions made by different learning algorithms and true posterior probabilities. We show that maximum margin methods such as boosted trees and boosted stumps push probability mass away from 0 and 1 yielding a characteristic sigmoid shaped distortion in the predicted probabilities. Models such as Naive Bayes, which make unrealistic independence …

[Per17R]

Regularizing Neural Networks by Penalizing Confident Output Distributions, Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, Geoffrey Hinton.

Jan 2017

We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the …

[Pla99P]

Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, John C. Platt.

1999

The output of a classifier should be a calibrated posterior probability to enable post-processing. Standard SVMs do not provide such probabilities. One method to create probabilities is to directly train a kernel classifier with a logit link function and a regularized maximum likelihood score. However, training with a maximum likelihood score will produce non-sparse kernel machines. Instead, we …

[Rag18B]

Behaviour Policy Estimation in Off-Policy Policy Evaluation: Calibration Matters, Aniruddh Raghu, Omer Gottesman, Yao Liu, Matthieu Komorowski, Aldo Faisal, Finale Doshi-Velez, Emma Brunskill.

Jul 2018

In this work, we consider the problem of estimating a behaviour policy for use in Off-Policy Policy Evaluation (OPE) when the true behaviour policy is unknown. Via a series of empirical studies, we demonstrate how accurate OPE is strongly dependent on the calibration of estimated behaviour policy models: how precisely the behaviour policy is estimated from data. We show how powerful parametric …

[Vai19E]

Evaluating model calibration in classification, Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, Thomas Schön.

Apr 2019

Probabilistic classifiers output a probability distribution on target classes rather than just a class prediction. Besides providing a clear separation of prediction and decision making, the main advantage of probabilistic models is their ability to represent uncertainty about predictions. In safety-critical applications, it is pivotal for a model to possess an adequate sense of uncertainty, which …

[Wid19C]

Calibration tests in multi-class classification: A unifying framework, David Widmann, Fredrik Lindsten, Dave Zachariah.

2019

In safety-critical applications a probabilistic model is usually required to be calibrated, i.e., to capture the uncertainty of its predictions accurately. In multi-class classification, calibration of the most confident predictions only is often not sufficient. We propose and study calibration measures for multi-class classification that generalize existing measures such as the expected …
