A supervised machine learning application predicts values $Y \in \mathcal{Y}$ from values $X \in \mathcal{X}$, by fitting a function $\hat{f}$ in some model class $\mathcal{F}$, using data $D := \lbrace (X_{i}, Y_{i}) \rbrace_{i = 1}^n.$ A learning algorithm $\mathcal{A}$ maps $D \mapsto \hat{f} = \hat{f}_{D}.$ Now, given $\hat{f}_{D}$ and a loss function $l : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^+$, one wishes to estimate its expected prediction error

\begin{equation} \operatorname{Err} (\hat{f}) := \mathbb{E} [l (\hat{f}_{D} (X), Y)], \label{eq-err}\tag{1} \end{equation}

but here one must be careful with the meaning of the expectation. The function $\hat{f}_{D}$ itself is random, being the image by $\mathcal{A}$ of a random quantity, $\hat{f}_{D} =\mathcal{A} (D)$, so the above spelled out is:

\[ \operatorname{Err} (\hat{f}) =\mathbb{E}_{D \sim \mathbb{P}_{X, Y}^n} [\mathbb{E}_{X, Y} [l (\hat{f}_{D} (X), Y) |D]] . \]

To be more precise, $\operatorname{Err}(\hat{f})$ is a function of the tuple $(\mathcal{A}, l, n, \mathbb{P}_{X Y}).$ It is the expected error on future test points, when training on any data set of size $n$ drawn i.i.d. from the same distribution.1 1 The number of samples matters: estimators using subsets of the data, e.g. $K$-fold cross-validation, will be biased for the errors of algorithms trained on data sets of full size $n.$ This quantity is of interest e.g. when designing learning algorithms, but it is not what a practitioner usually wants. Instead, for applications one is interested in the error that some specific $\hat{f}_{D}$, trained on a fixed dataset $D = \lbrace (x_{i}, y_{i}) \rbrace_{i = 1}^n$, will incurr when deployed on unseen data, or what is usually called generalization error or prediction error:

\begin{equation} \operatorname{Err}_{X Y} (\hat{f}) := \mathbb{E}_{X, Y} [l (\hat{f} (X), Y) |D] \label{eq-err-xy}\tag{2} \end{equation}

(here and in the sequel we will ommit the subindex in $\hat{f}_{D}$, as well as the dependency on $\mathcal{A}, n, l$ and the distribution of $X, Y$ for brevity). Note that we assume $\hat{f}_{D}$ to be the outcome of a model fitting procedure, and not of model selection. More on this below.

Because the true distribution for $X, Y$ is unknown, estimates for these quantities are required. Most methods can be roughly classified as resampling based or analytic, see Figure 1. Here we focus on CV, but the results of Section 2 extend to most of them (for linear models).

Figure 1. Common approaches to the estimation of different prediction errors.

1 A note on model selection

It is important to note that we are always assuming that no model selection procedure is performed using the training data, i.e. no feature selection, hyperparameter tuning nor selection of an imputation procedure. Otherwise, there would be correlations between the selected model, which is a random variable itself, and any statistics used in inference later, e.g. confidence intervals, hypothesis tests or error estimates.2 2 Consider a regression problem in which the data is used to do feature selection. Intuitively, in a significance test for the chosen variables, these are necessarily going to be significant for the data that was used to select them in the first place. Additionally, it is well known that the CV estimate of the error of the model chosen by a selection procedure is strongly biased downwards, because the statistics of the extreme value $\min \lbrace \widehat{\operatorname{Err}}_{1}, \ldots, \widehat{\operatorname{Err}}_{k} \rbrace$ are different from those of the $\widehat{\operatorname{Err}}_{j}.$

The simplest approach for simultaneous model selection and error estimation is a simple train - test split of the data. But this comes at cost of both bias and variance because of the reduced data on which the models are chosen and the winner finally evaluated. If computational resources allow for it, it is common to use a nested CV procedure, which has great computational cost, and for which there are no known theoretical guarantees. There is some work attempting to debias the minimum in [Tib09B], and more recently [Gua18T], as well as a growing body of research on valid post-selection inference for specific algorithms which enable inference at lower computational cost, and with guarantees.3 3 Post-selection inference refers to a set of techniques enabling statistical inference (e.g. computing confidence intervals) after doing model selection using the data. More on these topics will follow on our website.

We focus then on CV for error estimation, but what error are we talking about?

2 What cross validation estimates

$K$-fold CV first splits $D$ into equally sized subsets $I_{j}$, $j \in \lbrace 1, \ldots, K \rbrace.$ We assume that $K$ divides $n$ for simplicity, and write $i \in I_{j}$ for $(x_{i}, y_{i}) \in I_{j}.$ Then it trains a function $\hat{f}^{(- j)}$ on all but $I_{j}$, and evaluates it on $I_{j}$ with:

\[ \operatorname{CV}_{j} (\hat{f}) := \frac{K}{n} \sum_{i \in I_{j}} l (\hat{f}^{(- j)} (x_{i}), y_{i}) . \]

The CV error is:

\begin{equation} \widehat{\operatorname{Err}}^{\operatorname{cv}} (\hat{f}) := \frac{1}{K} \sum_{j = 1}^K \operatorname{CV}_{j} (\hat{f}) . \label{eq-err-cv}\tag{3} \end{equation}

A subtle question is what exactly does $\widehat{\operatorname{Err}}^{\operatorname{cv}}$ estimate, and the widely accepted answer for over two decades is that it is $\operatorname{Err}$ and not $\operatorname{Err}_{X Y}.$ Intuitively, $\operatorname{CV}_{j}$, the inner sum in (3), is an estimate for $\operatorname{Err}_{X Y}$ for fixed $I_{j}$, and the outer sum estimates $\operatorname{Err}$, with the $I_{j}$ being different samples from the data set distribution.4 4 The estimate for $\operatorname{Err}$ is for data sets of size $n - n / K$, and typically slightly biased upwards for data sets of size $n.$ Despite this being known, detailed analyses of the mismatch between $\widehat{\operatorname{Err}}^{\operatorname{cv}}$ and $\operatorname{Err}_{X Y}$ have been lacking, and one of the results of this paper is a rigorous treatment of this question in the linear case. Specifically, they show that CV for a certain type of problem has the property that $\widehat{\operatorname{Err}}^{\operatorname{cv}}$ is independent of $\operatorname{Err}_{X Y}$ given $\boldsymbol{X}$ [Bat21C] (Theorem 1). In other words:

Theorem 1: For certain model classes,5 5 Linear Gaussian with constant noise variance, fitted with squared loss. given two datasets $D = \lbrace (X_{i}, Y_{i}) \rbrace_{i = 1}^n, D’ = \lbrace (X_{i}, Y_{i}’) \rbrace_{i = 1}^n$ sampled from the same distribution and with the same feature matrix $\boldsymbol{X}$, and $\operatorname{Err}_{X Y}$ and $\operatorname{Err}_{X Y’}$ being the true errors and $\widehat{\operatorname{Err}}^{\operatorname{cv}}_{X Y}$ and $\widehat{\operatorname{Err}}_{X Y’}^{\operatorname{cv}}$ their CV estimates, one has:

\[ (\operatorname{Err}_{X Y}, \widehat{\operatorname{Err}}^{\operatorname{cv}}_{X Y}) \overset{d}{=} (\operatorname{Err}_{X Y}, \widehat{\operatorname{Err}}^{\operatorname{cv}}_{X Y’}) . \]

«This means that for the purpose of estimating $\operatorname{Err}_{X Y}$, we have no reason to prefer using the cross-validation estimate with $(\boldsymbol{X}, \boldsymbol{Y})$ to using the cross-validation estimate with a different data set $(\boldsymbol{X}, \boldsymbol{Y}’)$, even though we wish to estimate the error of the model fit on $(\boldsymbol{X}, \boldsymbol{Y})$»

The consequence derived from this is that $\widehat{\operatorname{Err}}^{\operatorname{cv}}$ is a better estimate of an intermediate quantity $\operatorname{Err}_{X}$ than of $\operatorname{Err}_{X Y}.$

Corollary 2: Under the conditions of Theorem 1,

\[ \mathbb{E} [(\widehat{\operatorname{Err}}^{\operatorname{cv}} -\operatorname{Err}_{X Y})^2] \geqslant \mathbb{E} [(\widehat{\operatorname{Err}}^{\operatorname{cv}} -\operatorname{Err}_{X})^2], \]

where $\operatorname{Err}_{X} := \mathbb{E}_{Y} [\operatorname{Err}_{X Y} |X].$

Figure 2. [Bat21C] (Figure 3). Left: mean squared error of the CV point estimate of prediction error relative to three different estimands: $\operatorname{Err}$, $\operatorname{Err}_{X}$, and $\operatorname{Err}_{\operatorname{XY}}$ . Center: coverage of $\operatorname{Err}$, $\operatorname{Err}_{X}$, and $\operatorname{Err}_{\operatorname{XY}}$ by the naive cross-validation intervals in a homoskedastic Gaussian linear model. The nominal miscoverage rate is 10%. Each pair of points connected by a line represents 2000 replicates with the same feature matrix $\boldsymbol{X}.$ Right: 2000 replicates with the same feature matrix and the line of best fit (blue).

Finally, two additional corollaries yield the conclusions of the whole section, namely:6 6 These results hold only for a linear problem with no regularization. The authors expect CV to be closer to $\operatorname{Err}_{X Y}$ when regularization is applied.

$\widehat{\operatorname{Err}}^{\operatorname{cv}}$ is uncorrelated with $\operatorname{Err}_{X Y}$, in a certain asymptotic sense.
CV has larger error for estimating $\operatorname{Err}_{X Y}$ than for estimating $\operatorname{Err}$ or $\operatorname{Err}_{X}.$

Other estimators. The theory developed in Sections 2 and 3 of the paper applies not only to CV but to other methods as well: data splitting with refitting (train on one half of the data, evaluate on the other, then refit on the whole data set), Mallow’s $C_{p}$ and bootstrap.

3 Computing standard errors

Besides a point estimate for the error one always wants an indication of the trust that can be placed in it, i.e. a confidence interval. In order to do so, an estimate of the standard error is required. For applications one wants a confidence interval for $\operatorname{Err}_{X Y}$ and for algorithm design or benchmarking, one for $\operatorname{Err}.$ Note that even if $\operatorname{Err}=\mathbb{E} [\operatorname{Err}_{X Y}]$ and $\widehat{\operatorname{Err}}^{\operatorname{cv}}$ is (almost) unbiased for $\operatorname{Err}$, whether it estimates one or the other matters for what it is that its sampling variance informs us about.

Let $e_{i} := l (f^{(- I_{j (i)})} (X_{i}), Y_{i})$ be the error for sample $X_{i}, Y_{i}$ when trained on the $K - 1$ folds not containing it. Then

\begin{equation} \operatorname{Var} (\widehat{\operatorname{Err}}^{\operatorname{cv}}) =\operatorname{Var} \left( \frac{1}{n} \sum_{i = 1}^n e_{i} \right) \approx \frac{1}{n} e_{1}, \label{eq-variance-err-cv}\tag{4} \end{equation}

where equality would hold only if the e$_{i}$ where i.i.d. So the sample estimate for the standard error of the point estimate $\widehat{\operatorname{Err}}^{\operatorname{cv}}$ is:

\begin{equation} \widehat{\operatorname{se}}^{\operatorname{cv}} = \frac{1}{\sqrt{n}} \sqrt{\frac{1}{n - 1} \sum_{i = 1}^n (e_{i} - \widehat{\operatorname{Err}}^{\operatorname{cv}})^2} . \label{eq-naive-se}\tag{5} \end{equation}

With this, the standard confidence interval would be

\begin{equation} (\widehat{\operatorname{Err}}^{\operatorname{cv}} - z_{1 - \alpha / 2} \widehat{\operatorname{se}}^{\operatorname{cv}}, \widehat{\operatorname{Err}}^{\operatorname{cv}} + z_{1 - \alpha / 2} \widehat{\operatorname{se}}^{\operatorname{cv}}) \label{eq-naive-ci}\tag{6} \end{equation}

where $z_{1 - \alpha / 2}$ is the $1 - \alpha / 2$ quantile of the standard normal distribution, e.g. 1.96 for a 95% CI.7 7 Because in (4) we assumed that the $e_{i}$ were (roughly) independent, $\widehat{\operatorname{Err}}^{\operatorname{CV}}$ is asymptotically normal by the CLT and this confidence interval is the natural one.

However, since every point in $D$ is used both for training and testing, the independence assumption in (4) does not hold and the closer $K$ is to its maximum $n$, the higher the correlations will be, hence the total true variance. Any confidence interval built using (4) and (5) will be too small and have poor coverage.

Even if one does not do CV, but is in the optimal situation of having $K$ independent training sets and $L$ independent test sets, because for each of the training sets many errors will be computed, these errors will not be conditionally independent, given one training set. This will make the naive estimate (5) too optimistic. In the extreme case of leave-one-out CV, every sample is used $n$ times for training and testing, leading to even higher sampling variance.

This was shown in the fundamental paper [Ben04N], where the main result was a proof that any estimate of the variance of $\operatorname{CV}$ is biased and the nature of the bias is neither determined by the number of folds nor the sample size. The key is the structure of the covariance matrix of the errors $e_{i}$, see Figure 3. [Ben04N] show that it is only parametrised by three numbers,

$$ \operatorname{Var} (\widehat{\operatorname{Err}}^{\operatorname{cv}}) = \frac{1}{n^2} \sum_{i, j = 1}^n \operatorname{Cov} (e_{i}, e_{j}) $$ \begin{equation} = \frac{1}{n} \sigma^2 + \left( \frac{1}{K} - \frac{1}{n} \right) \omega + \left( 1 - \frac{1}{K} \right) \gamma, \label{eq-var-err-cv}\tag{7} \end{equation}

Figure 3. [Bat21C] (Figure 7). «Covariance structure of the CV errors. Red entries correspond to the covariance between points in the same fold, and blue entries correspond to the covariance between points in different folds.» where $\sigma^2$ is the variance of $e_{i}$ when $D$ has size $n - n / K$, $\omega$ is the in-block covariance of errors due to a common training set, and $\gamma$ is the between-blocks covariance due to the dependence between training sets $I_{k}.$ Note how increasing $K$ deteriorates the third term, and increasing $n$ the second one. There are no constraints as to how $\omega$ and $\gamma$ behave, other than being at most $| \sigma^2 |$, and in practice both are seen to be positive. Therefore, typically

\[ \operatorname{Var} (\widehat{\operatorname{Err}}^{\operatorname{cv}}) > \frac{1}{n} \operatorname{Var} (e_{1}), \]

and (5) will underestimate the standard error. Furthermore, $\gamma$ is seen to be of order $\sigma^2$, especially in the presence of outliers, irrespective of the number of folds [Ben04N] (Section 7), see Figure 4. So using (6) for decisions is fraught with dangers in practical cases.

In order to address the issue, one must then either modify CV with new sampling / splitting schemes, or add distributional assumptions. Recent work has proposed splitting the data in half, then doing CV in different variants, but this typically proves to be either too costly or too conservative in the estimates. Instead, [Bat21C] modify CV with a nested procedure. Figure 4. [Ben04N] Figure 5b. contributions of $\sigma^2$, $\omega$ and $\gamma$ to $\operatorname{Var} (\widehat{\operatorname{Err}}^{\operatorname{CV}})$ vs number of folds $K$, in the presence of outliers.

The definition of $\widehat{\operatorname{se}}.$ In practice, instead of (5), another rougher approximation is typically done. Once the $K$ values $\operatorname{CV}_{j} (\hat{f})$ have been computed, one can argue that

\begin{eqnarray*} \operatorname{Var} (\widehat{\operatorname{Err}}^{\operatorname{cv}}) & = & \operatorname{Var} \left( \frac{1}{K} \sum_{j = 1}^K \operatorname{CV}_{j} (\hat{f}) \right)\\\ & \approx & \frac{1}{K} \operatorname{Var} (\operatorname{CV}_{j} (\hat{f})), \label{eq-variance-err-cv-k}\tag{8} \end{eqnarray*}

where equality would hold if the $\operatorname{CV}_{j}$ where i.i.d. The sample estimate for the standard error is8 8 This equation is just scores.std()/sqrt(K) \[\widehat{\operatorname{se}}^{\operatorname{cv}} = \frac{1}{\sqrt{K}} \sqrt{\frac{1}{K - 1} \sum_{j = 1}^K (\operatorname{CV}_{j} - \widehat{\operatorname{Err}}^{\operatorname{cv}})^2},\]

and the standard confidence interval:

\[ (\widehat{\operatorname{Err}}^{\operatorname{cv}} - z_{1 - \alpha / 2} \widehat{\operatorname{se}}^{\operatorname{cv}}, \widehat{\operatorname{Err}}^{\operatorname{cv}} + z_{1 - \alpha / 2} \widehat{\operatorname{se}}^{\operatorname{cv}}). \]

Now, this is a different interval than (6) and given the relatively low value of $K$, the normality assumption is less justified than before. But, crucially it has the same problem of not taking dependencies into account and therefore having necessarily poor coverage.

4 Nested cross validation

A word on nomenclature. Nested CV is typically used in machine learning for simultaneous model selection and error estimation, as suggested in [Var06B]. This is not what is done here.As we explained above, the problem with the confidence interval (6) is that the true variance of $\widehat{\operatorname{Err}}^{\operatorname{cv}}$ is higher than the one in (5) because of correlations in the errors introduced by using samples both for training and testing. In the naive approximation, one assumes that $\omega = 0$ and $\gamma = 0$ in (7), but we have seen in Figure 4 that in particular the covariances $\gamma$ arising from dependencies across folds, can be of the same magnitude as the variance $\sigma^2.$

The goal of the authors is to estimate the Mean Squared Error of CV:

\[ \operatorname{MSE}_{K, n} := \mathbb{E} [(\widehat{\operatorname{Err}}^{\operatorname{cv}} -\operatorname{Err}_{X Y})^2], \]

which because of the usual bias - variance decomposition and the low bias of $\widehat{\operatorname{Err}}^{\operatorname{cv}}$ is a reasonable (and conservative) proxy for $\operatorname{Var} (\widehat{\operatorname{Err}}^{\operatorname{cv}}).$ Note that despite the results of Section 2, the authors pick MSE wrt. $\operatorname{Err}_{X Y}.$ This is both because of technical reasons, and the fact that $\operatorname{Err}_{X Y}$ is what typically matters for practical applications.

In order to introduce the algorithm we consider first a single split of the data into $D_{\operatorname{train}} = \lbrace (X_{i}, Y_{i}) \rbrace_{i \in I_{\operatorname{train}}}$ and $D_{\operatorname{out}} = \lbrace (X_{i}, Y_{i}) \rbrace_{i \in I_{\operatorname{out}}}.$ For $i \in I_{\operatorname{train}}$ we write $(\tilde{X}_{i}, \tilde{Y}_{i}) = (X_{i}, Y_{i})$ and $\hat{f}$ is obtained after training on $D_{\operatorname{train}}.$ Exactly as in (2) we define, considering $D_{\operatorname{train}}$ as a r.v.

\[ \operatorname{Err}_{\tilde{X} \tilde{Y}} (\hat{f}) := \mathbb{E}_{X, Y} [l (\hat{f} (X), Y) |D_{\operatorname{train}}], \]

and the key is then the following decomposition, for any estimate $\widehat{\operatorname{Err}}_{\tilde{X} \tilde{Y}}$ of this quantity, which is obtained only using $D_{\operatorname{train}}$:

Lemma 10: ([Bat21C] (Holdout MSE))

$$ \mathbb{E}_{D_{\operatorname{train}}} [(\widehat{\operatorname{Err}}_{\tilde{X} \tilde{Y}} -\operatorname{Err}_{\tilde{X} \tilde{Y}})^2] $$ $$ = \underbrace{\mathbb{E}_{D} [(\widehat{\operatorname{Err}}_{\tilde{X} \tilde{Y}} - \overline{e}_{\operatorname{out}})^2]}_{(a)} \enspace - \enspace \underbrace{\mathbb{E}_{D} [(\overline{e}_{\operatorname{out}} -\operatorname{Err}_{\tilde{X} \tilde{Y}})^2]}_{(b)}.$$

It is possible to estimate $(a)$ and $(b)$ without bias, hence the left hand side as well.

Algorithm [Bat21C] (Nested Cross Validation):

Split $D$ into $K$ folds with $K - 1$ building $D_{\operatorname{train}}$ and the remaining one being $D_{\operatorname{out}}$, with indices $I_{\operatorname{out}}.$
For each split $j$:
1. Compute $\varepsilon _{j} := \overline{e}_{\operatorname{in}} = \widehat{\operatorname{Err}}_{\tilde{X} \tilde{Y}}$ with $(K - 2)$-fold CV over the $K - 1$ folds in $D_{\operatorname{train}}.$
2. Train model on $D_{\operatorname{train}}$. Compute errors $e_{i}$ for all samples $(x_{i}, y_{i}) \in D_{\operatorname{out}}.$
3. Compute $\overline{e}_{\operatorname{out}} :=$mean of $\lbrace e_{i} \rbrace_{i \in I_{\operatorname{out}}}.$
4. Set $a_{j} := (\widehat{\operatorname{Err}}_{\tilde{X} \tilde{Y}} - \overline{e}_{\operatorname{out}})^2$ (estimate of $(a) $).
5. Set $b_{j} :=$empirical variance of $\lbrace e_{i} \rbrace_{i \in I_{\operatorname{out}}}$ (estimate of $(b)$).
Output $\widehat{\operatorname{MSE}} := \operatorname{mean} (a_{j}) -\operatorname{mean} (b_{j}).$
Output $\widehat{\operatorname{Err}}^{\operatorname{ncv}} := \operatorname{mean} (\varepsilon _{j}).$

The key result is that Algorithm 1 is a good approximation of the $\operatorname{MSE}$ of CV (for a smaller sample size):

Theorem 4: ([Bat21C] (Estimand of nested CV)) For a nested CV with a sample of size $n$,

\[ \mathbb{E} [\widehat{\operatorname{MSE}}] =\operatorname{MSE}_{K - 1, n - n / K} . \]

As noted before, the fact that estimation happens on $K - 1$ folds introduces some bias for the actual quantity of interest $\operatorname{MSE}_{K, n}.$ One can rescale $\widehat{\operatorname{MSE}}$ with the factor $\tfrac{K - 1}{K}$, although is just a heuristic with no theoretical guarantee.

For the same reason, $\widehat{\operatorname{Err}}^{\operatorname{ncv}}$ is unbiased for $\operatorname{Err}$ with data set of size $n (K - 2) / K$ but biased for size $n.$ A debiasing strategy is suggested with the estimator:

\[ \hat{b}^{\operatorname{ncv}} := \left( 1 + \frac{K - 2}{K} \right) (\widehat{\operatorname{Err}}^{\operatorname{ncv}} - \widehat{\operatorname{Err}}^{\operatorname{cv}}), \]

and finally, the confidence interval for $\operatorname{Err}$ becomes

\[ \hspace{-2em} (\widehat{\operatorname{Err}}^{\operatorname{ncv}} - \hat{b}^{\operatorname{ncv}} - z_{1 - \alpha / 2} \widehat{\operatorname{se}}^{\operatorname{ncv}}, \]

\[ \hspace{2em} \widehat{\operatorname{Err}}^{\operatorname{ncv}} - \hat{b}^{\operatorname{ncv}} + z_{1 - \alpha / 2} \widehat{\operatorname{se}}^{\operatorname{ncv}}), \]

with $\widehat{\operatorname{se}}^{\operatorname{ncv}} := \sqrt{\frac{K - 1}{K} \widehat{\operatorname{MSE}}}.$

On computational cost. Obviously, the amount of computation required to produce $\widehat{\operatorname{Err}}^{\operatorname{ncv}}$ can be orders of magnitude higher than simple CV, depending on the number of folds. Because of the embarrassingly parallel nature of the problem this can be a minor issue for applications where CV itself runs rather quickly, but makes nested CV inpracticable when it does not.

5 Results and conclusions

The authors test Nested CV for classification and regression with synthetic and real data sets. For binary classification, they use a sparse logistic data generating process $\mathbb{P} (Y_{i} = 1| X_{i} = x_{i}) = \sigma (- x_{i}^{\top} \theta)$, for $i \in \lbrace 1, \ldots, n \rbrace$, $X_{i} \sim \mathcal{N} (0, I_{p})$, and $\theta := c (1, 1, 1, 1, 0, \ldots, 0) \in \mathbb{R}^p,$ $c > 0$ chosen to obtain different Bayes risks.9 9 After some computation one finds the Bayes risk to be the integral of $2 \sigma (- \theta^\top x) \mathcal{N} (x ; 0, I_{p})$ over $\lbrace \theta \cdot x \leqslant 0 \rbrace.$ Solving for $\theta$ provides the required value of $c.$ These are optimal lower bounds for $\operatorname{Err}.$

In the low dimensional regime ($p = 20, n = 100$), with $c$ chosen to have a Bayes error of 33% and an unregularized logistic regression model, they obtain miscoverage rates for $\operatorname{Err}$ and $\operatorname{Err}_{X Y}$ close to the nominal values of 10%, as seen in Table 5.

Target	Point estim.		Miscoverage
			CV		NCV
	CV	NCV	Hi	Lo	Hi	Lo
Err_X⁣Y	39.6%	39%	10%	8%	5%	3%
Err	”	”	9%	8%	3%	4%

([Bat21C] (Table 1, excerpt)) Performance of cross-validation (CV) and nested cross-validation (NCV) for low-dim logistic regression. «A “Hi” miscoverage is one where the confidence interval is too large and the point estimate falls below the interval; conversely for a “Lo” miscoverage. The standard error in each coverage estimate reported is about 0.5%. The “Target” column indicates the target of coverage. The intervals are always generated identically(…)»

Similar experiments in the high dimensional setting ($n \in \lbrace 90, 200 \rbrace, p = 1000$), with $\ell _{1}$ regularization shows that the intervals obtained through NCV show miscoverage much closer to nominal with “high” and “low” rates of 3 to 6% each for NCV as opposed to 10 to 20% for CV.

In the linear regression setting, with $p = 20$ and $n > 100$ and a model fitted with ordinary least squares, NCV provides again better coverage than CV up to dimension around 400. From here on, both methods perform similarly. As a matter of fact, several recent asymptotic results in the literature make it expected that the violations in coverage vanish as $n \rightarrow \infty$ for fixed $p.$ As a general rule of thumb, one can expect CV to perform well when $n / p$ is large and regularization is used, so NCV should be used in high dimensional settings or low data regimes.

Finally, a high dimensional sparse linear model with a lasso estimator repeats the conclusions, although both methods fail at very low sample numbers. Figure 5. ([Bat21C] (Fig. D.2)) «Width of nested CV intervals relative to the width of the naive CV intervals»

Further experiments with real data sets use demographic and radar measurements again with similar results. Please see section 6 of the paper.

References

[Tib09B]

A bias correction for the minimum error rate in cross-validation, Ryan J. Tibshirani, Robert Tibshirani.

Jun 2009

Tuning parameters in supervised learning problems are often estimated by cross-validation. The minimum value of the cross-validation error can be biased downward as an estimate of the test error at that same value of the tuning parameter. We propose a simple method for the estimation of this bias that uses information from the cross-validation process. As a result, it requires essentially no …

[Gua18T]

Test Error Estimation after Model Selection Using Validation Error, Leying Guan.

Feb 2018

When performing supervised learning with the model selected using validation error from sample splitting and cross validation, the minimum value of the validation error can be biased downward. We propose two simple methods that use the errors produced in the validating step to estimate the test error after model selection, and we focus on the situations where we select the model by minimizing the …

[Bat21C]

Cross-validation: what does it estimate and how well does it do it?, Stephen Bates, Trevor Hastie, Robert Tibshirani.

Apr 2021

Cross-validation is a widely-used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather it estimates the average prediction error …

[Ben04N]

No Unbiased Estimator of the Variance of K-Fold Cross-Validation, Yoshua Bengio, Yves Grandvalet.

Dec 2004

Most machine learning researchers perform quantitative experiments to estimate generalization error and compare the performance of different algorithms (in particular, their proposed algorithm). In order to be able to draw statistically convincing conclusions, it is important to estimate the uncertainty of such estimates. This paper studies the very commonly used K-fold cross-validation estimator …

[Var06B]

Bias in error estimation when using cross-validation for model selection, Sudhir Varma, Richard Simon.

Feb 2006

Cross-validation (CV) is an effective method for estimating the prediction error of a classifier. Some recent articles have proposed methods for optimizing classifiers by choosing classifier parameter values that minimize the CV error estimate. We have evaluated the validity of using the CV error estimate of the optimized classifier as an estimate of the true error expected on independent data.