Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning

In data valuation, relaxing the Shapley axioms leads to weighting schemes for marginal utilities that improve the quality of values

Data valuation is the process of assigning to each sample a number that reflects its contribution to final model performance. A typical application is to remove low-value samples to increase accuracy; another is to examine high-value ones to improve data acquisition or model design (if you want to try out these ideas, the TransferLab recently released pyDVL, a Python library aspiring to provide the tools for easy and robust data valuation).

The most popular methods rely on Shapley values, or variations thereof. The value of a point is a weighted sum of its marginal contributions (or “marginal utilities”) to all possible subsets of the training set:

$$v_u(x_i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{x_i\}} \binom{n-1}{|S|}^{-1} [u(S \cup \{x_i\}) - u(S)],$$

where $u(S)$ is the utility of a subset $S$ of the training set $D$, typically the accuracy of the model trained on $S$, measured on a held-out test set.
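For concreteness, here is a minimal brute-force sketch of this formula on a toy dataset, using the test accuracy of a logistic regression as the utility. All names (`utility`, `exact_shapley`) are illustrative and not pyDVL's API, and the exponential number of subsets makes this feasible only for a handful of points.

```python
# Illustrative brute-force sketch of the Shapley formula above, not pyDVL's API.
from itertools import combinations
from math import comb

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=40, random_state=0)
X_train, y_train, X_test, y_test = X[:8], y[:8], X[8:], y[8:]
n = len(X_train)


def utility(subset: tuple) -> float:
    """u(S): test accuracy of a model trained on subset S (0 for degenerate S)."""
    idx = list(subset)
    if len(set(y_train[idx])) < 2:  # need both classes to fit a classifier
        return 0.0
    model = LogisticRegression().fit(X_train[idx], y_train[idx])
    return model.score(X_test, y_test)


def exact_shapley(i: int) -> float:
    """Weighted sum of marginal utilities of x_i over all subsets not containing it."""
    rest = [j for j in range(n) if j != i]
    value = 0.0
    for size in range(n):  # size = |S|, from 0 to n - 1
        for S in combinations(rest, size):
            value += (utility(S + (i,)) - utility(S)) / comb(n - 1, size)
    return value / n


values = [exact_shapley(i) for i in range(n)]
```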

This notion of value is unique in fulfilling certain classical axioms. One of them, called “efficiency”, requires that the values sum to the total utility. This is particularly desirable when values are used to distribute resources, e.g. in economics, but it is less important in ML, where a ranking is typically sufficient.

Figure 1: Illustration of the normalized weight $\tilde{w}^{(n)}_{\alpha, \beta}(j)$ for various pairs of $\alpha, \beta$ when $n = 200$.

Consequently, it makes sense to consider the class of all values that fulfil all but this axiom. This class of so-called semi-values turns out to be fully characterised as weighted sums of marginal utilities over all possible subsets, like Shapley, with weights fulfilling a simple condition.

In [Kwo22B] (presented this year at AISTATS), the authors choose weights using the Beta function to up- or down-weight each cardinality of the subset $S$. The value of a point is computed as:

$$v_{\alpha, \beta} (i) := \frac{1}{n} \sum_{j = 1}^n c_j \sum_{S \subset D^{(j)}_{- i}} \tilde{w}^{(n)}_{\alpha, \beta} (j) (u (S_{+ i}) - u (S)),$$

for some coefficients $c_j$ and $\tilde{w}_{\alpha, \beta}$, and where $D^{(j)}_{-i}$ is the set of all subsets of $D \setminus \{x_i\}$ of size $j-1$ (notation slightly different from that of the paper for simplicity). The parameters $\alpha$ and $\beta$ determine how different set cardinalities are weighted, as seen in Figure 1.
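Here is a short sketch (not the paper's code) of how such per-cardinality weights can be computed, assuming they are proportional to $\binom{n-1}{j-1}\,\mathrm{Beta}(j+\beta-1,\, n-j+\alpha)$ and normalizing them to sum to one; the paper's exact normalization constant may differ. With $(\alpha, \beta) = (1, 1)$ this reduces to the uniform weighting of the Shapley value, while $(16, 1)$ concentrates on small subsets and $(1, 16)$ on large ones, qualitatively reproducing Figure 1:

```python
# Sketch, not the paper's code: weights assumed proportional to
# C(n-1, j-1) * Beta(j + beta - 1, n - j + alpha), normalized to sum to one.
import numpy as np
from scipy.special import betaln, gammaln  # log Beta / log Gamma, for numerical stability


def beta_weights(n: int, alpha: float, beta: float) -> np.ndarray:
    """Normalized weight of each set cardinality j = 1, ..., n."""
    j = np.arange(1, n + 1)
    log_binom = gammaln(n) - gammaln(j) - gammaln(n - j + 1)  # log C(n-1, j-1)
    log_w = log_binom + betaln(j + beta - 1, n - j + alpha) - betaln(alpha, beta)
    w = np.exp(log_w - log_w.max())  # stabilize before exponentiating
    return w / w.sum()


for a, b in [(1, 1), (16, 1), (1, 16)]:
    print((a, b), beta_weights(200, a, b)[:4].round(4))
```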

Beta Shapley is shown to (slightly) outperform the state of the art on several tasks on which valuation methods are typically benchmarked, such as: 1) detecting mislabeled training data; 2) learning with subsamples; and 3) identifying the points whose addition or removal has the largest positive or negative impact on the model (Figure 2).

Figure 2: Effect of addition and removal of high or low value points, for a classification task and 6 valuation methods.

One reason why $B(16,1)$ seems to work better in most cases is that it puts more weight on small set cardinalities, for which the signal (with respect to value) is stronger because of the diminishing returns of adding points to large sets. However, the authors do not comment on the high variance of the utility that is to be expected at very small set sizes. One would expect a different choice of parameters, putting very little weight on, e.g., $j=1$, to perform better. As a matter of fact, their implementation (which does not use the formula above, but samples permutations, as is usually done) skips low set sizes altogether. Sadly, other important implementation details, such as stopping criteria or the number of iterations, are not mentioned.
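To illustrate the sampling approach just mentioned, here is a minimal Monte Carlo sketch (not the authors' implementation). It relies on the fact that a semi-value with normalized per-cardinality weights $w(j)$ equals the expectation, over uniformly random permutations, of $n\,w(j)$ times the marginal contribution of a point inserted at position $j$, and it skips the smallest set sizes, where the utility is noisiest. It reuses the illustrative `utility` and `beta_weights` helpers from the sketches above:

```python
# Sketch, not the authors' implementation; reuses the illustrative
# `utility` and `beta_weights` helpers defined above.
import numpy as np

rng = np.random.default_rng(0)


def mc_beta_shapley(n: int, alpha: float, beta: float,
                    n_permutations: int = 500, min_size: int = 2) -> np.ndarray:
    """Estimate values for all n points; skipping j < min_size trades a small
    truncation bias for much lower variance."""
    w = beta_weights(n, alpha, beta)  # weight of cardinality j at index j - 1
    values = np.zeros(n)
    for _ in range(n_permutations):
        permutation = rng.permutation(n)
        prev_u = utility(())  # utility of the empty set
        for pos, i in enumerate(permutation):
            new_u = utility(tuple(permutation[: pos + 1]))
            j = pos + 1  # cardinality of S ∪ {x_i}
            if j >= min_size:  # skip the noisiest, smallest subsets
                values[i] += n * w[j - 1] * (new_u - prev_u)
            prev_u = new_u
    return values / n_permutations  # estimates the sum over the kept cardinalities


ranking = np.argsort(-mc_beta_shapley(n, alpha=16, beta=1))  # highest-value points first
```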

The pressing issue of the variance of the utility, which we have encountered while developing and using pyDVL, is finally addressed in a recent paper proposing a worst-case strategy, namely constant weights. A pill on the subject will appear shortly.

References

[Kwo22B] Yongchan Kwon, James Zou. Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning. AISTATS 2022.
