Improving cooperative game theory-based data valuation via data utility learning

Data valuation is the process of assigning a value to data points representing their contribution to the performance of a given model wrt. some metric, as part of a fixed dataset. Popular techniques are based on cooperative game theory concepts like Shapley value or the least core which, despite their interesting theoretical properties, have high computational cost: To obtain the value of one training point $x \in D_{\operatorname{train}}$, one must retrain for each of the $2^n$ subsets $S \in \mathcal{P} (D_{\operatorname{train}})$ and evaluate the performance of the obtained model (on a fixed validation set). This performance is called the utility of $S$. Typically, some kind of Monte Carlo approximation is used to avoid retraining $2^n$ times. We provide an introduction to and survey of key developments in the field in our series page on data valuation.

In [Wan22I] the authors propose retraining only a number $m_{\operatorname{train}} \ll 2^n$ of times. Each subset $S_{i}$ seen is identified with a binary 1-hot encoded vector $b_{i} \in \lbrace 0, 1\rbrace^n$ of the indices present in $S_{i}$. After computing the utility for each of the $S_{i}$, one has a collection of utility samples, which are the tuples $\mathcal{S}_{\operatorname{train}} := \lbrace (b_{i}, u (b_{i})) : i = 1, \ldots, m_{\operatorname{train}} \rbrace$. These can be used to learn a surrogate utility function $\hat{u}$ which is then evaluated on unseen subsets of $D_{\operatorname{train}}$ to obtain $\mathcal{S}_{\operatorname{pred}} := \lbrace (b_{i}, \hat{u} (b_{i})) : i = 1, \ldots, m_{\operatorname{pred}} \rbrace$. The value is then estimated using $\mathcal{S}_{\operatorname{train}} \cup \mathcal{S}_{\operatorname{pred}}$. Since each evaluation of $\hat{u}$ is typically much cheaper than a full retraining, the speed-ups can be considerable.

The authors also prove some average and worst case bounds for the error approximating $u$ (i.e. when the noise $\hat{u} - u$ is smooth and when it is adversarially chosen, respectively), and demonstrate empirically that using the estimated utility samples $\mathcal{S}_{\operatorname{pred}}$ in Monte Carlo methods results in decent approximations to the true value at a reduced cost. Check Theorems 1 and 2 in the paper for the details.

We will be sure to include this technique in our upcoming pyDVL library for everybody to try out in their projects.

The Shapley value (SV) and Least core (LC) are classic methods in cooperative game theory for cost/profit sharing problems. Both methods have recently been proposed as a principled solution for data valuation tasks, i.e., quantifying the contribution of individual datum in machine learning. However, both SV and LC suffer computational challenges due to the need for retraining models on …

References

In this series →