Improving cooperative game theory-based data valuation via data utility learning

Data valuation is the process of assigning a value to data points representing their contribution to the performance of a given model wrt. some metric, as part of a fixed dataset. Popular techniques are based on cooperative game theory concepts like Shapley value or the least core which, despite their interesting theoretical properties, have high computational cost: To obtain the value of one training point $x \in D_{\operatorname{train}}$, one must retrain for each of the $2^n$ subsets $S \in \mathcal{P} (D_{\operatorname{train}})$ and evaluate the performance of the obtained model (on a fixed validation set). This performance is called the utility of $S$. Typically, some kind of Monte Carlo approximation is used to avoid retraining $2^n$ times. We provide an introduction to and survey of key developments in the field in our review on data valuation.

In [Wan22I] the authors propose retraining only a number $m_{\operatorname{train}} \ll 2^n$ of times. Each subset $S_{i}$ seen is identified with a binary 1-hot encoded vector $b_{i} \in \lbrace 0, 1\rbrace^n$ of the indices present in $S_{i}$. After computing the utility for each of the $S_{i}$, one has a collection of utility samples, which are the tuples $\mathcal{S}_{\operatorname{train}} := \lbrace (b_{i}, u (b_{i})) : i = 1, \ldots, m_{\operatorname{train}} \rbrace$. These can be used to learn a surrogate utility function $\hat{u}$ which is then evaluated on unseen subsets of $D_{\operatorname{train}}$ to obtain $\mathcal{S}_{\operatorname{pred}} := \lbrace (b_{i}, \hat{u} (b_{i})) : i = 1, \ldots, m_{\operatorname{pred}} \rbrace$. The value is then estimated using $\mathcal{S}_{\operatorname{train}} \cup \mathcal{S}_{\operatorname{pred}}$. Since each evaluation of $\hat{u}$ is typically much cheaper than a full retraining, the speed-ups can be considerable.

The authors also prove some average and worst case bounds for the error approximating $u$ (i.e. when the noise $\hat{u} - u$ is smooth and when it is adversarially chosen, respectively), and demonstrate empirically that using the estimated utility samples $\mathcal{S}_{\operatorname{pred}}$ in Monte Carlo methods results in decent approximations to the true value at a reduced cost. Check Theorems 1 and 2 in the paper for the details.

We will be sure to include this technique in our upcoming pyDVL library for everybody to try out in their projects.

# References

• Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning, Tianhao Wang, Yu Yang, Ruoxi Jia. (2022)