In this upcoming **TfL review** we provide a self-contained
introduction to data valuation, survey recent developments together with their
strengths and weaknesses, and show how these methods are applied to problems of:

- Model interpretation and debugging.
- Model performance improvement.
- Batch active learning.
- Data source improvement.
- Sensitivity / robustness analysis.
- Compensation for data providers.

Rather than better models or more data, *good* data is very often the key to a
successful application of machine learning. Sophisticated models can only go so
far and, almost invariably in real business applications, improvements in data
acquisition, annotation and cleaning are a much better investment of resources
than implementing complex models. However, only recently have rigorous and
useful notions of *value* for data appeared in the ML literature.

In a nutshell, **data valuation** is the task of assigning a number to each
element of a training set $D$ which reflects its contribution to the final
performance of a model trained on $D$. Note that this value is not an invariant
of the datum, but a function of three factors:

- The dataset $D$, or ideally, the distribution it was sampled from. By this we mean that “value” would ideally be the (expected) contribution of a data point to any random set $D$ sampled from the same distribution.
- The algorithm $\mathcal{A}$ mapping the data $D$ to some estimator $f$ in a model class $\mathcal{F}$. E.g. MSE minimization to find the parameters of a linear model.
- The performance metric of interest for the problem. E.g. the $R^2$ score over a test set, or the MSE over a few data points.
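To make these three ingredients concrete, here is a minimal sketch of the simplest valuation scheme, leave-one-out (LOO): the value of a point is the drop in utility when it is removed from the training set. Everything here (the toy data, `fit_slope`, `utility`) is illustrative and not taken from any particular library; the algorithm $\mathcal{A}$ is least squares on a no-intercept linear model, and the metric is the negative MSE on a held-out test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from y = 2x + noise (hypothetical example).
x_train = rng.uniform(-1, 1, size=20)
y_train = 2.0 * x_train + rng.normal(scale=0.1, size=20)
x_test = rng.uniform(-1, 1, size=50)
y_test = 2.0 * x_test

def fit_slope(x, y):
    """The algorithm A: least-squares slope of a no-intercept linear model."""
    return float(x @ y / (x @ x))

def utility(idx):
    """The performance metric: negative MSE on the test set,
    for a model trained on the subset of D given by idx."""
    w = fit_slope(x_train[idx], y_train[idx])
    return -float(np.mean((w * x_test - y_test) ** 2))

full_idx = np.arange(len(x_train))
full_utility = utility(full_idx)

# LOO value of point i: how much utility is lost when i is dropped.
loo_values = np.array([
    full_utility - utility(np.delete(full_idx, i))
    for i in range(len(x_train))
])
```

Note how changing any of the three factors, i.e. resampling the data, swapping the fitting procedure, or replacing the metric, changes `loo_values`: the value is a property of the whole triple, not of the datum alone.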

A related, more local, concept is that of the *influence* that training samples
have on a test sample $z$. Roughly speaking, the **influence function**
reflects how much the loss for $z$ would change if a training point were
slightly perturbed or up-weighted. A key feature is that it can be computed
without retraining the model.
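A short sketch of this retraining-free property, for least-squares linear regression with the standard first-order influence formula $\mathcal{I}(z_i, z) = -\nabla_\theta L(z)^\top H^{-1} \nabla_\theta L(z_i)$, where $H$ is the Hessian of the empirical risk. The data and dimensions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

# Fit once by least squares; influences below need no further retraining.
theta = np.linalg.lstsq(X, y, rcond=None)[0]

# Hessian of the mean squared loss: H = (2/n) X^T X.
H = 2.0 / n * X.T @ X

# A test sample z = (x_t, y_t).
x_t = rng.normal(size=d)
y_t = float(x_t @ theta_true)

# Gradients of the squared loss at the fitted parameters.
grad_test = 2.0 * (x_t @ theta - y_t) * x_t        # shape (d,)
grads_train = 2.0 * (X @ theta - y)[:, None] * X   # shape (n, d)

# Influence of each training point on the test loss at z.
influences = -grads_train @ np.linalg.solve(H, grad_test)
```

The expensive ingredient is the inverse-Hessian-vector product $H^{-1} \nabla_\theta L(z)$, which is trivial here ($d = 3$) but is precisely the second-order computation that becomes costly for large models.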

Unfortunately, most methods for data valuation have a high computational cost, e.g. due to the need to compute second-order derivatives, or to a combinatorial explosion forcing one to retrain the model a number of times exponential in the number of samples.
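To see the combinatorial explosion concretely: a naive valuation scheme that measures marginal contributions over every subset of $D$ must train the model once per subset, i.e. $2^n$ times (a hypothetical back-of-the-envelope count, not any specific method):

```python
from itertools import combinations

n = 20  # even a tiny training set

# One model fit per subset of the n points, counted exhaustively.
num_fits = sum(
    len(list(combinations(range(n), k))) for k in range(n + 1)
)
print(num_fits)  # 2**20 = 1_048_576 fits for just 20 points
```

This is why practical methods rely on approximations such as sampling subsets rather than enumerating them.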