Data valuation techniques: applications, advances and problems

The attribution of value to data points can help in data collection, active learning, gauging model robustness, improving model performance and paying providers for their data. We review recent developments in the field and evaluate whether they are ripe for use in applications.

In this upcoming TfL review we provide a self-contained introduction to data valuation, investigate recent developments and their strengths and weaknesses, and show how these methods are applied to problems of:

  • Model interpretation and debugging.
  • Model performance improvement.
  • Batch active learning.
  • Data source improvement.
  • Sensitivity / robustness analysis.
  • Compensation for data providers.

Rather than better models or more data, good data is very often the key to a successful application of machine learning. Sophisticated models can only go so far and, almost invariably in real business applications, improvements in data acquisition, annotation and cleaning are a much better investment of resources than implementing complex models. However, it is only recently that rigorous and useful notions of value for data have appeared in the ML literature.

In a nutshell, data valuation is the task of assigning a number to each element of a training set $D$ that reflects its contribution to the final performance of a model trained on $D$. Note that this value is not an invariant of the datum, but a function of three factors (a toy leave-one-out sketch follows the list):

  • The dataset $D$, or ideally, the distribution it was sampled from. By this we mean that “value” would ideally be the (expected) contribution of a data point to any random set $D$ sampled from the same distribution.
  • The algorithm $\mathcal{A}$ mapping the data $D$ to some estimator $f$ in a model class $\mathcal{F}$. E.g. MSE minimization to find the parameters of a linear model.
  • The performance metric of interest for the problem. E.g. the $R^2$ score over a test set, or the MSE over a few data points.
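To make the dependence on these three factors concrete, here is a minimal leave-one-out (LOO) sketch: the value of a point is the drop in performance when it is removed from the training set and the model is retrained. The dataset, the algorithm (logistic regression) and the metric (test accuracy) are all placeholder choices; swapping any of them yields different values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder dataset D, algorithm A (logistic regression) and metric (test accuracy).
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def utility(idx):
    """Metric of interest for a model trained on the subset D[idx] of the training data."""
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    return model.score(X_test, y_test)

all_idx = np.arange(len(X_train))
u_full = utility(all_idx)

# Leave-one-out value of point i: drop in utility when i is removed from D.
loo_values = np.array([u_full - utility(np.delete(all_idx, i)) for i in all_idx])

print("Highest-value points:", np.argsort(loo_values)[-5:])
```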

A related, more local, concept is that of the influence that training samples have on a test sample $z$. Roughly speaking, the influence function reflects how much the loss at $z$ would change if a training point were slightly up-weighted or perturbed. A key feature is that it can be estimated without retraining the model.
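As a rough illustration, here is a sketch of this idea for ridge regression with squared loss, where per-sample gradients and the Hessian of the empirical loss are available in closed form. It uses the standard first-order approximation $-\nabla_\theta \ell(z)^\top H^{-1} \nabla_\theta \ell(z_i)$ for the effect of up-weighting training point $i$ on the loss at $z$; the data and regularization strength below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data; the model is ridge regression fit by MSE minimization.
n, d, lam = 100, 5, 1e-2
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Closed-form fit: theta = (X^T X / n + lam I)^{-1} X^T y / n
H = X.T @ X / n + lam * np.eye(d)      # Hessian of the regularized empirical loss
theta = np.linalg.solve(H, X.T @ y / n)

def grad_loss(x, y_):
    """Gradient of the per-sample squared loss 0.5 * (x^T theta - y)^2 w.r.t. theta."""
    return x * (x @ theta - y_)

# Influence of each training point on the loss at one test point z = (x_test, y_test):
# I_i ≈ -grad_loss(z)^T H^{-1} grad_loss(z_i). No retraining needed.
x_test, y_test = rng.normal(size=d), 0.0
h_inv_grad_test = np.linalg.solve(H, grad_loss(x_test, y_test))
influences = -np.array([grad_loss(X[i], y[i]) @ h_inv_grad_test for i in range(n)])

print("Most influential training points:", np.argsort(np.abs(influences))[-5:])
```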

Unfortunately, most methods for data valuation have a high computational cost, e.g. due to the need to compute second-order derivatives, or to a combinatorial explosion which requires retraining the model a number of times that grows exponentially with the number of samples.
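To see where the combinatorial explosion comes from, a prominent example is the Shapley value of a data point, which averages its marginal contribution to a utility (here: test accuracy after retraining) over all subsets of the remaining points. The brute-force sketch below needs on the order of $2^{n-1}$ retrainings per point and is only feasible for toy sizes; the data, model and utility are all placeholders.

```python
from itertools import combinations
from math import comb

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Tiny toy problem: exact Shapley values need 2^(n-1) retrainings per point.
n = 8
X_train = rng.normal(size=(n, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(50, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

def utility(subset):
    """Test accuracy of a model retrained on the given subset of training indices."""
    subset = list(subset)
    if len(np.unique(y_train[subset])) < 2:  # cannot fit a classifier on one class
        return 0.5
    model = LogisticRegression().fit(X_train[subset], y_train[subset])
    return model.score(X_test, y_test)

def shapley(i):
    """Average marginal contribution of point i over all subsets of the other points."""
    others = [j for j in range(n) if j != i]
    value = 0.0
    for k in range(n):
        for S in combinations(others, k):
            weight = 1.0 / (n * comb(n - 1, k))
            value += weight * (utility(S + (i,)) - utility(S))
    return value

values = [shapley(i) for i in range(n)]
print(np.round(values, 3))
```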

In this series