Resolving Training Biases via Influence-based Data Relabeling

High-quality datasets are essential for training supervised models successfully. However, no matter how much attention is given to data cleaning, some errors are bound to remain, and they can degrade performance even on relatively simple tasks.

In recent years, influence functions have re-emerged as a useful tool to estimate the impact of data samples on a model’s predictions, thanks to works such as [Koh17U]. First introduced in the 1970s as a tool for robust statistics (see e.g. [Ham74I]), they approximate how the model’s predictions would change if a given training point were up-weighted, using derivatives of the loss at that sample rather than retraining the model.
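The estimate described above can be sketched for a small L2-regularised logistic regression, where the Hessian is cheap to form explicitly. This is a minimal illustration of the up-weighting formula popularised in [Koh17U], not the implementation used in any of the cited works; the function names, the regularisation strength `l2` and the toy data are assumptions made for the example. The sign convention is also an assumption: here a positive score means that up-weighting the training point is estimated to increase the validation loss.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logreg(X, y, l2=1e-2, steps=50):
    """Fit L2-regularised logistic regression (no intercept) with Newton's method."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / n + l2 * theta
        hess = (X.T * (p * (1 - p))) @ X / n + l2 * np.eye(d)
        theta -= np.linalg.solve(hess, grad)
    return theta

def influence_on_val_loss(X_tr, y_tr, X_val, y_val, theta, l2=1e-2):
    """One score per training point: -g_val^T H^{-1} g_i.

    g_i is the gradient of the (unregularised) loss at training point i,
    g_val the gradient of the mean validation loss, H the Hessian of the
    regularised training objective at theta.
    """
    n, d = X_tr.shape
    p_tr = sigmoid(X_tr @ theta)
    hess = (X_tr.T * (p_tr * (1 - p_tr))) @ X_tr / n + l2 * np.eye(d)
    grads_tr = X_tr * (p_tr - y_tr)[:, None]  # per-sample gradients, shape (n, d)
    g_val = X_val.T @ (sigmoid(X_val @ theta) - y_val) / len(y_val)
    return -grads_tr @ np.linalg.solve(hess, g_val)

# Toy 1-D data (hypothetical): negative x -> class 0, positive x -> class 1.
X_tr = np.array([[-2.0], [-1.0], [1.0], [2.0], [3.0]])
y_tr = np.array([0.0, 0.0, 1.0, 1.0, 0.0])  # last label deliberately flipped
X_val = np.array([[-1.5], [1.5]])
y_val = np.array([0.0, 1.0])

theta = fit_logreg(X_tr, y_tr)
scores = influence_on_val_loss(X_tr, y_tr, X_val, y_val, theta)
print(scores)  # only the flipped point (index 4) gets a positive score
```

For larger models the Hessian cannot be formed or inverted directly, which is why scalable approximations are needed in practice.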

Data resampling, in turn, is a widely used strategy to deal with harmful samples. It re-weights input data based on their training loss: samples with higher losses are assumed to have corrupted labels, so it can be beneficial to down-weight them during training. Nevertheless, such loss-based resampling methods are known to have limitations, e.g. instability when a large portion of the data is mislabelled (for details see [Zha16U]).
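As a rough illustration of loss-based re-weighting, the sketch below assigns each sample a weight that decays with its training loss. The exponential form and the `temperature` parameter are assumptions chosen for simplicity, not a scheme taken from the cited works, and the predicted probabilities are made-up values.

```python
import numpy as np

def per_sample_log_loss(p, y, eps=1e-12):
    """Binary cross-entropy of each sample, given predicted P(y=1)."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def loss_based_weights(losses, temperature=1.0):
    """Weights that decrease with the loss, normalised to have mean 1."""
    w = np.exp(-(losses - losses.min()) / temperature)
    return w * len(w) / w.sum()

# Hypothetical predictions for four samples that are all labelled 1.
p = np.array([0.9, 0.8, 0.1, 0.95])
y = np.array([1, 1, 1, 1])

losses = per_sample_log_loss(p, y)
weights = loss_based_weights(losses)
print(weights)  # the third sample (highest loss) receives the smallest weight
```

The weakness mentioned above is visible in the construction: a model that has already fitted the wrong labels assigns them a low loss, so they keep a high weight.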

To address these limitations, influence functions have recently been used in place of training losses in the resampling scheme. Inspired by the success of such approaches, [Kon22R] goes one step further and re-labels the harmful data points (instead of just decreasing their weight) based on the results of the influence analysis. The resulting approach, named RDIA, is available on GitHub.
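The idea of re-labelling instead of down-weighting can be sketched as follows for binary labels. This is a deliberate simplification: it flips the label of every point flagged as harmful, whereas the actual relabelling function of RDIA is defined in [Kon22R] and is not reproduced here. The function name, the `threshold` parameter, the sign convention (positive score means harmful) and the score values are all assumptions made for the example.

```python
import numpy as np

def relabel_harmful(y, scores, threshold=0.0):
    """Flip the binary label of every sample whose influence score exceeds threshold.

    Returns the new label array and a boolean mask of the relabelled samples;
    the input array is left unchanged.
    """
    y = np.asarray(y)
    harmful = np.asarray(scores) > threshold
    y_new = np.where(harmful, 1 - y, y)
    return y_new, harmful

# Hypothetical labels and influence scores (positive = harmful).
y_train = np.array([0, 0, 1, 1, 0])
scores = np.array([-0.3, -0.1, -0.2, -0.4, 0.9])

y_fixed, harmful = relabel_harmful(y_train, scores)
print(y_fixed)  # only the last label is changed
```

Unlike down-weighting or removal, the flagged samples keep contributing their features to training, only with a different target.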

Figure 1: Average test loss of several re-labelling methods on four different datasets.

But why would this be better than just removing the bad samples? In Figure 1, four datasets are progressively corrupted with increasing amounts of noise (reported on the x-axis). Besides RDIA, the methods compared are:

ERM: empirical risk minimization, i.e. the usual training on the full dataset.
Random: randomly selects training samples and changes their labels.
UIDS: introduced in [Wan20L].
Dropout: removes all training data with a negative influence on the validation set.
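Two of the baselines listed above are simple enough to sketch for binary labels. The code below is only an illustration of the descriptions given here, not the experimental code of [Kon22R]; the function names, the `fraction` parameter and the toy values are assumptions, and "negative influence" is encoded as a positive score (estimated increase in validation loss).

```python
import numpy as np

def random_relabel(y, fraction, rng):
    """Random baseline: flip the labels of a randomly chosen fraction of samples."""
    y_new = np.asarray(y).copy()
    k = int(round(fraction * len(y_new)))
    idx = rng.choice(len(y_new), size=k, replace=False)
    y_new[idx] = 1 - y_new[idx]
    return y_new, idx

def drop_harmful(X, y, scores):
    """Dropout baseline: keep only samples that are not flagged as harmful."""
    keep = np.asarray(scores) <= 0
    return X[keep], y[keep]

# Hypothetical toy data and influence scores (positive = harmful).
X = np.arange(5, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 0])
scores = np.array([-0.3, -0.1, -0.2, -0.4, 0.9])

y_rand, flipped_idx = random_relabel(y, fraction=0.4, rng=np.random.default_rng(0))
X_kept, y_kept = drop_harmful(X, y, scores)
print(len(flipped_idx), X_kept.shape)
```

Removing samples shrinks the training set, while random relabelling ignores the validation signal entirely; both serve as reference points for influence-guided relabelling.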

RDIA performs better than all the other methods and shows good results even at very high noise levels. This is somewhat expected, since it is better at leveraging the information coming from the validation data and corrects the labels in the training set accordingly.

While the applicability of this approach may be limited in practice, moving towards a data-centric development of ML models could yield many benefits. Model training and data cleaning, which are typically separate steps of the MLOps pipeline, are slowly becoming part of a two-legged iterative process in which the errors of one lead to the refinement of the other. The paper shows that this can yield surprisingly large improvements in model accuracy.

References

[Koh17U] How can we explain the predictions of a black-box model? In this paper, we use influence functions, a classic technique from robust statistics, to trace a model’s prediction through the learning algorithm and back to its training data, thereby identifying training points most responsible for a given prediction. To scale up influence functions to modern machine learning settings, we develop a …

[Ham74I] This paper treats essentially the first derivative of an estimator viewed as functional and the ways in which it can be used to study local robustness properties. A theory of robust estimation "near" strict parametric models is briefly sketched and applied to some classical situations. Relations between von Mises functionals, the jackknife and U-statistics are indicated. A number of classical and …

[Zha16U] Through extensive systematic experiments, we show how the traditional approaches fail to explain why large neural networks generalize well in practice, and why understanding deep learning requires...
