Data (e)valuation and model interpretation: a game theoretic approach

Attributing a “fair” (for some definition of fair) value to training samples has multiple applications. It can be used to investigate or improve data sources, to detect outliers, and to understand how certain features and their values influence global model performance. It can also improve model performance directly: removing low-value points can decrease model error. Finally, in problems where data are scarce and costly to obtain (e.g. some types of medical images), a fair value can be used to compensate providers for their data.

Alternatively, attributing global values (as opposed to local ones, i.e. around single predictions) to features can greatly help guide data collection and cleaning towards those with the highest impact, saving time and resources. By identifying valuable features, companies can also gain insight into their business from their data.
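The game-theoretic idea behind the papers below can be illustrated with a minimal sketch: treat each training point as a "player" and some model performance metric on a subset as the coalition's utility, then compute each player's Shapley value as its average marginal contribution over all orderings. The utility numbers here are made up for illustration; in practice one would retrain and score a model for every subset (or sample permutations, as in Maleki et al.).

```python
from itertools import permutations

# Hypothetical utilities: imagine u(S) is the test score of a model
# trained only on the subset S of the three data points "a", "b", "c".
utility = {
    frozenset(): 0.0,
    frozenset({"a"}): 0.6,
    frozenset({"b"}): 0.6,
    frozenset({"c"}): 0.5,
    frozenset({"a", "b"}): 0.7,
    frozenset({"a", "c"}): 0.8,
    frozenset({"b", "c"}): 0.75,
    frozenset({"a", "b", "c"}): 0.9,
}
players = ["a", "b", "c"]

def shapley_values(players, utility):
    """Exact Shapley values: average each player's marginal
    contribution over all orderings (permutations) of the players."""
    values = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = frozenset()
        for p in order:
            values[p] += utility[coalition | {p}] - utility[coalition]
            coalition = coalition | {p}
    return {p: v / len(perms) for p, v in values.items()}

vals = shapley_values(players, utility)
print(vals)
# Efficiency property: the values sum to the utility of the full set (0.9).
print(sum(vals.values()))
```

Exact computation requires $O(n!)$ permutations (or $2^n$ subsets), which is why the sampling-based approximations and learned estimators in the references below matter for real datasets.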

References

  • Data Shapley: Equitable Valuation of Data for Machine Learning, Amirata Ghorbani, James Zou. International Conference on Machine Learning (2019)
  • A Distributional Framework For Data Valuation, Amirata Ghorbani, Michael Kim, James Zou. International Conference on Machine Learning (2020)
  • Data Valuation using Reinforcement Learning, Jinsung Yoon, Sercan Ö. Arık, Tomas Pfister. International Conference on Machine Learning (2020)
  • Bounding the Estimation Error of Sampling-based Shapley Value Approximation, Sasan Maleki, Long Tran-Thanh, Greg Hines, Talal Rahwan, Alex Rogers. arXiv:1306.4265 [cs] (2014)