pyDVL: the Python Data Valuation Library

pyDVL aims to provide reference implementations of data valuation algorithms behind a common, scikit-learn-compatible interface, together with a benchmarking suite to compare them.
Install the latest release with: pip install pydvl
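
After installing, one way to confirm that the package is visible to the current interpreter is to query its metadata. The sketch below uses only the Python standard library; `installed_version` is a hypothetical helper written for this example, not part of pyDVL.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if pydvl is installed, otherwise None.
print("pydvl version:", installed_version("pydvl"))
```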

pyDVL is a library for data valuation (see our review). You can install it from PyPI, contribute to the project, or check out the documentation.

As of version 0.3.0, pyDVL provides robust and, in most cases, parallel implementations of:

  • Leave One Out
  • Data Shapley
    • Exact values with different sampling methods
    • Exact values, for KNN [Jia19aE]
    • Truncated Monte Carlo Shapley [Gho19D]
    • Owen sampling [Okh21M]
    • Sparsity aware (upcoming)
    • KNN-Shapley surrogates (upcoming)
  • Beta Shapley [Kwo22B]
  • Least core [Yan21I] (coming soon)
  • Data Banzhaf [Wan22D] (coming soon)
  • Data Utility Learning [Wan22I]
  • Distributional Shapley (upcoming)
    • Generic
    • Optimizations for linear regression, binary classification, kernel density estimation
  • Influence functions (beta / in progress)
    • Fast Hessian-Vectors with stochastic optimization for large models
    • (Approximate) Maximum Influence Perturbation (upcoming)
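
To make the first two entries of the list concrete, here is a minimal sketch in plain Python of leave-one-out values, exact Shapley values computed from the definition, and a permutation-sampling Monte Carlo estimate. It deliberately does not use the pyDVL API: the toy data, the 1-nearest-neighbour utility, the chance-level score for the empty set, and all function names are assumptions made for illustration. The Monte Carlo variant is plain permutation sampling, without the early-stopping truncation that Truncated Monte Carlo Shapley adds.

```python
import random
from itertools import combinations
from math import comb

# Toy 1-D data as (feature, label). The last training point is deliberately mislabeled.
TRAIN = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1), (0.5, 1)]
TEST = [(0.2, 0), (0.8, 0), (1.2, 0), (3.8, 1), (4.4, 1), (5.2, 1)]
N = len(TRAIN)


def utility(subset):
    """Test accuracy of a 1-nearest-neighbour classifier fit on the given TRAIN indices."""
    if not subset:
        return 0.5  # assumption: an empty training set scores chance level
    points = [TRAIN[i] for i in sorted(subset)]  # sorted, so the result depends only on the set
    hits = 0
    for x, y in TEST:
        _, predicted = min(points, key=lambda p: abs(p[0] - x))
        hits += predicted == y
    return hits / len(TEST)


def leave_one_out():
    """Value of point i = utility of the full set minus utility without i."""
    full = tuple(range(N))
    u_full = utility(full)
    return [u_full - utility(tuple(j for j in full if j != i)) for i in range(N)]


def exact_shapley():
    """Shapley values from the definition: weighted average over all subsets."""
    values = []
    for i in range(N):
        others = [j for j in range(N) if j != i]
        total = 0.0
        for k in range(N):
            weight = 1.0 / (N * comb(N - 1, k))
            for s in combinations(others, k):
                total += weight * (utility(s + (i,)) - utility(s))
        values.append(total)
    return values


def permutation_shapley(n_permutations, seed=0):
    """Monte Carlo estimate: average marginal contributions over random permutations."""
    rng = random.Random(seed)
    sums = [0.0] * N
    for _ in range(n_permutations):
        perm = list(range(N))
        rng.shuffle(perm)
        previous = utility(())
        for position, i in enumerate(perm):
            current = utility(tuple(perm[: position + 1]))
            sums[i] += current - previous
            previous = current
    return [s / n_permutations for s in sums]


loo = leave_one_out()
shap = exact_shapley()
mc = permutation_shapley(n_permutations=200)

print("leave-one-out:", [round(v, 3) for v in loo])
print("exact Shapley:", [round(v, 3) for v in shap])
print("MC Shapley   :", [round(v, 3) for v in mc])
```

In this toy setup, leave-one-out assigns the mislabeled point a value of zero, the same as the two redundant correctly labeled points, whereas its exact Shapley value is negative. The exact and sampled Shapley values each sum to the utility of the full set minus that of the empty set. The exact computation enumerates every subset, so it is only feasible for a handful of points, which is why sampling-based approximations exist.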

In addition, we provide analyses of the strengths and weaknesses of key methods, as well as detailed examples.

References

  • [Wan22D] Data Banzhaf: A Robust Data Valuation Framework for Machine Learning, Jiachen T. Wang, Ruoxi Jia (2022)
  • [Kwo22B] Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning, Yongchan Kwon, James Zou. Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) (2022)
  • [Gho19D] Data Shapley: Equitable Valuation of Data for Machine Learning, Amirata Ghorbani, James Zou. Proceedings of the 36th International Conference on Machine Learning, PMLR (2019)
  • [Jia19aE] Efficient task-specific data valuation for nearest neighbor algorithms, Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, Dawn Song. Proceedings of the VLDB Endowment (2019)
  • [Okh21M] A Multilinear Sampling Algorithm to Estimate Shapley Values, Ramin Okhrati, Aldo Lipani. 2020 25th International Conference on Pattern Recognition (ICPR) (2021)
  • [Wan22I] Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning, Tianhao Wang, Yu Yang, Ruoxi Jia (2022)
  • [Yan21I] If You Like Shapley Then You’ll Love the Core, Tom Yan, Ariel D. Procaccia. Proceedings of the 35th AAAI Conference on Artificial Intelligence (2021)
