pyDVL: the Python Data Valuation Library

pyDVL aims to offer reference implementations of data valuation algorithms behind a common interface compatible with scikit-learn, together with a benchmarking suite to compare them.
The library is still under active development and will be released soon as open source.

pyDVL is a library for data valuation (see our review). It explores and implements:

  • Leave-one-out (LOO)
  • Data Shapley
    • Exact (combinatorial)
    • Exact, for KNN
    • Truncated Monte Carlo Shapley
    • G-Shapley
    • Sparsity aware
    • KNN-Shapley surrogates
  • Distributional Shapley
    • Generic
    • Optimizations for linear regression, binary classification, kernel density estimation
  • Influence functions
    • Fast Hessian-vector products with stochastic optimization for large models
    • (Approximate) Maximum Influence Perturbation
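
To make the first two entries of the list concrete, here is a minimal sketch of leave-one-out values and of a Monte Carlo Shapley estimate based on permutation sampling, which is the estimator that Truncated Monte Carlo Shapley builds on (the truncation step is omitted here). It is not pyDVL's interface: it uses only NumPy and scikit-learn, and the names `utility`, `loo_values` and `permutation_shapley`, the toy dataset, and the choice of test accuracy as the utility are all assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def utility(model, X_train, y_train, X_test, y_test):
    """Test accuracy of a fresh copy of `model` fitted on the given subset."""
    # Assumption of this sketch: subsets with fewer than two classes
    # (including the empty set) cannot be fitted and get utility 0.
    if len(np.unique(y_train)) < 2:
        return 0.0
    return clone(model).fit(X_train, y_train).score(X_test, y_test)


def loo_values(model, X_train, y_train, X_test, y_test):
    """Leave-one-out value of point i: u(D) - u(D without i)."""
    n = len(X_train)
    full = utility(model, X_train, y_train, X_test, y_test)
    values = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # boolean mask dropping point i
        values[i] = full - utility(model, X_train[keep], y_train[keep], X_test, y_test)
    return values


def permutation_shapley(model, X_train, y_train, X_test, y_test,
                        n_permutations=10, seed=0):
    """Monte Carlo Shapley estimate: average marginal contribution of each
    point over random permutations of the training set (no truncation)."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    values = np.zeros(n)
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        previous = 0.0  # utility of the empty set (see `utility`)
        for k, i in enumerate(perm, start=1):
            # Sorting keeps the row order fixed, so the utility depends
            # only on which points are in the subset, not on their order.
            subset = np.sort(perm[:k])
            current = utility(model, X_train[subset], y_train[subset], X_test, y_test)
            values[i] += current - previous  # marginal contribution of point i
            previous = current
    return values / n_permutations


# Toy problem, kept small so that a few hundred refits run quickly.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
model = LogisticRegression(max_iter=200)

full_utility = utility(model, X_train, y_train, X_test, y_test)
loo = loo_values(model, X_train, y_train, X_test, y_test)
shapley = permutation_shapley(model, X_train, y_train, X_test, y_test)

print("utility of the full training set:", full_utility)
print("indices of the 5 lowest LOO values:    ", np.argsort(loo)[:5])
print("indices of the 5 lowest Shapley values:", np.argsort(shapley)[:5])
```

Within each permutation the marginal contributions telescope, so the estimated Shapley values sum to the utility of the full training set minus that of the empty set. This gives a quick sanity check on the estimator. The cost is one model fit per point per permutation, which is what truncation and the other approximations in the list aim to reduce.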

In addition, we provide analyses of the strengths and weaknesses of key methods, as well as detailed examples.

Contact us for access
