pyDVL: the Python Data Valuation Library

pyDVL strives to offer reference implementations of data valuation algorithms behind a common interface compatible with scikit-learn, together with a benchmarking suite to compare them.
Install the latest release with pip install pydvl

pyDVL is a library for data valuation (see our review). You can install it from PyPI, contribute to the project, or check out the documentation.

As of version 0.5.0, pyDVL provides robust, parallel implementations of:

  • Leave One Out
  • Data Shapley
    • Exact values with different sampling methods
    • Exact values for KNN [Jia19aE]
    • Truncated Monte Carlo Shapley [Gho19D]
    • Owen sampling [Okh21M]
  • Group testing Shapley [Jia19aE]
  • Least Core [Yan21I]
  • Data Utility Learning [Wan22I]
  • Influence functions (beta / in progress) [Koh17U]
    • Fast Hessian-vector products with stochastic optimization for large models
    • (Approximate) Maximum Influence Perturbation (upcoming)
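To make concrete what these methods estimate, here is a from-scratch sketch of the simplest one, Leave-One-Out. This is not pyDVL's implementation, just an illustration of the idea: the value of point i is the drop in test utility when i is removed from the training set.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale so LogisticRegression converges
# keep the example small: 50 training points, 100 test points
x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, train_size=50, test_size=100, random_state=0, stratify=y
)

def utility(idx):
    """Test accuracy of a model trained only on the points in idx."""
    model = LogisticRegression()
    model.fit(x_tr[idx], y_tr[idx])
    return model.score(x_te, y_te)

all_idx = np.arange(len(x_tr))
full_score = utility(all_idx)
# LOO value of point i: utility with all points minus utility without i
loo_values = np.array(
    [full_score - utility(np.delete(all_idx, i)) for i in all_idx]
)
lowest = np.argsort(loo_values)[:5]  # indices of the least valuable points
```

pyDVL wraps exactly this train-score loop in its Utility object and parallelizes it; the more sophisticated methods listed above differ mainly in which subsets they train on.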

In addition, we provide analyses of the strengths and weaknesses of key methods, as well as detailed examples.

Usage

If you have any scikit-learn compatible supervised model, you can use pyDVL in three easy steps:

  1. Create a Dataset object with your train and test splits.
  2. Create a Utility object to wrap the Dataset, the model and a scoring function.
  3. Use one of the methods defined in the library to compute the values.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from pydvl.utils import Dataset, Utility, Scorer
    from pydvl.value import *

    # 1. Wrap the train / test splits in a Dataset
    data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
    # 2. Bundle model, data and scoring function into a Utility.
    #    Subsets that cannot be fit receive the default score of 0.0.
    model = LogisticRegression()
    u = Utility(model, data, Scorer("accuracy", default=0.0))
    # 3. Compute values with Truncated Monte Carlo Shapley, stopping
    #    after 100 value updates per index
    values = compute_shapley_values(
        u,
        mode=ShapleyMode.TruncatedMontecarlo,
        done=MaxUpdates(100),
        truncation=RelativeTruncation(u, rtol=0.01),
    )
    values.sort(key="value")
    print(f"The lowest 5% valued points are: {values.indices[:int(len(values) * 0.05)]}")
    

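For intuition about what the call above computes, here is a from-scratch sketch of truncated Monte Carlo Shapley. It is not pyDVL's implementation: it samples random permutations, credits each point with its marginal contribution when appended to the preceding prefix, and, mirroring the RelativeTruncation idea, stops evaluating a permutation once the prefix score is within rtol of the full-data score.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, train_size=30, test_size=100, random_state=0, stratify=y
)
n = len(x_tr)

def utility(idx):
    # like Scorer("accuracy", default=0.0): subsets with fewer than
    # two classes cannot be fit and get the default score
    if len(np.unique(y_tr[idx])) < 2:
        return 0.0
    return LogisticRegression().fit(x_tr[idx], y_tr[idx]).score(x_te, y_te)

rng = np.random.default_rng(0)
full_score = utility(np.arange(n))
sums, counts = np.zeros(n), np.zeros(n)
for _ in range(10):  # pyDVL bounds these passes with e.g. MaxUpdates(100)
    perm = rng.permutation(n)
    prev = 0.0  # the empty set scores the default
    for k, i in enumerate(perm):
        # truncation: once the prefix is close enough to the full
        # score, the remaining marginals are taken to be zero
        if abs(full_score - prev) <= 0.01 * abs(full_score):
            marginal = 0.0
        else:
            curr = utility(perm[: k + 1])
            marginal, prev = curr - prev, curr
        sums[i] += marginal
        counts[i] += 1
shapley_estimates = sums / counts
```

The truncation step is what makes the method practical: most permutations reach the full-data score long before all points have been added, so the majority of model fits are skipped.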
Roadmap

To see what new methods, features and improvements we are currently working on, check out the issues on GitHub.

References

  • [Gho19D] Data Shapley: Equitable Valuation of Data for Machine Learning, Amirata Ghorbani, James Zou. Proceedings of the 36th International Conference on Machine Learning, PMLR (2019)
  • [Jia19aE] Efficient task-specific data valuation for nearest neighbor algorithms, Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li, Ce Zhang, Costas Spanos, Dawn Song. Proceedings of the VLDB Endowment (2019)
  • Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning, Yongchan Kwon, James Zou. Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) (2022)
  • [Koh17U] Understanding Black-box Predictions via Influence Functions, Pang Wei Koh, Percy Liang. Proceedings of the 34th International Conference on Machine Learning (2017)
  • [Okh21M] A Multilinear Sampling Algorithm to Estimate Shapley Values, Ramin Okhrati, Aldo Lipani. 2020 25th International Conference on Pattern Recognition (ICPR) (2021)
  • CS-Shapley: Class-wise Shapley Values for Data Valuation in Classification, Stephanie Schoch, Haifeng Xu, Yangfeng Ji. Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS) (2022)
  • Data Banzhaf: A Robust Data Valuation Framework for Machine Learning, Jiachen T. Wang, Ruoxi Jia. (2022)
  • [Wan22I] Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning, Tianhao Wang, Yu Yang, Ruoxi Jia. (2022)
  • [Yan21I] If You Like Shapley Then You'll Love the Core, Tom Yan, Ariel D. Procaccia. Proceedings of the 35th AAAI Conference on Artificial Intelligence (2021)
