Rather than better models or more data, good data is very often the key to a successful application of machine learning. Sophisticated models can only go so far and, almost invariably for real business applications, improvements in data acquisition, annotation and cleaning are a much better investment of resources than researching complex models. As part of our mission to help practitioners make the most of their time and their data, we have developed pyDVL, the Python Data Valuation Library.
As of version 0.7.1, pyDVL provides robust, parallel implementations of most popular methods for data valuation and influence analysis. We are also working on a benchmarking suite to compare them.
To get started, run pip install pydvl, or check out the documentation.
The following data valuation methods are available:
- Leave One Out
- Data Shapley [Gho19D] values with different sampling methods
- Truncated Monte Carlo Shapley [Gho19D]
- Exact Data Shapley for KNN [Jia19aE]
- Owen sampling [Okh21M]
- Group testing Shapley [Jia19aE]
- Least Core [Yan21I]
- Data Utility Learning [Wan22I]
- Data Banzhaf [Wan22D]
- Beta Shapley [Kwo22B]
- Generalized semi-values
- Data-OOB [Kwo23D]
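To give a flavour of what the simplest of these methods compute, here is a minimal, self-contained sketch of Leave One Out and of permutation-sampling Monte Carlo Shapley. It is not pyDVL's implementation and does not use its API: the toy dataset, the 1-nearest-neighbour utility and all function names are assumptions made for illustration only.

```python
import random
from typing import Callable, FrozenSet, List

Utility = Callable[[FrozenSet[int]], float]

# Hypothetical toy data: (feature, label). The last training point is mislabeled.
train = [(0.0, 0), (0.3, 0), (1.0, 1), (1.3, 1), (0.12, 1)]
valid = [(0.1, 0), (0.25, 0), (1.1, 1), (0.9, 1)]


def knn_accuracy(subset: FrozenSet[int]) -> float:
    """Utility: validation accuracy of a 1-NN classifier fit on `subset`."""
    if not subset:
        return 0.0
    correct = 0
    for x, y in valid:
        nearest = min(subset, key=lambda i: abs(train[i][0] - x))
        if train[nearest][1] == y:
            correct += 1
    return correct / len(valid)


def leave_one_out(n: int, utility: Utility) -> List[float]:
    """Value of point i: utility of the full set minus utility without i."""
    full = frozenset(range(n))
    u_full = utility(full)
    return [u_full - utility(full - {i}) for i in range(n)]


def permutation_shapley(
    n: int, utility: Utility, n_permutations: int = 200, seed: int = 0
) -> List[float]:
    """Monte Carlo Shapley: average marginal contribution over random orderings."""
    rng = random.Random(seed)
    values = [0.0] * n
    indices = list(range(n))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        subset: FrozenSet[int] = frozenset()
        u_prev = utility(subset)
        for i in indices:
            subset = subset | {i}
            u_curr = utility(subset)
            values[i] += u_curr - u_prev
            u_prev = u_curr
    return [v / n_permutations for v in values]


loo = leave_one_out(len(train), knn_accuracy)
shap = permutation_shapley(len(train), knn_accuracy)
print("LOO:    ", loo)
print("Shapley:", shap)
```

In this toy setting, Leave One Out assigns a negative value to the mislabeled point, since removing it improves validation accuracy. The Shapley estimates sum to the utility of the full dataset minus that of the empty set, because each permutation's marginal contributions telescope. The cost is one utility evaluation (here, one model fit) per point and permutation, which is why truncation and smarter sampling schemes matter in practice.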
In addition to these, we are developing a robust framework for the computation of influence functions using:
- Conjugate Gradient [Koh17U]
- Linear (time) Stochastic Second-Order Algorithm (LiSSA) [Aga17S]
- Arnoldi iteration [Sch21S]
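All three approaches address the same bottleneck: influence functions require the product of an inverse Hessian with a gradient, and forming or inverting the Hessian explicitly is infeasible for large models. As a rough illustration of the first approach, the sketch below solves H s = g with conjugate gradient using only Hessian-vector products. It is a plain-Python toy under stated assumptions, not pyDVL's code: the function names are hypothetical, and the small fixed matrix stands in for the Hessian of a real model, which would be accessed through automatic differentiation.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(
    hvp: Callable[[Vector], Vector],
    b: Vector,
    tol: float = 1e-10,
    max_iter: int = 100,
) -> Vector:
    """Solve H x = b for symmetric positive definite H, given only v -> H v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - H x, with x = 0 initially
    p = list(r)  # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        hp = hvp(p)
        alpha = rs_old / dot(p, hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Stand-in for a model's Hessian (assumption: symmetric positive definite).
H = [[4.0, 1.0], [1.0, 3.0]]


def toy_hvp(v: Vector) -> Vector:
    """Hessian-vector product for the toy matrix above."""
    return [dot(row, v) for row in H]


grad = [1.0, 2.0]  # stand-in for the gradient of a test loss
inverse_hvp = conjugate_gradient(toy_hvp, grad)
print(inverse_hvp)
```

Because the solver only ever calls the Hessian-vector product, memory use stays linear in the number of parameters. For non-convex models the Hessian is generally not positive definite, so some form of regularization (for example adding a multiple of the identity) is typically needed before conjugate gradient applies.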
Finally, we provide analyses of the strengths and weaknesses of key methods, as well as detailed examples for most of them.
We are currently implementing or plan to implement:
- Class-Wise Shapley [Sch22C]
- Neural Tangent Kernel scorer [Wu22D]
- Improved parallelization and caching strategies
- Lazy evaluation of influence factors for large models
- Variance-reduced sampling methods for Shapley values
- (Approximate) Maximum Influence Perturbation [Bro21A]
To see what new methods, features and improvements are coming, check out the issues on GitHub.