Rather than better models or more data, *good* data is very often the key to a
successful application of machine learning. Sophisticated models can only go so
far and, almost invariably for real business applications, improvements in data
acquisition, annotation and cleaning are a much better investment of resources
than researching complex models. As part of our mission to help practitioners
make the most of their time and their data, we have developed **pyDVL**, the
Python Data Valuation Library.

As of **version 0.7.1**, pyDVL provides robust, parallel implementations of most
popular methods for data valuation and influence analysis. We are also working
on a benchmarking suite to compare them.

Install it with `pip install pydvl`, or check out the documentation.

## Methods implemented

- Leave One Out
- Data Shapley [Gho19D] values with different sampling methods
- Truncated Monte Carlo Shapley [Gho19D]
- Exact Data Shapley for KNN [Jia19aE]
- Owen sampling [Okh21M]
- Group testing Shapley [Jia19aE]
- Least Core [Yan21I]
- Data Utility Learning [Wan22I]
- Data Banzhaf [Wan22D]
- Beta Shapley [Kwo22B]
- Generalized semi-values
- Data-OOB [Kwo23D]
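
To give a feel for what these methods compute, here is a minimal, library-independent sketch of two of them: Leave One Out and plain permutation-sampling Monte Carlo Shapley (without the truncation used in Truncated Monte Carlo Shapley). It does not use pyDVL's API. The toy utility, the constants `WEIGHTS` and `BONUS`, and all function names are assumptions made for illustration; in practice the utility would be something like a model's validation score after training on the given subset.

```python
import random

# Hypothetical toy setup: three data points with individual contributions,
# plus a bonus that is only earned when points 0 and 1 are both present.
WEIGHTS = [1.0, 2.0, 3.0]
BONUS = 2.0


def toy_utility(subset):
    """Stand-in for 'train on subset, score on validation data'."""
    value = sum(WEIGHTS[i] for i in subset)
    if 0 in subset and 1 in subset:
        value += BONUS
    return value


def leave_one_out(n, u):
    """Value of point i = utility of the full set minus utility without i."""
    full = set(range(n))
    total = u(full)
    return [total - u(full - {i}) for i in range(n)]


def permutation_shapley(n, u, n_permutations=2000, seed=0):
    """Monte Carlo Shapley: average marginal contributions over random orderings."""
    rng = random.Random(seed)
    values = [0.0] * n
    indices = list(range(n))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        subset = set()
        previous = u(subset)
        for i in indices:
            subset.add(i)
            current = u(subset)
            values[i] += current - previous  # marginal contribution of i
            previous = current
    return [v / n_permutations for v in values]


loo = leave_one_out(len(WEIGHTS), toy_utility)
shapley = permutation_shapley(len(WEIGHTS), toy_utility)
print("LOO:    ", loo)
print("Shapley:", shapley)
```

In this toy game, Leave One Out credits the full bonus to both point 0 and point 1, whereas the Shapley estimate splits it roughly evenly between them, so the Shapley values add up to the utility of the full dataset.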

In addition to these, we are developing a robust framework for the computation of influence functions using:

- Conjugate Gradient [Koh17U]
- Linear (time) Stochastic Second-Order Algorithm (LiSSA) [Aga17S]
- Arnoldi iteration [Sch21S]
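
As a rough illustration of the first of these, the sketch below uses conjugate gradient to solve `H x = g` with access only to Hessian-vector products, which is how the inverse-Hessian-vector product inside an influence function can be approximated without forming or inverting `H`. It is not pyDVL's implementation: the 2x2 matrix `TOY_HESSIAN`, the gradient vectors and every function name are hypothetical, and the sign convention (influence as minus the test gradient times the solved vector) is assumed to follow [Koh17U].

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(hvp, b, tol=1e-10, max_iter=100):
    """Solve H x = b for a symmetric positive definite H, given only v -> H v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - H x, starting from x = 0
    p = list(r)  # current search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        hp = hvp(p)
        alpha = rs_old / dot(p, hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical toy problem: a small positive definite "Hessian" and two gradients.
TOY_HESSIAN = [[4.0, 1.0], [1.0, 3.0]]
grad_train = [1.0, 2.0]  # gradient of the loss at a training point (made up)
grad_test = [1.0, 0.0]   # gradient of the loss at a test point (made up)


def toy_hvp(v):
    """Hessian-vector product; a real model would use automatic differentiation."""
    return [dot(row, v) for row in TOY_HESSIAN]


x = conjugate_gradient(toy_hvp, grad_train)  # approximates H^{-1} grad_train
influence = -dot(grad_test, x)
print("H^-1 g:   ", x)
print("influence:", influence)
```

For large models the appeal of this approach is that only the function `v -> H v` is needed, so the Hessian itself never has to be stored.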

Finally, we provide analyses of the strengths and weaknesses of key methods, as well as detailed examples for most of them.

## Roadmap

We are currently implementing or plan to implement:

- Class-Wise Shapley [Sch22C]
- Neural Tangent Kernel scorer [Wu22D]
- Improved parallelization and caching strategies
- Lazy evaluation of influence factors for large models
- Variance-reduced sampling methods for Shapley values
- (Approximate) Maximum Influence Perturbation [Bro21A]

To see what new methods, features and improvements are coming, check out the issues on GitHub.