Applications of data valuation in machine learning

At TransferLab we have extensively covered existing and developing methods for Data valuation, the task of attributing value to the samples in a dataset. This family of techniques can help with data collection, active learning, model development and debugging, or fair compensation of data providers, among other use cases. This blog post reviews the most important of these applications.

In this blog post, we work within the frameworks introduced in our series on Data valuation, primarily focusing on the model-dependent context. Here, methods compute the worth of individual training samples based on “how much they contribute” to the final performance of a model over a valuation set. For details on the methods themselves, we refer to the introductory section of our series and to our paper pills.

1 Data engineering

Perhaps the most relevant uses of data value are in improving data and data collection processes, as these impact almost every application in science and industry.

1.1 Repairing and pruning corrupt data

Data can be corrupted in many ways, be it adversarially or not: labels or features can be noisy, and training samples can be tampered with to globally reduce performance or to enable targeted misclassification at test time.

With a valuation function for the training set in our hands, we can try to clean the data to improve performance. By ranking all samples according to their data value, discarding a portion of the lowest-valued ones and retraining the model, we can potentially enhance its performance. The intuition is that in-distribution points should have higher values than contaminated or extraneous ones. And indeed, empirically, this procedure tends to reduce test error to a degree that depends on the quality of the initial data, the robustness of the model, and the accuracy of the computation of the value function. [Gho19D] illustrates this in several experiments, and we see the same behaviour in our own tests with multiple approaches, see [Tra22P] and Figure 1.
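
To make this concrete, here is a minimal sketch of the prune-and-retrain check on synthetic data (the kind of experiment behind Figure 1). The valuation method is a stand-in: we use plain leave-one-out values for brevity, but any method from the series (e.g. as implemented in [Tra22P]) could produce the `values` array.

```python
# Minimal sketch: rank training points by (stand-in) data values, drop the
# lowest-valued fraction, retrain and measure accuracy on the valuation set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def utility(idx):
    """Accuracy on the valuation set of a model trained on the subset `idx`."""
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    return model.score(X_val, y_val)

all_idx = np.arange(len(X_train))
full_utility = utility(all_idx)
# Stand-in values: leave-one-out drop in utility when removing each point.
values = np.array([full_utility - utility(np.delete(all_idx, i)) for i in all_idx])

order = np.argsort(values)  # ascending: lowest-valued points first
for frac in (0.0, 0.05, 0.1, 0.2):
    kept = order[int(frac * len(order)):]
    print(f"removed {frac:.0%} lowest-valued -> accuracy {utility(kept):.3f}")
```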

The first issue is that there is no automated way of determining the threshold at which to stop removing low-value samples, and by iteratively removing them and retraining, one overfits to the test set. For this reason, instead of trying to automate the process, one can involve domain experts to examine the data, both low- and high-value, to identify significant patterns (cf. Section 1.4). One successful use case is the construction of scientific or benchmark datasets [Tan21D]. Generally speaking, data valuation can be a useful tool when used carefully, not only for the potential gain in performance, but also because of the insights gained into what makes data good or bad.

A more fundamental difficulty derives from an intrinsic weakness of metric evaluation over a fixed valuation set. The ability to distinguish harmful data will strongly depend on whether that set is itself free of outliers, and on the robustness of the model to outliers in the training data. These drawbacks are common to all supervised anomaly detection methods, and using (negative) data value as an anomaly score is fraught with the same problems. Techniques like Data-OOB [Kwo23D] and CGS [Noh23D] circumvent this issue either through bagging or by avoiding training the model altogether.
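
As an illustration of the first workaround, here is a simplified sketch of the out-of-bag idea behind Data-OOB [Kwo23D]: each training point is scored by the average correctness of bagged models that did not see it during training, so no separate valuation set is needed. This is not the exact estimator of the paper, only its core mechanism, and `X`, `y` are assumed to be NumPy arrays.

```python
# Out-of-bag values (simplified): average correctness, for each point, of the
# bagged models for which that point was out-of-bag.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_values(X, y, n_models=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    correct, counts = np.zeros(n), np.zeros(n)
    for _ in range(n_models):
        bag = rng.integers(0, n, size=n)           # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), bag)      # points not seen by this model
        model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X[bag], y[bag])
        correct[oob] += model.predict(X[oob]) == y[oob]
        counts[oob] += 1
    return correct / np.maximum(counts, 1)         # points never OOB get value 0
```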

In a very similar vein, when data is scarce, instead of discarding data that impairs performance, we can try to identify what needs fixing beforehand in order to reduce time spent in discussion with customers, domain experts or data providers.

Figure 1. Removing points of low value in succession for 2 game-theoretic valuation methods and a random baseline.

Since influence functions can be computed for each individual test sample, they provide a method to decide which labels to fix first, namely those of training samples that are highly influential for erroneous predictions.

Finally, data value is sometimes used to explain the actions of black-box data repair tools (commercial tools to impute missing data, but whose actions are opaque to the user). For more on this, see [Deu21E] and the references therein.

1.2 Pruning superfluous data

In contrast to the previous setting, where we aimed at identifying harmful data, pruning superfluous data aims at removing redundant or uninformative data. Before, we could focus on high- or low-value points, but uninformative points will tend to have values concentrated around zero for many methods.

In deep learning, it has long been observed that the test error decreases as a power law of training set or model size, a fact which drives the increasingly high computational and energy costs of training large models. However, recent research shows that one can improve the situation, and possibly achieve exponential scaling, by choosing a good metric to dictate the order in which to discard training examples for any pruned dataset size. [Sor22N] run extensive benchmarks and explore optimal scaling laws. Alas, they demonstrate that there is no silver bullet, and show how the situations in which pruning is doable, desirable or counterproductive depend on model capacity, the amount of data available and its quality.

Figure 2. [Sch23D] An overview of TS-DShapley: 1) Process the data using the target LM; 2) Compute sampling chains using a subset of the training set and aggregate the resulting Shapley values; and 3) Transfer the estimated data value information for use with the target LM by estimating the optimal low value data removal index.

One way to prune the data effectively is therefore the following: first, train a simple proxy model to define the pruning metric; then use it to discard part of the data, and train the costly model on what remains.
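
A minimal sketch of this pattern, with a logistic regression as the cheap proxy and a gradient-boosting model standing in for the expensive one. For brevity, the per-sample score here is just the proxy's confidence in the true label; a real pipeline would plug in values computed with the proxy model as utility, as in the examples above.

```python
# Prune with a cheap proxy model, then train the costly model on what remains.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for the costly model
from sklearn.linear_model import LogisticRegression      # cheap proxy

def prune_with_proxy(X_train, y_train, X_val, y_val, drop_frac=0.1):
    proxy = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # Toy per-sample score: the proxy's probability for the true label
    # (assumes integer labels 0..K-1, so columns of predict_proba align with labels).
    scores = proxy.predict_proba(X_train)[np.arange(len(y_train)), y_train]
    keep = np.argsort(scores)[int(drop_frac * len(y_train)):]  # drop the lowest-scored
    costly = GradientBoostingClassifier().fit(X_train[keep], y_train[keep])
    return costly.score(X_val, y_val)
```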

A related application is fine-tuning, where a pre-trained model is fine-tuned on a pruned new data set, guided by a metric that uses the initial model. [Sch23D] proposes this workflow as TS-DShapley (Figure 2), the first application of data valuation in the context of LLMs. The initial model is used to compute embeddings for the fine-tuning dataset, and a simple proxy model is trained on these. Values are then estimated and subsequently used to prune this data by removing the lowest-valued points.

In a more adversarial setting, a provider can pad a dataset with trivially augmented, duplicated or simply unrelated samples in order to inflate its size and increase its price. A data purchaser is therefore interested in identifying irrelevant data.1 1 For more on augmented and irrelevant data, see [Jus23L]. Alas, data markets typically only offer previews of the datasets, making an effective pruning strategy impossible, unless it is offered by the market platform itself.

Influence functions can be used for the pruning metric, and optimal transport valuation has been proposed in [Jus23L], but more research and testing are required, especially in the application to data markets.

Although not strictly a valuation method, CRAIG [Mir20C] is worth mentioning as an interesting data-centric pruning technique. The idea is to pick a subset over which the gradient most closely approximates the full gradient, and to train only on it. The selection is performed with a greedy search using marginal utility as the objective. Theoretical guarantees (for the exact solution) hold in certain settings, in particular assuming a Lipschitz condition on the gradient of the model.
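
To convey the idea of the greedy search, here is a much-simplified sketch: per-sample gradients (or cheap proxies, such as last-layer gradients) are "covered" by a facility-location-style greedy selection on cosine similarities. CRAIG itself works with bounds on gradient differences and also assigns weights to the selected points, so this is only an illustration of the selection step.

```python
# Much-simplified greedy coreset selection: choose a subset whose per-sample
# gradients cover those of the full dataset (facility-location objective).
import numpy as np

def greedy_coreset(grads: np.ndarray, k: int) -> list[int]:
    """grads: (n, d) matrix of per-sample gradients (or cheap proxies thereof)."""
    n = grads.shape[0]
    unit = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T                    # pairwise cosine similarities
    cover = np.full(n, -1.0)               # best similarity to the selected set so far
    selected: list[int] = []
    for _ in range(k):
        # Marginal facility-location gain of each candidate.
        gains = np.maximum(sim - cover[:, None], 0.0).sum(axis=0)
        gains[selected] = -np.inf          # do not pick the same point twice
        best = int(np.argmax(gains))
        selected.append(best)
        cover = np.maximum(cover, sim[:, best])
    return selected
```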

1.3 Batch active learning

Another application of data valuation is the labeling of new data, a task whose cost often necessitates carefully selecting what to label next. Batch active learning enhances this efficiency by selecting groups, or “batches”, of new samples so as to optimize learning performance. The general idea is as follows (a minimal sketch in code follows the list):2 2 For a very good review of most techniques in batch active learning, we recommend the excellent blog post by Lilian Weng [Wen22L].

  1. The model is trained on an initial set of labeled data.
  2. A score is computed for each unlabeled data point in the dataset, reflecting the potential value of labeling that point. The scoring can be based on information gain, diversity of data or expected influence on the model.
  3. Based on these scores, a batch of data points is selected for labeling. The goal is to choose a diverse set of high-scoring points that will collectively add significant new information to the model.
  4. These new labeled points are added to the training set, and the model is retrained on the updated set.
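
The sketch announced above, written as a generic loop. The scoring function is a placeholder (here predictive entropy) that can be swapped for a data-value or influence-based score as discussed next, and `y_pool_oracle` stands in for the human annotator.

```python
# Generic batch active-learning loop following steps 1-4 above.
import numpy as np
from sklearn.linear_model import LogisticRegression

def entropy_score(model, X):
    """Example score: predictive entropy (an uncertainty measure)."""
    p = model.predict_proba(X)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def active_learning(X_lab, y_lab, X_pool, y_pool_oracle, score_fn=entropy_score,
                    batch_size=16, rounds=5):
    for _ in range(rounds):
        model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)  # 1. train
        scores = score_fn(model, X_pool)                             # 2. score the pool
        batch = np.argsort(scores)[-batch_size:]  # 3. pick a batch (real methods add diversity)
        X_lab = np.vstack([X_lab, X_pool[batch]])                    # 4. label and add
        y_lab = np.concatenate([y_lab, y_pool_oracle[batch]])
        X_pool = np.delete(X_pool, batch, axis=0)
        y_pool_oracle = np.delete(y_pool_oracle, batch)
    return X_lab, y_lab
```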

As an alternative to commonly used scoring functions in the field, such as information gain, [Gho22D] proposes to use Shapley values. First, these are computed for the training set, and then a regression model is trained on them.3 3 Actually, because computing Shapley values is so expensive, a surrogate KNN model replaces the last layers of a base DNN, effectively computing the Shapley value for a KNN classifier over the embeddings learnt by the network for each sample, which can be done exponentially faster than in the general case, thanks to the local structure of KNN classifiers. This method was introduced in [Jia19aE]. This valuation model is used to estimate the value of new samples, which is then used to select the next batch, see Figure 3. The authors report their method to work well for noisy or heterogeneous data, and even under domain shift.
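
A sketch of the workflow in Figure 3 (not the authors' implementation): `feat_lab` and `feat_pool` are assumed to be embeddings of the labeled and unlabeled data, and `values_lab` any precomputed data values for the labeled set (e.g. KNN-Shapley over the embeddings, as in the footnote).

```python
# Fit a regressor to predict data values from features, rank the unlabeled
# pool with it, and pre-select a top fraction for a diversity-based method.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def preselect_by_predicted_value(feat_lab, values_lab, feat_pool, keep_frac=0.3):
    value_model = RandomForestRegressor(n_estimators=200, random_state=0)
    value_model.fit(feat_lab, values_lab)           # regress value on features
    predicted = value_model.predict(feat_pool)      # estimated values of unlabeled points
    n_keep = max(1, int(keep_frac * len(feat_pool)))
    return np.argsort(predicted)[-n_keep:]          # indices for the diversity method
```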

Yet another option is the influence function: [Liu21I] use it as an estimate of the change in the loss when adding an unseen sample to the training set. Because the influence requires the gradient of the loss w.r.t. model parameters, and consequently requires the label, they instead use the output of the model on the unlabeled sample as a pseudo-label.

Figure 3. [Gho22D] Active Data Shapley Enhancing a Diversity Method. (a) Given a trained model, labeled data is featurized, exact Shapley values are computed, and a regression model is trained to predict Shapley values from features. (b) Unlabeled data is featurized, and Shapley values are estimated with the regressor. (c) Unlabeled data is ranked by the estimated Shapley value. The top fraction is pre-selected and fed into any given diversity method to obtain the final batch of points to label.

Reported reductions in the number of required annotations w.r.t. random sampling for a fixed test-time accuracy range from 10% to upwards of 30% for both methods. These numbers vary greatly with the domain (always computer vision in the cited papers) and measured accuracy, but can still represent substantial economic savings. Regrettably, to date, no benchmark of game-theoretic vs. influence-based methods has been performed within this context.

1.4 Data collection

Similarly to data labeling, the cost of acquisition often makes targeted choices necessary. If the values computed for the training samples are representative of the true population rather than being an artifact of the sampling, or, roughly put, if the value of a point is stable under changes in the dataset, then it makes sense to try and identify the most valuable points in order to gather more like them. This can also help in hypothesizing relations between data features and predictions: Imagine an accurate pricing model to assist human sales representatives. If high-value data points are identified and the most influential features in those are extracted in a subsequent step, the human operator can use this information when setting or negotiating prices.

Conversely, the least valuable points might help improve data collection, e.g., by detecting patterns of mislabeling or problems with specific data sources. This can save much effort, in particular in the first stages of a project.

2 Model development

2.1 Interpretation and debugging

Numerous techniques exist to examine the behaviour of supervised models considered as black boxes. While a complete taxonomy is beyond the scope of this text (for a thorough review, see, e.g., [Bod21B]), most methods revolve around test-time predictions. Some approaches seek global explanations, perhaps in terms of how input features affect the overall outcome, while others are local and focus on individual test samples.4 4 So-called (black-box) eXplainable AI comes with lots of caveats and pitfalls, like unstable explanations and conflicting outcomes from different methods [Gos19N]. Therefore, it is always advisable to prefer simpler, interpretable models or to use these techniques during development and debugging. The risk of bogus explanations negatively affecting human decisions is a serious concern. For an insightful review of the many dangers, see [Rud19S]. For an example (out of many) in clinical practice, see [Jac21H].

The approaches we consider instead look at the effect that individual training samples have on the result. By examining the most and least valuable or influential samples, it is possible to explore the limitations of the model.

As an illustration, consider a $K$-class classification problem where accuracy is currently low for some class $k_{0}.$ Construct a valuation set restricted to samples in $k_{0},$ and compute the value of the training samples (i.e., their “contribution” towards achieving a good model performance measured on the restricted valuation set). Their values should then reflect their usefulness in predicting the class $k_{0}.$
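
A sketch of this diagnostic, with a plain Monte Carlo (permutation) Shapley estimate as the valuation method; the key point is only that the utility is evaluated on the valuation set restricted to class $k_0$. This is purely illustrative: permutation sampling is expensive, and in practice one would use a truncated or otherwise accelerated estimator.

```python
# Value training points against a valuation set restricted to class k0.
import numpy as np
from sklearn.linear_model import LogisticRegression

def values_for_class(X_tr, y_tr, X_val, y_val, k0, n_perms=50, seed=0):
    rng = np.random.default_rng(seed)
    X_v, y_v = X_val[y_val == k0], y_val[y_val == k0]   # restricted valuation set

    def utility(idx):
        if len(np.unique(y_tr[idx])) < 2:
            return 0.0                                  # not enough classes to train
        model = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
        return model.score(X_v, y_v)                    # accuracy on class k0 only

    n = len(X_tr)
    values = np.zeros(n)
    for _ in range(n_perms):                            # permutation (Monte Carlo) Shapley
        perm = rng.permutation(n)
        prev = 0.0                                      # utility of the empty set taken as 0
        for i in range(1, n + 1):
            curr = utility(perm[:i])
            values[perm[i - 1]] += curr - prev          # marginal contribution
            prev = curr
    return values / n_perms

# Inspect the training points most valuable for predicting class k0, e.g.:
# top = np.argsort(values_for_class(X_tr, y_tr, X_val, y_val, k0=3))[::-1][:20]
```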

Suppose now that the highest-ranking points are of a different true class $k_{i} \neq k_{0}.$ Because the predictions are wrong, we might suspect that the model is looking for irrelevant common features, or, in causal terms, confounders for the class posterior. Perhaps the data was gathered in different environments, and the model is sensitive to unsuspected features present across these (e.g., backgrounds or lighting conditions). In this case, we might want to try to mitigate the effect, either by improving the data collection process, by identifying and transforming the relevant features, or by changing the model.5 5 Of course, it could happen that the points with the highest value are of the same class, in which case we would use different tools to look for commonalities possibly causing the bad performance.

As a final example, consider a single misclassified data point $z.$ The influence function allows computing the most influential training points for $z.$ Upon locating them, these points can be explored with feature attribution methods to understand the cause of their influence and potentially improve the model.
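
A minimal sketch for an $\ell_2$-regularized binary logistic regression (labels in $\{0,1\}$), where the influence of a training point $z_j$ on the loss at $z$ takes the closed form $-\nabla_\theta L(z)^\top H^{-1} \nabla_\theta L(z_j)$ used in [Koh17U]; for deep models one resorts to approximations such as iterative Hessian-vector products.

```python
# Influence of each training point on the loss at a single test point z,
# for a binary logistic regression without intercept.
import numpy as np
from sklearn.linear_model import LogisticRegression

def influences_on_test_point(X_tr, y_tr, x_test, y_test, C=1.0):
    model = LogisticRegression(C=C, fit_intercept=False, max_iter=1000).fit(X_tr, y_tr)
    theta = model.coef_.ravel()
    n, d = X_tr.shape

    def grad(x, y):                                   # per-sample gradient of the log-loss
        return (1.0 / (1.0 + np.exp(-x @ theta)) - y) * x

    p = 1.0 / (1.0 + np.exp(-X_tr @ theta))
    # Hessian of the mean training loss, including sklearn's l2 term (1/(C n)).
    H = (X_tr * (p * (1 - p))[:, None]).T @ X_tr / n + np.eye(d) / (C * n)
    h_inv_g_test = np.linalg.solve(H, grad(x_test, y_test))
    # Positive value: upweighting the training point increases the test loss,
    # i.e. the point "hurts" this particular prediction.
    return np.array([-grad(X_tr[j], y_tr[j]) @ h_inv_g_test for j in range(n)])
```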

2.2 Sensitivity / robustness analysis

In contrast to the previous application, it is possible to study the effect of the removal of highly valuable or influential samples. [Bro21A] show how the removal of very few points can completely reverse the conclusions of a linear regression analysis, even when abundant data are available.

More precisely, they show on real datasets how removing less than 1% of the data can flip the sign of, and confidence interval around, the parameters of a regression model.6 6 For applications of this most influential subset method to moderately-sized LLMs we refer to [Fis23I]. Using a first-order approximation to the influence function, they define a new notion of robustness and estimate a lower bound on the number of samples that one must remove in order to achieve any desired changes to the conclusions of an analysis, e.g., effect sizes and their signs, or arbitrary scalar functions of parametric estimators.7 7 In this context, effect refers to the magnitude of the parameters in a linear regression model.
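
A sketch in the spirit of this analysis for plain OLS, using the first-order (leverage-free) approximation of the change in a coefficient when dropping each point, and greedily accumulating removals until the sign flips. The helper and its bookkeeping are hypothetical, not the procedure of [Bro21A], which handles general estimators and provides bounds.

```python
# Approximate the smallest fraction of points whose removal flips the sign of
# an OLS coefficient (first-order, greedy).
import numpy as np

def min_removal_to_flip(X, y, coef_index):
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # First-order change of beta[coef_index] when dropping point i:
    # delta_i ≈ -[(XᵀX)⁻¹ x_i]_k * e_i  (the leverage correction is omitted).
    delta = -(X @ XtX_inv)[:, coef_index] * resid
    sign = np.sign(beta[coef_index])
    order = np.argsort(sign * delta)          # points pushing hardest towards a flip first
    running = beta[coef_index]
    for m, i in enumerate(order, start=1):
        running += delta[i]                   # additive approximation for group removal
        if np.sign(running) != sign:
            return m, m / len(y)              # count and fraction of points to remove
    return None                               # no flip found under this approximation
```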

Interestingly, the sample bound they obtain is unrelated to classical notions of robustness in statistics, i.e., it is not driven by model misspecification or outliers. Instead, it is roughly a function of the ratio between the uncertainty in the effect that one tries to estimate and the noise in the data. The consequence is that even correctly specified models and non-contaminated observations can yield models highly sensitive to the removal of very few data points if the estimands are small with respect to the noise.

If only a few data points are very influential (the model is not robust in their sense), one has to ponder whether the modeling approach taken is sound, whether the data were properly gathered, or whether there is perhaps some intrinsic quality of the problem that requires further analysis.

3 Attacks

Valuation can help in the detection of manipulation, theft and contamination of data. Here, we mention just two applications.

3.1 Watermark removal

Watermarking in the context of ML consists in developing models whose origin can be ascertained, e.g., by testing them against specially crafted samples. This aids developers of proprietary models in finding out whether their licensed architecture and weights are being misused.8 8 For instance, [Adi18T] adds a so-called trigger set of random samples (e.g., abstract images) with random labels to the training set and trains the model to memorize it. Because of this random nature, if a deployed model correctly labels samples from this trigger set, it must be the one trained on them.

[Jia21S] suggests an attack against watermarking based on data valuation: points of low value in a training set are likely to be part of the watermarking mechanism since they don’t contribute to performance on a correct validation set (i.e., one without watermarks). Note that the usefulness of such an attack is debatable, as an attacker is unlikely to have access to the full training set used by the developers of a proprietary model. Nevertheless, the experimental results in [Jia21S] suggest that data values (and in particular surrogate ones that can be computed quickly) can help in identifying samples blatantly out of distribution, given the right conditions, thus supporting the use of valuation as a method for anomaly detection.

3.2 Poisoning attacks

[Koh17U] proposes employing the influence function to design training points that increase error. For example, in the context of i.i.d. parameter estimation for an r.v. $X$ with range in $\mathbb{R}^d,$ this means choosing a perturbation $\delta \in \mathbb{R}^d$ such that for a given influential point $x_{i},$ the shifted $x_{i} + \delta$ induces a large change in the estimator. The same idea applies for regression problems.
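
Roughly, the first-order analysis in [Koh17U] gives the change in the loss at a test point $z_{\text{test}}$ when a training point $z = (x, y)$ is perturbed to $(x + \delta, y)$:

$$\nabla_{\delta}\, L\big(z_{\text{test}}, \hat{\theta}_{z \to z_\delta}\big)\Big|_{\delta = 0} \approx -\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top}\, H_{\hat{\theta}}^{-1}\, \nabla_{x} \nabla_{\theta} L(z, \hat{\theta}),$$

where $\hat{\theta}$ are the fitted parameters, $H_{\hat{\theta}}$ is the Hessian of the empirical risk, and $\hat{\theta}_{z \to z_\delta}$ denotes the parameters obtained after replacing $z$ with the perturbed point. The attack then takes (projected) steps for $\delta$ along this gradient so that the shifted point maximally increases the test loss.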

The feasibility of such an attack, which requires access to the model, training procedure, and data, is questionable. While one might consider using this method to strengthen a model’s robustness via adversarial training, it is unclear whether this particular form of adversarial examples would lead to good robustness and, crucially, whether such robustness is relevant across all applications.9 9 It’s worth mentioning the interesting work by [Tao20M] which conducts extensive experiments with hundreds of models in computer vision. They come to the conclusion that any ability to generalize to “natural” distribution shifts (e.g., data from the same source but collected differently, as opposed to synthetic modifications) comes at the price of reduced in-distribution performance.

4 Data markets

An escalating demand for data has long been observed across all industries, propelled by expanding data collection operations within organizations and from consumer devices. This surge has motivated the emergence of solutions to connect providers and consumers of data, incorporating mechanisms for economic compensation. Market pricing naturally depends on the value added for the buyer (e.g., an expected increase in prediction accuracy), but it also considers the seller’s perspective differently across scenarios: In business-to-business (B2B), the price will also reflect the costs of the seller’s data acquisition. In a business-to-consumer (B2C) context, the price further accounts for the requisite level of privacy of the end-user generating the data.10 10 Additionally, the European Data Act requires individuals to have the right to choose data processors for any data harvested from them. Data markets might then include end-users as stakeholders.

Two-sided data markets are one of the applications first proposed for data valuation [Aga19M].11 11 Data markets can be classified into sell-side, buy-side, and two-sided markets [Zha23S]. In sell-side markets, data’s worth is gauged by the information it provides to consumers, e.g., by the expected gain in performance of some model or metric. In buy-side markets, data signifies an owner’s cost of acquisition or the value of their privacy, with different privacy concepts determining the measure of privacy loss. In two-sided markets, data holds value from both perspectives. The goal is to connect data providers with data consumers, either directly or through a broker. Examples include retail stores gathering customer data, logistics companies optimizing their warehouse planning, or radiology centers sharing data with developers of medical diagnosis software.

An application to marketplaces requires a notion of value that assigns “fair” prices to the data. The game-theoretic value functions we discussed above, while providing a certain sense of fairness, have some limitations in this context, for instance, in that they do not intrinsically protect against duplicates and other adversarial behaviors. Nevertheless, there has been progress in this area using Shapley values in certain settings [Aga19M], and even in the context of federated learning [Wan20P], where the order of arrival is important (as opposed to the usual assumption when using Shapley values). These works posit two-sided markets with a central data broker and must make certain simplifying assumptions, but there is an extensive literature on the subject.12 12 For a full review of different approaches, we refer to [Zha23S].

Valuation approaches like LAVA [Jus23L] or CRAIG [Mir20C] target a scenario where the model is not, and cannot be, given in advance, which can sometimes be a more convenient assumption in this context.13 13 Although, as mentioned, CRAIG is not strictly for valuation. In particular, consider the scenario where multiple data providers upload their data to a broker, who then sells access to it to a model developer. The broker can use valuation to set prices for the data without having access to the developer’s model, or having to fix a specific downstream task. In a different scenario, providers upload only samples of their data, which are then used by the developers to gauge their usefulness.

References
