LAVA: Data Valuation Without Pre-Specified Learning Algorithms

Today’s talk is about LAVA, an Optimal-Transport-based approach to data valuation that dispenses with training a model to compute values.

Abstract

We introduce LAVA [Jus23L], a novel approach to Data Valuation (DV) that values training data in a way that is oblivious to the downstream learning algorithm, along with some recent extensions. Our main results are as follows:

  1. We develop a proxy for the validation performance associated with a training set, based on a non-conventional class-wise Wasserstein distance between the training and validation sets. We show that this distance upper-bounds the validation performance of any given model under certain Lipschitz conditions.
  2. We develop a novel method to value individual data points based on a sensitivity analysis of the class-wise Wasserstein distance. Importantly, these values can be obtained for free from the output of off-the-shelf optimization solvers when computing the distance.
  3. We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over SOTA performance while being orders of magnitude faster.
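The ideas in points 1–2 can be sketched in a few lines. The following is a toy illustration, not the paper's exact construction: the label penalty, cost function, and calibration formula are simplified stand-ins, and all variable names are illustrative. It solves the optimal transport problem between a small training and validation set as a linear program, then reads per-point values off the dual potentials that the solver returns as a by-product.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Toy training and validation sets: features plus integer class labels.
n, m, d = 8, 6, 5
Xt, yt = rng.normal(size=(n, d)), rng.integers(0, 2, n)
Xv, yv = rng.normal(size=(m, d)), rng.integers(0, 2, m)

# Class-wise cost: Euclidean feature distance plus a penalty when labels
# differ (a crude stand-in for the paper's label-aware ground metric).
C = np.linalg.norm(Xt[:, None, :] - Xv[None, :, :], axis=2)
C = C + 10.0 * (yt[:, None] != yv[None, :])

a = np.full(n, 1.0 / n)  # uniform marginal over training points
b = np.full(m, 1.0 / m)  # uniform marginal over validation points

# OT as a linear program: min <C, P>  s.t.  row sums = a, col sums = b, P >= 0.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # row-sum (training-marginal) constraints
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # column-sum (validation-marginal) constraints

res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs")
dist = res.fun                         # the class-wise Wasserstein distance
f = res.eqlin.marginals[:n]            # dual potentials of the training points

# Calibrated per-point value: how the distance shifts when point i is
# upweighted relative to the rest (sensitivity of the dual objective).
values = f - (f.sum() - f) / (n - 1)
ranking = np.argsort(-values)          # highest values flag likely low-quality points
```

The key point is the last three lines: the dual potentials `f` come out of the solver for free once the distance itself is computed, so no model is ever trained to produce the per-point values.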

Bio

Feiyang Kang is a Ph.D. student at Virginia Tech under the supervision of Prof. Ruoxi Jia. His focus lies in data-centric AI and trustworthy ML, and he has published work on data valuation using optimal transport, influence functions, and data acquisition, among other topics.

Follow Feiyang on LinkedIn or check his Google Scholar profile for updates on his work.

References