Datasheets for datasets

Standardization is a fundamental part of many industries, such as electronics, medicine or construction. Components, processes and systems are accompanied by data sheets that precisely specify their composition, operating parameters and intended use. For the two staples of the ML industry, models and data, progress in this direction has come only recently.

A few years ago, model cards were introduced by Google and have seen moderate acceptance. But data, being what ultimately determines the success of an ML system, arguably deserves even more attention. The process by which a dataset was gathered, and who funded it, can shed light on hidden biases, societal or otherwise. Unacknowledged pre-processing steps and choices made by different stakeholders can negatively affect performance at deployment. Matters of distribution and maintenance might disqualify a dataset for a particular project.

The authors of [Geb21D], in collaboration with product teams, scientists and lawyers, have developed a questionnaire that helps both the teams collecting data and the teams using it to address these questions. It is a valuable resource for any organization that takes its data seriously. You can download a template here.
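For teams that prefer to keep the datasheet next to the data, the questionnaire can also be tracked in code. The following is only a minimal sketch, not the authors' template: the seven section names follow the question groups in [Geb21D], while the `Datasheet` class, its methods and the plain-text layout are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Section names follow the seven groups of questions in [Geb21D].
SECTIONS = (
    "Motivation",
    "Composition",
    "Collection process",
    "Preprocessing/cleaning/labeling",
    "Uses",
    "Distribution",
    "Maintenance",
)


def _empty_answers():
    # One empty question -> answer mapping per section.
    return {section: {} for section in SECTIONS}


@dataclass
class Datasheet:
    """Hypothetical container for questionnaire answers (an assumption, not part of [Geb21D])."""

    dataset_name: str
    answers: dict = field(default_factory=_empty_answers)

    def answer(self, section, question, text):
        # Reject sections that are not part of the questionnaire.
        if section not in self.answers:
            raise KeyError(f"unknown section: {section}")
        self.answers[section][question] = text

    def unanswered_sections(self):
        # Sections without a single recorded answer, in questionnaire order.
        return [s for s in SECTIONS if not self.answers[s]]

    def to_text(self):
        # Render the answered questions as plain text, one block per section.
        lines = [f"Datasheet: {self.dataset_name}"]
        for section in SECTIONS:
            if not self.answers[section]:
                continue
            lines.append("")
            lines.append(section)
            for question, text in self.answers[section].items():
                lines.append(f"  Q: {question}")
                lines.append(f"  A: {text}")
        return "\n".join(lines)


sheet = Datasheet("toy-dataset")
sheet.answer("Motivation", "For what purpose was the dataset created?", "Internal demo.")
print(sheet.to_text())
print("Still missing:", ", ".join(sheet.unanswered_sections()))
```

Listing the unanswered sections makes it easy to block a dataset release until every part of the questionnaire has at least been considered.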

References

  • Datasheets for datasets, Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, Kate Crawford. Communications of the ACM (2021)