research-article

DescribeML: a tool for describing machine learning datasets

Authors:
Joan Giner-Miguelez, Universitat Oberta de Catalunya, UOC, Barcelona, Spain
Abel Gómez, Universitat Oberta de Catalunya, UOC, Barcelona, Spain
Jordi Cabot, Universitat Oberta de Catalunya, UOC, Barcelona, Spain

MODELS '22: Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings, October 2022, Pages 22–26
https://doi.org/10.1145/3550356.3559087

Published: 09 November 2022

ABSTRACT

Datasets play a central role in the training and evaluation of machine learning (ML) models. However, they are also the root cause of many undesired model behaviors, such as biased predictions. To overcome this situation, the ML community is proposing a data-centric cultural shift, where data issues are given the attention they deserve, for instance, by proposing standard descriptions for datasets.

Inspired by these proposals, we present a model-driven tool to precisely describe machine learning datasets in terms of their structure, data provenance, and social concerns. Our tool aims to help any ML initiative leverage and benefit from this data-centric shift in ML (e.g., by selecting the most appropriate dataset for a new project or better replicating other ML results). The tool is implemented with the Langium workbench as a Visual Studio Code plugin and published as open source.
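
As a rough illustration of what a description covering structure, provenance, and social concerns might hold, the following Python sketch models a dataset description as plain data classes. It is not DescribeML's syntax or metamodel: every class name, field name, and example value below is hypothetical and chosen only to make the three dimensions concrete.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    """One column or field of the dataset (hypothetical structure)."""
    name: str
    data_type: str
    description: str = ""


@dataclass
class Provenance:
    """How the data was gathered and labeled (hypothetical fields)."""
    gathering_process: str
    annotators: str = "unspecified"


@dataclass
class SocialConcern:
    """A documented social issue, such as a known bias (hypothetical fields)."""
    kind: str
    description: str


@dataclass
class DatasetDescription:
    """Groups the three dimensions named in the abstract (hypothetical model)."""
    title: str
    attributes: List[Attribute] = field(default_factory=list)
    provenance: Optional[Provenance] = None
    social_concerns: List[SocialConcern] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of the dimensions that are still undocumented."""
        missing = []
        if not self.attributes:
            missing.append("structure")
        if self.provenance is None:
            missing.append("provenance")
        if not self.social_concerns:
            missing.append("social concerns")
        return missing


# Invented example values, not taken from any real dataset.
description = DatasetDescription(
    title="Example reviews dataset",
    attributes=[Attribute("review_text", "string"), Attribute("label", "category")],
    provenance=Provenance(gathering_process="Collected from a public web forum"),
)
print(description.missing_sections())  # prints ['social concerns']
```

A check such as `missing_sections` hints at how a structured description could help when comparing candidate datasets for a new project: gaps in documentation become visible instead of staying implicit.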

Published in
MODELS '22: Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings
October 2022
1003 pages
ISBN: 9781450394673
DOI: 10.1145/3550356
Conference Chairs:
Thomas Kühn, Karlsruhe Institute of Technology, Germany
Vasco Sousa, University of Montréal, Canada
Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 118 of 382 submissions, 31%