ABSTRACT
Datasets play a central role in the training and evaluation of machine learning (ML) models. But they are also the root cause of many undesired model behaviors, such as biased predictions. To address this, the ML community is proposing a data-centric cultural shift in which data issues are given the attention they deserve, for instance by proposing standard descriptions for datasets.
Inspired by these proposals, we present a model-driven tool to precisely describe ML datasets in terms of their structure, data provenance, and social concerns. Our tool aims to help any ML initiative leverage and benefit from this data-centric shift (e.g., by selecting the most appropriate dataset for a new project or by better replicating other ML results). The tool is implemented with the Langium workbench as a Visual Studio Code plugin and is published as open source.