Published in: Empirical Software Engineering 7/2022

1 December 2022

Extracting enhanced artificial intelligence model metadata from software repositories

Authors: Jason Tsay, Alan Braz, Martin Hirzel, Avraham Shinnar, Todd Mummert

Abstract

While artificial intelligence (AI) models have improved at understanding large-scale data, understanding AI models themselves at any scale is difficult. For example, even two models that implement the same network architecture may differ in frameworks, datasets, or even domains. Furthermore, attempting to use either model often requires much manual effort to understand it. As software engineering and AI development share many of the same languages and tools, techniques in mining software repositories should enable more scalable insights into AI models and AI development. However, much of the relevant metadata around models is not easily extractable. This paper (an extension of our MSR 2020 paper) presents a library called AIMMX for AI Model Metadata eXtraction from software repositories into enhanced metadata that conforms to a flexible metadata schema. We evaluated AIMMX against 7,998 open-source models from three sources: model zoos, arXiv AI papers, and state-of-the-art AI papers. We also explored how AIMMX can enable studies and tools to advance engineering support for AI development. As preliminary examples, we present an exploratory analysis for data and method reproducibility over the models in the evaluation dataset and a catalog tool for discovering and managing models. We also demonstrate the flexibility of extracted metadata by using the evaluation dataset in an existing natural language processing (NLP) analysis platform to identify trends in the dataset. Overall, we hope AIMMX fosters research towards better AI development.

References
Baudart G, Hirzel M, Kate K, Ram P, Shinnar A, Tsay J (2021) Pipeline combinators for gradual AutoML. In: Advances in neural information processing systems (NeurIPS)
Baudart G, Kirchner P, Hirzel M, Kate K (2020) Mining documentation to extract hyperparameter schemas. In: ICML Workshop on automated machine learning (AutoML@ICML). arXiv:2006.16984
Braiek H B, Khomh F, Adams B (2018) The Open-Closed principle of modern machine learning frameworks. In: Conference on mining software repositories (MSR), pp 353–363
Breck E, Polyzotis N, Roy S, Whang S E, Zinkevich M (2019) Data validation for machine learning. In: Conference on systems and machine learning (SysML)
Chelba C, Mikolov T, Schuster M, Ge Q, Brants T, Koehn P (2013) One billion word benchmark for measuring progress in statistical language modeling. CoRR arXiv:1312.3005
Conneau A, Schwenk H, LeCun Y, Barrault L (2017) Very deep convolutional networks for text classification. In: 15th conference of the European chapter of the Association for Computational Linguistics (EACL 2017), Long papers. Association for Computational Linguistics (ACL), pp 1107–1116
Devlin J, Chang M W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Gonzalez D, Zimmermann T, Nagappan N (2020) The state of the ML-universe: 10 years of artificial intelligence & machine learning software development on GitHub. In: Proceedings of the 17th international conference on mining software repositories, MSR ’20. Association for Computing Machinery, New York, pp 431–442. https://doi.org/10.1145/3379597.3387473
Guazzelli A, Zeller M, Lin W C, Williams G, et al. (2009) PMML: an open standard for sharing models. R J 1(1):60–65
Gundersen O E, Kjensmo S (2017) State of the art: reproducibility in artificial intelligence. In: Conference on artificial intelligence (AAAI)
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)
Hill C, Bellamy R, Erickson T, Burnett M (2016) Trials and tribulations of developers of intelligent systems: a field study. In: Symposium on visual languages and human-centric computing (VL/HCC), pp 162–170
Ma Y, Fakhoury S, Christensen M, Arnaoudova V, Zogaan W, Mirakhorli M (2018) Automatic classification of software artifacts in Open-Source applications. In: Conference on mining software repositories (MSR), pp 414–425
Miao H, Li A, Davis L S, Deshpande A (2016) ModelHub: towards unified data and lifecycle management for deep learning. CoRR. arXiv:1611.06224
Publio G C, Esteves D, Ławrynowicz A, Panov P, Soldatova L, Soru T, Vanschoren J, Zafar H (2018) ML schema: exposing the semantics of machine learning with schemas and ontologies. In: Reproducibility in machine learning workshop (RML). https://openreview.net/forum?id=B1e8MrXVxQ
Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF (eds) Medical image computing and computer-assisted intervention (MICCAI 2015). Springer International Publishing, Cham
Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, Chaudhary V, Young M, Crespo J F, Dennison D (2015) Hidden technical debt in machine learning systems. In: Conference on neural information processing systems (NIPS), pp 2503–2511
Szegedy C, Ioffe S, Vanhoucke V, Alemi A A (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Conference on artificial intelligence (AAAI)
Tramèr F, Zhang F, Juels A, Reiter M K, Ristenpart T (2016) Stealing machine learning models via prediction APIs. In: USENIX security symposium, pp 601–618
Tsay J, Mummert T, Bobroff N, Braz A, Hirzel M (2018) Runway: machine learning model experiment management tool. In: Conference on systems and machine learning (SysML)
Tsay J, Braz A, Hirzel M, Shinnar A, Mummert T (2020) AIMMX: artificial intelligence model metadata extractor. In: Proceedings of the 17th international conference on mining software repositories, MSR ’20. Association for Computing Machinery, New York, pp 81–92. https://doi.org/10.1145/3379597.3387448
Witten I H, Frank E, Hall M A, Pal C J (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann
Metadata
Title
Extracting enhanced artificial intelligence model metadata from software repositories
Authors
Jason Tsay
Alan Braz
Martin Hirzel
Avraham Shinnar
Todd Mummert
Publication date
1 December 2022
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 7/2022
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-022-10206-6
