Empirical Software Engineering

Published in:

01-12-2022

Extracting enhanced artificial intelligence model metadata from software repositories

Authors: Jason Tsay, Alan Braz, Martin Hirzel, Avraham Shinnar, Todd Mummert

Published in: Empirical Software Engineering | Issue 7/2022

Abstract

While artificial intelligence (AI) models have improved at understanding large-scale data, understanding AI models themselves at any scale is difficult. For example, even two models that implement the same network architecture may differ in frameworks, datasets, or even domains. Furthermore, attempting to use either model often requires much manual effort to understand it. As software engineering and AI development share many of the same languages and tools, techniques in mining software repositories should enable more scalable insights into AI models and AI development. However, much of the relevant metadata around models are not easily extractable. This paper (an extension of our MSR 2020 paper) presents a library called AIMMX for AI Model Metadata eXtraction from software repositories into enhanced metadata that conforms to a flexible metadata schema. We evaluated AIMMX against 7,998 open-source models from three sources: model zoos, arXiv AI papers, and state-of-the-art AI papers. We also explored how AIMMX can enable studies and tools to advance engineering support for AI development. As preliminary examples, we present an exploratory analysis for data and method reproducibility over the models in the evaluation dataset and a catalog tool for discovering and managing models. We also demonstrate the flexibility of extracted metadata by using the evaluation dataset in an existing natural language processing (NLP) analysis platform to identify trends in the dataset. Overall, we hope AIMMX fosters research towards better AI development.

At the time of publishing, their data are available under the CC BY-SA license.

https://cloud.ibm.com/catalog/services/natural-language-understanding

https://github.com/IBM/github-crawler

https://ibm.biz/ai-model-catalog-emse

Ajv (2018) Ajv: another JSON schema validator. https://ajv.js.org/. Accessed September 2018

Amershi S, Begel A, Bird C, DeLine R, Gall H, Kamar E, Nagappan N, Nushi B, Zimmermann T (2019) Software engineering for machine learning: a case study. In: International conference on software engineering: software engineering in practice (ICSE-SEIP). https://doi.org/10.1109/ICSE-SEIP.2019.00042, pp 291–300

Apache (2019) Apache CouchDB. https://couchdb.apache.org. Accessed 21 Jan 2019

GH Archive (2021) GH Archive. https://www.gharchive.org/. Accessed 27 Oct 2021

arXiv (1991) arXiv.org e-Print archive. https://arxiv.org/. Accessed 13 Mar 2020

arXiv (2018) arXiv.org help - arXiv API. https://arxiv.org/help/api/index. Accessed 13 Mar 2020

Augustsson L (1998) Cayenne—a language with dependent types. In: International conference on functional programming (ICFP). http://doi.acm.org/10.1145/289423.289451, pp 239–250

Bangash A A, Sahar H, Chowdhury S, Wong A W, Hindle A, Ali K (2019) What do developers know about machine learning: a study of ML discussions on StackOverflow. In: Conference on mining software repositories (MSR). https://doi.org/10.1109/MSR.2019.00052, pp 260–264

Baudart G, Hirzel M, Kate K, Ram P, Shinnar A (2020) Lale: consistent automated machine learning. In: KDD workshop on automation in machine learning (AutoML@KDD). arXiv:2007.01977

Baudart G, Hirzel M, Kate K, Ram P, Shinnar A, Tsay J (2021) Pipeline combinators for gradual AutoML. In: Advances in neural information processing systems (NeurIPS)

Baudart G, Kirchner P, Hirzel M, Kate K (2020) Mining documentation to extract hyperparameter schemas. In: ICML workshop on automated machine learning (AutoML@ICML). arXiv:2006.16984

Braiek H B, Khomh F, Adams B (2018) The Open-Closed principle of modern machine learning frameworks. In: Conference on mining software repositories (MSR), pp 353–363

Breck E, Polyzotis N, Roy S, Whang S E, Zinkevich M (2019) Data validation for machine learning. In: Conference on systems and machine learning (SysML)

Chelba C, Mikolov T, Schuster M, Ge Q, Brants T, Koehn P (2013) One billion word benchmark for measuring progress in statistical language modeling. CoRR arXiv:1312.3005

Papers with Code (2020) Papers with code: the latest in machine learning. https://paperswithcode.com. Accessed 13 Mar 2020

Conneau A, Schwenk H, Le Cun Y, Barrault L (2017) Very deep convolutional networks for text classification. In: 15th conference of the European chapter of the Association for Computational Linguistics (EACL 2017), long papers. Association for Computational Linguistics (ACL), pp 1107–1116

Dabbish L, Stuart C, Tsay J, Herbsleb J (2012) Social coding in GitHub: transparency and collaboration in an open software repository. In: Conference on computer supported cooperative work (CSCW). https://doi.org/10.1145/2145204.2145396, pp 1277–1286

Devlin J, Chang M W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805

GitHub (2016) GitHub API v3 | GitHub Developer Guide. https://developer.github.com/v3/. Accessed 13 Mar 2020

GitHub (2020) The world’s leading software development platform–GitHub. https://github.com/. Accessed 13 Mar 2020

Gonzalez D, Zimmermann T, Nagappan N (2020) The state of the ML-universe: 10 years of artificial intelligence & machine learning software development on GitHub. In: Proceedings of the 17th international conference on mining software repositories, MSR ’20. https://doi.org/10.1145/3379597.3387473. Association for Computing Machinery, New York, pp 431–442

Gousios G (2013) The GHTorrent dataset and tool suite. In: Proceedings of the 10th working conference on mining software repositories, MSR ’13. IEEE Press, Piscataway, pp 233–236. http://dl.acm.org/citation.cfm?id=2487085.2487132

Graves T L, Karr A F, Marron J S, Siy H (2000) Predicting fault incidence using software change history. IEEE Trans Softw Eng 26(7):653–661. https://doi.org/10.1109/32.859533

Guazzelli A, Zeller M, Lin W C, Williams G, et al. (2009) PMML: an open standard for sharing models. R J 1(1):60–65

Gundersen O E, Kjensmo S (2017) State of the art: reproducibility in artificial intelligence. In: Conference on artificial intelligence (AAAI)

He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)

Hill C, Bellamy R, Erickson T, Burnett M (2016) Trials and tribulations of developers of intelligent systems: a field study. In: Symposium on visual languages and human-centric computing (VL/HCC), pp 162–170

Hummer W, Muthusamy V, Rausch T, Dube P, El Maghraoui K, Murthi A, Oum P (2019) ModelOps: cloud-based lifecycle management for reliable and trusted AI. In: 2019 IEEE International conference on cloud engineering (IC2E). https://doi.org/10.1109/IC2E.2019.00025, pp 113–120

IBM (2020) Watson Discovery product page. https://www.ibm.com/cloud/watson-discovery. Accessed 12 Nov 2020

Internet Engineering Task Force (2018) JSON Schema specification. http://json-schema.org/specification.html. Accessed September 2018

Kalliamvakou E, Gousios G, Blincoe K, Singer L, German D M, Damian D (2014) The promises and perils of mining GitHub. In: Conference on mining software repositories (MSR). http://doi.acm.org/10.1145/2597073.2597074, pp 92–101

Kim M, Zimmermann T, DeLine R, Begel A (2016) The emerging role of data scientists on software development teams. In: International conference on software engineering (ICSE). http://doi.acm.org/10.1145/2884781.2884783, pp 96–107

Apache Lucene (2018) Apache Lucene. https://lucene.apache.org/. Accessed 23 Feb 2018

Ma Y, Fakhoury S, Christensen M, Arnaoudova V, Zogaan W, Mirakhorli M (2018) Automatic classification of software artifacts in Open-Source applications. In: Conference on mining software repositories (MSR), pp 414–425

Menzies T, Zimmermann T (2013) Software analytics: so what? IEEE Softw 30(4):31–37. https://doi.org/10.1109/MS.2013.86

Miao H, Li A, Davis L S, Deshpande A (2016) ModelHub: towards unified data and lifecycle management for deep learning. CoRR. arXiv:1611.06224

Miao H, Li A, Davis L S, Deshpande A (2017) On model discovery for hosted data science projects. In: Workshop on data management for end-to-end machine learning, DEEM’17. http://doi.acm.org/10.1145/3076246.3076252, pp 6:1–6:4

MLFlow (2019) MLFlow—a platform for the machine learning lifecycle. https://mlflow.org/. Accessed 13 Mar 2020

ONNX (2017) ONNX. https://onnx.ai/. Accessed 13 Mar 2020

Ostrand T J, Weyuker E J, Bell R M (2005) Predicting the location and number of faults in large software systems. IEEE Trans Softw Eng 31(4):340–355. https://doi.org/10.1109/TSE.2005.49

Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems 32. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf. Curran Associates, Inc., pp 8024–8035

Pezoa F, Reutter J L, Suarez F, Ugarte M, Vrgoč D (2016) Foundations of JSON schema. In: International conference on world wide web (WWW). https://doi.org/10.1145/2872427.2883029, pp 263–273

Pimentel J F, Murta L, Braganholo V, Freire J (2019) A large-scale study about quality and reproducibility of Jupyter notebooks. In: Conference on mining software repositories (MSR). https://doi.org/10.1109/MSR.2019.00077, pp 507–517

Pivarski J, Bennett C, Grossman R L (2016) Deploying analytics with the portable format for analytics (PFA). In: Conference on knowledge discovery and data mining (KDD). http://doi.acm.org/10.1145/2939672.2939731, pp 579–588

Publio G C, Esteves D, Ławrynowicz A, Panov P, Soldatova L, Soru T, Vanschoren J, Zafar H (2018) ML schema: exposing the semantics of machine learning with schemas and ontologies. In: Reproducibility in machine learning workshop (RML). https://openreview.net/forum?id=B1e8MrXVxQ

Rak-amnouykit I, Milanova A, Baudart G, Hirzel M, Dolby J (2021) Extracting hyperparameter constraints from code. In: ICLR workshop on security and safety in machine learning systems (SecML@ICLR). https://aisecure-workshop.github.io/aml-iclr2021/papers/18.pdf

Rodríguez C, Baez M, Daniel F, Casati F, Trabucco J C, Canali L, Percannella G (2016) REST APIs: a large-scale analysis of compliance with principles and best practices. In: International conference on web engineering (ICWE). https://doi.org/10.1007/978-3-319-38791-8_2, pp 21–39

Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF (eds) Medical image computing and computer-assisted intervention (MICCAI). Springer International Publishing, Cham

Sculley D, Holt G, Golovin D, Davydov E, Phillips T, Ebner D, Chaudhary V, Young M, Crespo J F, Dennison D (2015) Hidden technical debt in machine learning systems. In: Conference on neural information processing systems (NIPS), pp 2503–2511

Sethi A, Sankaran A, Panwar N, Khare S, Mani S (2018) DLPaper2Code: auto-generation of code from deep learning research papers. In: Conference on artificial intelligence (AAAI). https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17100, pp 7339–7346

Shah N (2019) ARXIV data from 24,000+ papers Version 2. https://www.kaggle.com/neelshah18/arxivdataset/home. Accessed 15 Jan 2019

Shaikh S, Vishwakarma H, Mehta S, Varshney K R, Ramamurthy K N, Wei D (2017) An end-to-end machine learning pipeline that ensures fairness policies. In: Data for good exchange. https://arxiv.org/abs/1710.06876

Smith M J, Sala C, Kanter J M, Veeramachaneni K (2019) The machine learning bazaar: harnessing the ML ecosystem for effective system development. https://arxiv.org/abs/1905.08942

Szegedy C, Ioffe S, Vanhoucke V, Alemi A A (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Conference on artificial intelligence (AAAI)

Trainer E H, Chaihirunkarn C, Kalyanasundaram A, Herbsleb J D (2015) From personal tool to community resource: what’s the extra work and who will do it? In: Conference on computer supported cooperative work (CSCW). http://doi.acm.org/10.1145/2675133.2675172, pp 417–430

Tramèr F, Zhang F, Juels A, Reiter M K, Ristenpart T (2016) Stealing machine learning models via prediction APIs. In: USENIX security symposium, pp 601–618

Tsay J, Mummert T, Bobroff N, Braz A, Hirzel M (2018) Runway: machine learning model experiment management tool. In: Conference on systems and machine learning (SysML)

Tsay J, Braz A, Hirzel M, Shinnar A, Mummert T (2020) AIMMX: artificial intelligence model metadata extractor. In: Proceedings of the 17th international conference on mining software repositories, MSR ’20. https://doi.org/10.1145/3379597.3387448. Association for Computing Machinery, New York, pp 81–92

Vanschoren J, van Rijn J N, Bischl B, Torgo L (2014) OpenML: networked science in machine learning. SIGKDD Explor Newsl 15(2):49–60. http://doi.acm.org/10.1145/2641190.2641198

Vartak M, Subramanyam H, Lee W E, Viswanathan S, Husnoo S, Madden S, Zaharia M (2016) ModelDB: a system for machine learning model management. In: Workshop on human-in-the-loop data analytics (HILDA). http://doi.acm.org/10.1145/2939502.2939516, pp 14:1–14:3

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser LU, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

Vaziri M, Mandel L, Shinnar A, Siméon J, Hirzel M (2017) Generating chat bots from web API specifications. In: Symposium on new ideas, new paradigms, and reflections on programming and software (Onward!). http://doi.acm.org/10.1145/3133850.3133864, pp 44–57

Wan Z, Xia X, Lo D, Murphy G C (2019) How does machine learning change software development practices? IEEE Trans Softw Eng 1. https://doi.org/10.1109/TSE.2019.2937083

Witten I H, Frank E, Hall M A, Pal C J (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann

Title: Extracting enhanced artificial intelligence model metadata from software repositories
Authors: Jason Tsay, Alan Braz, Martin Hirzel, Avraham Shinnar, Todd Mummert
Publication date: 01-12-2022
Publisher: Springer US
Published in: Empirical Software Engineering / Issue 7/2022
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI: https://doi.org/10.1007/s10664-022-10206-6