DOI: 10.1145/3379597.3387448
research-article

AIMMX: Artificial Intelligence Model Metadata Extractor

Published: 18 September 2020

ABSTRACT

Despite all of the power that machine learning and artificial intelligence (AI) models bring to applications, much of AI development is currently a fairly ad hoc process. Software engineering and AI development share many of the same languages and tools, but AI development as an engineering practice is still in early stages. Mining software repositories of AI models enables insight into the current state of AI development. However, much of the relevant metadata around models is not easily extractable directly from repositories and requires deduction or domain knowledge. This paper presents a library called AIMMX that enables simplified AI Model Metadata eXtraction from software repositories. The extractors have five modules for extracting AI model-specific metadata: model name, associated datasets, references, AI frameworks used, and model domain. We evaluated AIMMX against 7,998 open-source models from three sources: model zoos, arXiv AI papers, and state-of-the-art AI papers. Our platform extracted metadata with 87% precision and 83% recall. As preliminary examples of how AI model metadata extraction enables studies and tools to advance engineering support for AI development, this paper presents an exploratory analysis for data and method reproducibility over the models in the evaluation dataset and a catalog tool for discovering and managing models. Our analysis suggests that while data reproducibility may be relatively poor with 42% of models in our sample citing their datasets, method reproducibility is more common at 72% of models in our sample, particularly state-of-the-art models. Our collected models are searchable in a catalog that uses existing metadata to enable advanced discovery features for efficiently finding models.
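
As an illustration of the precision and recall figures reported above, the following minimal Python sketch shows how set-based precision and recall could be computed for one extracted metadata field. It is not the AIMMX API and not the paper's evaluation protocol, which the abstract does not describe: the `ModelMetadata` record, the `precision_recall` helper, and all example values are hypothetical and chosen only to mirror the five metadata kinds named in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class ModelMetadata:
    """Hypothetical record for the five metadata kinds named in the abstract."""
    name: str = ""
    datasets: set[str] = field(default_factory=set)
    references: set[str] = field(default_factory=set)
    frameworks: set[str] = field(default_factory=set)
    domain: str = ""


def precision_recall(extracted: set[str], expected: set[str]) -> tuple[float, float]:
    """Return set-based (precision, recall) for one metadata field.

    Precision is the share of extracted values that are correct; recall is
    the share of expected values that were extracted. An empty denominator
    yields 0.0 (an assumption made here for simplicity).
    """
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Made-up values for illustration; not taken from the paper's dataset.
extracted = ModelMetadata(name="example-model", frameworks={"tensorflow", "keras"})
labeled = ModelMetadata(name="example-model", frameworks={"tensorflow", "pytorch"})

p, r = precision_recall(extracted.frameworks, labeled.frameworks)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this sketch one of the two extracted frameworks matches the labeled set, so both measures come out at one half; how scores would be aggregated across fields and models is left open because the abstract does not specify it.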


Published in

MSR '20: Proceedings of the 17th International Conference on Mining Software Repositories
June 2020
675 pages
ISBN: 9781450375177
DOI: 10.1145/3379597

Copyright © 2020 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited
