Published in: Empirical Software Engineering 1/2023

01.01.2023

Selecting third-party libraries: the data scientist’s perspective

Authors: Sarah Nadi, Nourhan Sakr

Abstract

With the increased reliance on data-driven decisions and software services, data scientists are becoming an integral part of many software teams and enterprise operations. To perform their tasks, data scientists rely on various third-party libraries (e.g., pandas in Python for data wrangling or ggplot in R for data visualization). Selecting the right library is often difficult, with many factors influencing the choice. While there has been considerable research on the factors that software developers take into account when selecting a library, it is not clear whether these factors influence data scientists’ library selection in the same way, especially given several differences between the two groups. To address this gap, we replicate a recent survey of library selection factors, but target data scientists instead of software developers. Our survey of 90 participants shows that data scientists consider several factors when selecting libraries, with technical factors such as the usability of the library, fit for purpose, and documentation being the three most influential. Additionally, we find that data scientists rate 11 factors differently than software developers do. For example, data scientists are influenced more by the collective experience of the community but less by the library’s security or license. We also uncover new factors that influence data scientists’ library selection, such as the statistical rigor of the library. We triangulate our survey results with feedback from five focus groups involving 18 additional data science experts with various roles, whose input allows us to further interpret our survey results. We discuss the implications of our findings for data science library maintainers as well as for researchers who want to design recommender and/or comparison systems that help data scientists with library selection.

Footnotes
1
Note that the layout of the survey sometimes combines questions of different categories to optimize the flow of the survey. For example, we ask participants about their current role at the beginning of the factor ratings to contextualize the information, while we keep all optional demographic questions at the end. The exact survey we use is available on our artifact page (Artifact can be found at https://doi.org/10.6084/m9.figshare.16563885.v1).
 
3
Thanks to the authors for releasing their raw rating data (Larios Vargas et al. 2020a), which allowed us to reproduce their results and enabled a direct distribution comparison.
 
References
Abdalkareem R, Nourry O, Wehaibi S, Mujahid S, Shihab E (2017) Why do developers use trivial packages? an empirical case study on npm. In: Proceedings of the 11th joint meeting on foundations of software engineering, ser. ESEC/FSE 2017. https://doi.org/10.1145/3106237.3106267. Association for Computing Machinery, New York, pp 385–395
Biswas S, Wardat M, Rajan H (2021) The art and practice of data science pipelines: a comprehensive study of data science pipelines in theory, in-the-small, and in-the-large. arXiv:2112.01590
Czerwonka J, Nagappan N, Schulte W, Murphy B (2013) Codemine: building a software development data analytics platform at microsoft. IEEE Softw 30(4):64–71
De La Mora FL, Nadi S (2018a) An empirical study of metric-based comparisons of software libraries. In: Proceedings of the 14th international conference on predictive models and data analytics in software engineering, ser. PROMISE’18. https://doi.org/10.1145/3273934.3273937. Association for Computing Machinery, New York, pp 22–31
De La Mora FL, Nadi S (2018b) Which library should i use?: A metric-based comparison of software libraries. In: Proceedings of the 40th IEEE/ACM international conference on software engineering: new ideas and emerging technologies results (ICSE-NIER), pp 37–40
Dong H, Zhou S, Guo J, Kästner C (2021) Splitting, renaming, removing: a study of common cleaning activities in jupyter notebooks. In: Proceedings of the 9th international workshop on realizing artificial intelligence synergies in software engineering (RAISE), p 11
El-Hajj R, Nadi S (2020) LibComp: an IntelliJ plugin for comparing Java libraries. In: Proceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, ser. ESEC/FSE 2020. https://doi.org/10.1145/3368089.3417922. Association for Computing Machinery, New York, pp 1591–1595
Gizas A, Christodoulou S, Papatheodorou T (2012) Comparative evaluation of javascript frameworks. In: Proceedings of the 21st international conference on world wide web. WWW ’12 Companion. https://doi.org/10.1145/2187980.2188103. Association for Computing Machinery, New York, pp 513–514
Harris H, Murphy S, Vaisman M (2013) Analyzing the analyzers: an introspective survey of data scientists and their work. O’Reilly Media, Inc.
Hora A, Valente MT (2015) Apiwave: keeping track of api popularity and migration. In: Proceedings of the 31st IEEE international conference on software maintenance and evolution, ser. ICSME ’15. IEEE Computer Society, Washington, pp 321–323
Hu J, Joung J, Jacobs M, Gajos KZ, Seltzer MI (2020) Improving data scientist efficiency with provenance. In: 2020 IEEE/ACM 42nd international conference on software engineering (ICSE), pp 1086–1097
Kandel S, Paepcke A, Hellerstein JM, Heer J (2012) Enterprise data analysis and visualization: an interview study. IEEE Trans Vis Comput Graph 18(12):2917–2926
Kery MB, Radensky M, Arya M, John BE, Myers BA (2018) The story in the notebook: exploratory data science using a literate programming tool. In: Proceedings of the 2018 CHI conference on human factors in computing systems, pp 1–11
Kim M, Zimmermann T, DeLine R, Begel A (2016) The emerging role of data scientists on software development teams. In: Proceedings of the 38th IEEE/ACM international conference on software engineering (ICSE), IEEE, pp 96–107
Kim M, Zimmermann T, DeLine R, Begel A (2018) Data scientists in software teams: state of the art and challenges. IEEE Trans Softw Eng 44(11):1024–1038
Kontio J, Lehtola L, Bragge J (2004) Using the focus group method in software engineering: obtaining practitioner and user experiences. In: Proceedings of the international symposium on empirical software engineering (ISESE’04), IEEE, pp 271–280
Larios Vargas E, Aniche M, Treude C, Bruntink M, Gousios G (2020b) Selecting third-party libraries: the practitioners’ perspective. In: Proceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering (ESEC/FSE). https://doi.org/10.1145/3368089.3409711. Association for Computing Machinery, New York, pp 245–256
Ma Y, Mockus A, Zaretzki R, Bichescu B, Bradley R (2020) A methodology for analyzing uptake of software technologies among developers. IEEE Trans Softw Eng 48(2):485–501
Mileva YM, Dallmeier V, Burger M, Zeller A (2009) Mining trends of library usage. In: Proceedings of the joint international and annual ERCIM workshops on principles of software evolution (IWPSE) and software evolution (Evol) workshops, ser. IWPSE-Evol ’09. ACM, New York, pp 57–62
Muller M, Lange I, Wang D, Piorkowski D, Tsay J, Liao QV, Dugan C, Erickson T (2019) How data science workers work with data: discovery, capture, curation, design, creation. In: Proceedings of the 2019 CHI conference on human factors in computing systems, pp 1–15
Myers BA, Stylos J (2016) Improving api usability. Commun ACM 59(6):62–69
Nahar N, Zhou S, Lewis G, Kästner C (2022) Collaboration challenges in building ml-enabled systems: communication, documentation, engineering, and process. In: Proceedings of the 44th international conference on software engineering (ICSE ’22)
Nguyen G, Dlugolinsky S, Bobák M, Tran V, García ÁL, Heredia I, Malík P, Hluchỳ L (2019) Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif Intell Rev 52(1):77–124
Ni A, Ramos D, Yang AZH, Lynce I, Manquinho V, Martins R, Le Goues C (2021) Soar: a synthesis approach for data science api refactoring. In: 2021 IEEE/ACM 43rd international conference on software engineering (ICSE), pp 112–124
Pano A, Graziotin D, Abrahamsson P (2018) Factors and actors leading to the adoption of a javascript framework. Empir Softw Eng 23(6):3503–3534
Patil DJ (2011) Building data science teams. O’Reilly Media, Inc.
Piccioni M, Furia CA, Meyer B (2013) An empirical study of api usability. In: ACM/IEEE international symposium on empirical software engineering and measurement, pp 5–14
Pressman RS (2005) Software engineering: a practitioner’s approach. Palgrave Macmillan
Psallidas F, Zhu Y, Karlas B, Interlandi M, Floratou A, Karanasos K, Wu W, Zhang C, Krishnan S, Curino C, et al. (2019) Data science through the looking glass and what we found there. arXiv:1912.09536
Ralph P, bin Ali N, Baltes S, Bianculli D, Diaz J, Dittrich Y, Ernst N, Felderer M, Feldt R, Filieri A, de França BBN, Furia CA, Gay G, Gold N, Graziotin D, He P, Hoda R, Juristo N, Kitchenham B, Lenarduzzi V, Martínez J, Melegati J, Mendez D, Menzies T, Molleri J, Pfahl D, Robbes R, Russo D, Saarimäki N, Sarro F, Taibi D, Siegmund J, Spinellis D, Staron M, Stol K, Storey M-A, Taibi D, Tamburri D, Torchiano M, Treude C, Turhan B, Wang X, Vegas S (2020) Empirical standards for software engineering research. arXiv:2010.03525
Robillard MP, DeLine R (2011) A field study of API learning obstacles. Empir Softw Eng 16(6):703–732
Stančin I, Jović A (2019) An overview and comparison of free python libraries for data mining and big data analysis. In: 42nd international convention on information and communication technology, electronics and microelectronics (MIPRO), IEEE, pp 977–982
Teyton C, Falleri J-R, Blanc X (2012) Mining library migration graphs. In: Proceedings of the 19th working conference on reverse engineering (WCRE), pp 289–298
Teyton C, Falleri J-R, Palyart M, Blanc X (2014) A study of library migrations in java. J Softw Evol Process 26(11):1030–1052
Thung F, Lo D, Lawall J (2013) Automated library recommendation. In: Proceedings of the 20th working conference on reverse engineering (WCRE), pp 182–191
Uddin G, Khomh F (2017) Automatic summarization of API reviews. In: Proceedings of the 32nd IEEE/ACM international conference on automated software engineering, ser. ASE ’17
Xu B, An L, Thung F, Khomh F, Lo D (2020) Why reinventing the wheels? an empirical study on library reuse and re-implementation. Empir Softw Eng 25(1):755–789
Yang C, Zhou S, Guo JL, Kästner C (2021) Subtle bugs everywhere: generating documentation for data wrangling code. In: Proceedings of the 36th IEEE/ACM international conference on automated software engineering (ASE), vol 11
Metadata
Title
Selecting third-party libraries: the data scientist’s perspective
Authors
Sarah Nadi
Nourhan Sakr
Publication date
01.01.2023
Publisher
Springer US
Published in
Empirical Software Engineering / Issue 1/2023
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI
https://doi.org/10.1007/s10664-022-10241-3
