ABSTRACT
Machine learning (ML) models are now routinely deployed in domains ranging from criminal justice to healthcare. With this newfound ubiquity, ML has moved beyond academia and grown into an engineering discipline. To that end, interpretability tools have been designed to help data scientists and machine learning practitioners better understand how ML models work. However, there has been little evaluation of the extent to which these tools achieve this goal. We study data scientists' use of two existing interpretability tools, the InterpretML implementation of GAMs and the SHAP Python package. We conduct a contextual inquiry (N=11) and a survey (N=197) of data scientists to observe how they use interpretability tools to uncover common issues that arise when building and evaluating ML models. Our results indicate that data scientists over-trust and misuse interpretability tools. Furthermore, few of our participants were able to accurately describe the visualizations output by these tools. We highlight qualitative themes for data scientists' mental models of interpretability tools. We conclude with implications for researchers and tool designers, and contextualize our findings in the social science literature.
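The SHAP package studied here attributes a model's prediction to its input features using Shapley values from cooperative game theory. As a sketch of that underlying idea (not SHAP's actual API), the snippet below computes exact Shapley values for a hypothetical two-feature model by enumerating feature subsets; the feature names and scores are illustrative, and the real package uses efficient approximations instead of brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    # Exact Shapley values: each feature's marginal contribution to the
    # model output, averaged over all subsets of the other features.
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(set(subset) | {f}) - value(set(subset)))
        phi[f] = total
    return phi

def toy_model(present):
    # Hypothetical additive score with one interaction term.
    score = 2.0 * ("age" in present) + 1.0 * ("income" in present)
    if {"age", "income"} <= present:
        score += 1.0  # interaction credit is split between the two features
    return score

print(shapley_values(["age", "income"], toy_model))
# The attributions sum to the model's output on the full feature set
# (the "efficiency" property that SHAP visualizations rely on).
```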
Supplemental Material
The supplemental material is a zip file containing eight PDFs:

1. Interview protocol for pilot interviews with data scientists.
2. Tutorial for Generalized Additive Models (GAMs) used for our contextual inquiry.
3. Tutorial for SHAP used for our contextual inquiry.
4. Questions about the dataset and model asked during our contextual inquiry.
5. Survey protocol.
6. Introduction to the dataset and model, and tutorial for GAMs, used for our survey.
7. Introduction to the dataset and model, and tutorial for SHAP, used for our survey.
8. A figure showing the percentage of participants with low, neutral, and high deployment scores for the model used in their condition.
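To give a sense of what the GAM tutorials cover: a GAM predicts through a sum of per-feature shape functions, so each feature's effect can be plotted on its own axis, which is the source of the model class's interpretability. The sketch below uses hypothetical, hand-written shapes and a logistic link; the feature names and coefficients are illustrative, and this is not InterpretML's API (which fits the shapes from data).

```python
import math

# Hypothetical fitted shape functions for a toy risk model; in a real GAM
# each shape is learned from data and can be inspected independently.
shape_functions = {
    "age": lambda a: 0.03 * (a - 40),
    "bmi": lambda b: 0.1 * max(b - 25.0, 0.0),
}
intercept = -1.0

def predict_proba(x):
    # Additive structure: logit = b0 + f_age(age) + f_bmi(bmi),
    # pushed through a logistic link to yield a probability.
    logit = intercept + sum(f(x[name]) for name, f in shape_functions.items())
    return 1.0 / (1.0 + math.exp(-logit))

print(round(predict_proba({"age": 50, "bmi": 30}), 3))
```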
Index Terms
- Interpreting Interpretability: Understanding Data Scientists' Use of Interpretability Tools for Machine Learning