ABSTRACT
As machine learning black boxes are increasingly deployed in critical domains such as healthcare and criminal justice, there has been a growing emphasis on developing techniques for explaining these black boxes in a human-interpretable manner. Recent work has raised the concern that a high-fidelity explanation of a black box ML model may not accurately reflect the model's biases; as a consequence, explanations have the potential to mislead human users into trusting a problematic black box. In this work, we rigorously explore the notion of misleading explanations and how they influence user trust in black box models. Specifically, we propose a novel theoretical framework for understanding and generating misleading explanations, and we carry out a user study with domain experts to demonstrate how these explanations can mislead users. Our work is the first to empirically establish how user trust in black box models can be manipulated via misleading explanations.
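To make the core concern concrete, the sketch below is illustrative only; it is not the framework proposed in the paper, and the feature names, the data-generating process, and the scikit-learn surrogate are all assumptions. It shows how an explanation can achieve high fidelity to a black box while never mentioning the sensitive feature the black box actually relies on, because a correlated proxy feature stands in for it.

```python
# Minimal, hypothetical sketch (not the authors' method): a surrogate
# "explanation" that never sees a sensitive feature can still closely
# mimic a black box that relies on it, if a correlated proxy exists.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 5000
sensitive = rng.integers(0, 2, n)                        # protected attribute (assumed)
proxy = (sensitive + (rng.random(n) < 0.1)).clip(0, 1)   # correlated stand-in, ~95% agreement
other = rng.random(n)                                    # an innocuous feature

# "Black box": decides primarily on the sensitive feature.
black_box_pred = ((sensitive == 1) | (other > 0.9)).astype(int)

# Surrogate explanation fit WITHOUT the sensitive feature: it can only
# cite `proxy` and `other`, yet it agrees with the black box often.
X_expl = np.column_stack([proxy, other])
surrogate = DecisionTreeClassifier(max_depth=2).fit(X_expl, black_box_pred)

fidelity = (surrogate.predict(X_expl) == black_box_pred).mean()
print(f"fidelity of sensitive-feature-free explanation: {fidelity:.2%}")
# High fidelity here does NOT mean the black box ignores the sensitive
# feature: the explanation is faithful on average, yet misleading.
```

The sketch hinges on the fact that fidelity is measured as agreement with the black box's predictions, so an explanation restricted to innocuous features can score well whenever correlated proxies exist; that gap is exactly what a misleading explanation exploits.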