ABSTRACT
As machine learning black boxes are increasingly deployed in critical domains such as healthcare and criminal justice, there has been a growing emphasis on developing techniques for explaining these black boxes in a human-interpretable manner. Recent work has raised the concern that a high-fidelity explanation of a black box ML model may not accurately reflect the model's biases; as a consequence, explanations have the potential to mislead human users into trusting a problematic black box. In this work, we rigorously explore the notion of misleading explanations and how they influence user trust in black box models. Specifically, we propose a novel theoretical framework for understanding and generating misleading explanations, and we carry out a user study with domain experts to demonstrate how these explanations can mislead users. Our work is the first to empirically establish how user trust in black box models can be manipulated via misleading explanations.
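To make the core concern concrete, the sketch below is illustrative only; it is not the framework proposed in the paper, and the feature names, the data-generating process, and the scikit-learn surrogate are all assumptions. It shows how an explanation can achieve high fidelity to a black box while never mentioning the sensitive feature the black box actually relies on, because a correlated proxy feature stands in for it.

```python
# Minimal, hypothetical sketch (not the authors' method): a surrogate
# "explanation" that never sees a sensitive feature can still closely
# mimic a black box that relies on it, if a correlated proxy exists.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 5000
sensitive = rng.integers(0, 2, n)                        # protected attribute (assumed)
proxy = (sensitive + (rng.random(n) < 0.1)).clip(0, 1)   # correlated stand-in, ~95% agreement
other = rng.random(n)                                    # an innocuous feature

# "Black box": decides primarily on the sensitive feature.
black_box_pred = ((sensitive == 1) | (other > 0.9)).astype(int)

# Surrogate explanation fit WITHOUT the sensitive feature: it can only
# cite `proxy` and `other`, yet it agrees with the black box often.
X_expl = np.column_stack([proxy, other])
surrogate = DecisionTreeClassifier(max_depth=2).fit(X_expl, black_box_pred)

fidelity = (surrogate.predict(X_expl) == black_box_pred).mean()
print(f"fidelity of sensitive-feature-free explanation: {fidelity:.2%}")
# High fidelity here does NOT mean the black box ignores the sensitive
# feature: the explanation is faithful on average, yet misleading.
```

The sketch hinges on the fact that fidelity is measured as agreement with the black box's predictions, so an explanation restricted to innocuous features can score well whenever correlated proxies exist; that gap is exactly what a misleading explanation exploits.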