DOI: 10.1145/3351095.3372873
research-article
Open Access

Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing

Published: 27 January 2020

ABSTRACT

Rising concern for the societal implications of artificial intelligence systems has inspired a wave of academic and journalistic literature in which deployed systems are audited for harm by investigators from outside the organizations deploying the algorithms. However, it remains challenging for practitioners to identify the harmful repercussions of their own systems prior to deployment, and, once deployed, emergent issues can become difficult or impossible to trace back to their source.

In this paper, we introduce a framework for algorithmic auditing that supports artificial intelligence system development end-to-end, to be applied throughout the internal organization development life-cycle. Each stage of the audit yields a set of documents that together form an overall audit report, drawing on an organization's values or principles to assess the fit of decisions made throughout the process. The proposed auditing framework is intended to contribute to closing the accountability gap in the development and deployment of large-scale artificial intelligence systems by embedding a robust process to ensure audit integrity.
