ABSTRACT
Rising concern for the societal implications of artificial intelligence systems has inspired a wave of academic and journalistic literature in which deployed systems are audited for harm by investigators from outside the organizations deploying the algorithms. However, it remains challenging for practitioners to identify the harmful repercussions of their own systems prior to deployment, and, once deployed, emergent issues can become difficult or impossible to trace back to their source.
In this paper, we introduce a framework for algorithmic auditing that supports artificial intelligence system development end-to-end and is intended to be applied throughout an organization's internal development lifecycle. Each stage of the audit yields a set of documents that together form an overall audit report, drawing on an organization's values or principles to assess the fit of decisions made throughout the process. The proposed auditing framework is intended to contribute to closing the accountability gap in the development and deployment of large-scale artificial intelligence systems by embedding a robust process to ensure audit integrity.
Index Terms
- Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing