ABSTRACT
Automated speaker recognition uses data processing to identify speakers by their voice. Today, automated speaker recognition is deployed on billions of smart devices and in services such as call centres. Despite its wide-scale deployment and known sources of bias in related domains such as face recognition and natural language processing, bias in automated speaker recognition has not been studied systematically. We present an in-depth empirical and analytical study of bias in the machine learning development workflow of speaker verification, a voice biometric and core task in automated speaker recognition. Drawing on an established framework for understanding sources of harm in machine learning, we show that bias exists at every development stage in the well-known VoxCeleb Speaker Recognition Challenge, including data generation, model building, and implementation. Female speakers and non-US nationalities are most affected and experience significant performance degradation. Leveraging the insights from our findings, we make practical recommendations for mitigating bias in automated speaker recognition and outline future research directions.