survey

Voice Disguise in Automatic Speaker Recognition

Author:
Mireia Farrús

Universitat Pompeu Fabra, Catalonia

ORCID: 0000-0002-7160-9513

ACM Computing Surveys, Volume 51, Issue 4, Article No. 68, pp. 1–22. https://doi.org/10.1145/3195832

Published: 06 July 2018

Abstract

Humans can identify other people’s voices even under voice disguise. However, we are not immune to all voice changes when identifying people by voice. Likewise, automatic speaker recognition systems can be deceived by voice imitation and other types of disguise. Based on a classification of voice disguise along two dimensions (deliberate/non-deliberate and electronic/non-electronic), this survey reviews the literature on how voice disguise affects the automatic speaker recognition task and how robust these systems are to such voice changes. The survey also covers existing applications that deal with voice disguise and discusses open issues for future research.

Index Terms

Voice Disguise in Automatic Speaker Recognition
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Speech recognition

Published in
ACM Computing Surveys Volume 51, Issue 4
July 2019
765 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3236632
Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering / University of Florida / Gainesville, FL 32611
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 July 2018
- Revised: 1 March 2018
- Accepted: 1 March 2018
- Received: 1 September 2016
Author Tags
Speaker recognition
channel degradation
robustness
voice conversion
voice disguise
voice imitation
Qualifiers
- survey
- Research
- Refereed
Article Metrics
- Total Citations: 14
- Total Downloads: 759
- Downloads (Last 12 months): 73
- Downloads (Last 6 weeks): 11