survey

Voice Disguise in Automatic Speaker Recognition

Published: 06 July 2018

Abstract

Humans can identify other people’s voices even under voice disguise. However, we are not immune to all voice changes when trying to identify people by voice. Likewise, automatic speaker recognition systems can be deceived by voice imitation and other types of disguise. Classifying voice disguise along two dimensions (deliberate/non-deliberate and electronic/non-electronic), this survey reviews the literature on how voice disguise affects automatic speaker recognition and how robust these systems are to such voice changes. The survey also addresses existing applications that deal with voice disguise and analyzes open issues for future research.

References

  1. Kanae Amino, Hisanori Makinae, and Toshiaki Kamada. 2018. Auditory discrimination of natural speech and synthetic speech used as voice disguise. Acoustic. Sci. Technol. 39, 1 (2018), 48--50.Google ScholarGoogle ScholarCross RefCross Ref
  2. Walter D. Andrews, Mary A. Kohler, Joseph P. Campbell, John J. Godfrey, and Jaime Hernández-Cordero. 2002. Gender-dependent phonetic refraction for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’02), Vol. 1. IEEE. 149--152.Google ScholarGoogle Scholar
  3. Bishner Saroop Atal. 1972. Automatic speaker recognition based on pitch contours. J. Acoustic. Soc. Amer. 52, 6B (Dec. 1972), 1687--1697.Google ScholarGoogle ScholarCross RefCross Ref
  4. Katarina Bartokva, David Le-Gac, Delphine Jauvet, and Denis Jouvet. 2002. Prosodic parameter for speaker identification. In Proceedings of the 7th International Conference on Spoken Language Processing. 1197--1200.Google ScholarGoogle Scholar
  5. Jacob Benesty, Shoji Makino, and Jingdong Chen (Eds.). 2005. Speech Enhancement. Springer.Google ScholarGoogle Scholar
  6. Richard H. Bolt, Franklin S. Cooper, Edward E. David Jr., Peter B. Denes, James M. Pickett, and Kenneth N. Stevens. 1969. Identification of a speaker by speech spectrograms. Science 166, 3903 (Oct. 1969), 338--342.Google ScholarGoogle ScholarCross RefCross Ref
  7. Markus Bruckl and Walter F. Sendlmeier. 2003. Aging female voices: An acoustic and perceptive analysis. In Proceedings of the Conference on Voice Quality (VOQUAL’03). 163--168.Google ScholarGoogle Scholar
  8. Janet E. Cahn. 1990. The generation of affect in synthesized speech. J. American Voice I/O Soc. 8 (1990), 1--9.Google ScholarGoogle Scholar
  9. Joseph P. Campbell. 1997. Speaker recognition: A tutorial. Proc. IEEE 85 (Sept. 1997), 1437--1462. Retrieved from http://ieeexplore.ieee.org/xpl/login.jsp?tp&equal;8arnumber&equal;628714.Google ScholarGoogle ScholarCross RefCross Ref
  10. Michael J. Carey, Eluned S. Parris, Harvey Lloyd-Thomas, and Stephen Bennett. 1996. Robust prosodic features for speaker identification. In Proceedings of the 4th International Conference on Spoken Language Processing. 800--1803.Google ScholarGoogle ScholarCross RefCross Ref
  11. Rolf Carlson, Bjorn Granstrom, and Lennart Nord. 1992. Experiments with emotive speech, acted utterances and synthesized replicas. Speech Commun. 11, 1 (March 1992), 347--355.Google ScholarGoogle Scholar
  12. Li Chen and Yingchun Yang. 2011. Applying emotional factor analysis and I-vector to emotional speaker recognition. In Proceedings of the 6th Chinese Conference on Biometric Recognition (CCBR’11) (Lecture Notes in Computer Science), Zhenan Sun, Jianhuang Lai, and Xilin Chen Tieniu Tan (Eds.). Springer, Berlin. 174--179. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai. 2014. Voice conversion using deep neural networks with layer-wise generative training. IEEE/ACM Trans. Audio, Speech Lang. Process. 22, 12 (Dec. 2014), 1859--1872. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Sharada V. Chougule and Mahesh S. Chavan. 2015. Robust spectral features for automatic speaker recognition in mismatch condition. In Proceedings of the 2nd International Symposium on Computer Vision and the Internet (VisionNet’15), Vol. 58. Elsevier. 272--279.Google ScholarGoogle Scholar
  15. Jessica Clark and Paul Foulkes. 2007. Identification of voices in electronically disguised speech. Int. J. Speech Lang. Law 14, 2 (Dec. 2007).Google ScholarGoogle Scholar
  16. Christophe d’Alessandro. 2006. Voice source parameters and prosodic analysis. In Language Context and Cognition. Methods in Empirical Prosody Research, Anita Steube (Ed.). Walter de Gruyter, Berlin/New York, 63--88.Google ScholarGoogle Scholar
  17. Steven B. Davis and Paul Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28, 4 (Aug. 1980), 357--366.Google ScholarGoogle ScholarCross RefCross Ref
  18. Najim Dehak. 2009. Discriminative and Generative Approaches for Long- and Short-Term Speaker Characteristics Modeling: Application to Speaker Verification. PhD Dissertation. École de Technologie Supérieure, Montréal, Canada. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Najim Dehak, Reda Dehak, J. Glass, Douglas Reynolds, and Patrick Kenny. 2010. Cosine similarity scoring without score normalization techniques. In Proceedings of the Speaker and Language Recognition Workshop (ODYSSEY’10). ISCA. 71--75.Google ScholarGoogle Scholar
  20. Véronique Delvaux, Lise Caucheteux, Kathy Huet, Myriam Piccaluga, and Bernard Harmegnies. 2017. Voice disguise vs. impersonation: Acoustic and perceptual measurements of vocal flexibility in non experts. Proceedings of the Interspeech Conference. 3777--3781.Google ScholarGoogle ScholarCross RefCross Ref
  21. George Doddington. 2001. Speaker recognition based on idiolectal differences between speakers. In Proceedings of the Eurospeech Conference, Vol. 4. 2521--2524.Google ScholarGoogle Scholar
  22. Helenca Duxans. 2006. Voice Conversion Applied to Text-to-Speech Systems. PhD Dissertation. Universitat Politècnica de Catalunya, Department od Signal Processing and Communications, Barcelona, Catalonia.Google ScholarGoogle Scholar
  23. Anders Eriksson and Par Wretling. 1997. How flexible is the human voice? - A case study of mimicry. In Proceedings of the Eurospeech Conference. ISCA. 1043--1046. Retrieved from http://www.ling.gu.se/∼anders/papers/a1008.pdf.Google ScholarGoogle Scholar
  24. Carol Y. Espy-Wilson, Sandeep Manocha, and Srikanth Vishnubhotla. 2006. A new set of features for text-independent speaker identification. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’06). 1475--1478. Retrieved from http://www.isr.umd.edu/Labs/SCL/publications/conference/espy_manocha_vish_icslp_06.pdf.Google ScholarGoogle ScholarCross RefCross Ref
  25. Gunnar Fant. 1960. Acoustic Theory of Speech Production: With Calculations Based on X-ray Studies of Russian Articulations. Mouton and Co., The Hague, Netherlands.Google ScholarGoogle Scholar
  26. Mireia Farrús. 2008. Fusing Prosodic and Acoustic Information for Speaker Recognition. PhD Dissertation. Universitat Politècnica de Catalunya, Barcelona, Catalonia.Google ScholarGoogle Scholar
  27. Mireia Farrús, Erik Eriksson, Kirk P. H. Sullivan, and Javier Hernando. 2006a. Dialect imitations in speaker recognition. In Proceedings of the European IAFL Conference on Forensic Linguistics, Language and the Law. 347--353.Google ScholarGoogle Scholar
  28. Mireia Farrús, Ainara Garde, Pascual Ejarque, Jordi Luque, and Javier Hernando. 2006b. On the fusion of prosody, voice spectrum and face features for multimodal person verification. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’06). 2106--2109.Google ScholarGoogle ScholarCross RefCross Ref
  29. Mireia Farrús and Javier Hernando. 2009. Using jitter and shimmer in speaker verification. IET Signal Process. 3, 4 (July 2009), 247--257.Google ScholarGoogle ScholarCross RefCross Ref
  30. Mireia Farrús, Javier Hernando, and Pascual Ejarque. 2007. Jitter and shimmer measurements for speaker recognition. In Proceedings of the 8th Annual Conference of the International Speech Communication Association.Google ScholarGoogle ScholarCross RefCross Ref
  31. Mireia Farrús, Michael Wagner, Jan Anguita, and Javier Hernando. 2008a. How vulnerable are prosodic features to professional imitators? In Proceedings of the Speaker and Language Recognition Workshop (ODYSSEY’08).Google ScholarGoogle Scholar
  32. Mireia Farrús, Michael Wagner, Jan Anguita, and Javier Hernando. 2008b. Robustness of prosodic features to voice imitation. In Proceedings of the Interspeech Conference.Google ScholarGoogle ScholarCross RefCross Ref
  33. Mireia Farrús, Michael Wagner, Daniel Erro, and Havier Hernando. 2010. Automatic speaker recognition as a measurement of voice imitation and conversion. Int. J. Speech Lang. Law 1, 17 (2010), 980--988.Google ScholarGoogle Scholar
  34. Carole T. Ferrand. 2002. Harmonics-to-noise ratio: An index of vocal aging. J. Voice 16, 4 (Dec. 2002), 480--487.Google ScholarGoogle ScholarCross RefCross Ref
  35. Mohamed Fezari, Fethi Amara, and Ibrahim M. M. El-Emary. 2014. Acoustic analysis for detection of voice disorders using adaptive features and classifiers. In Proceedings of the International Conference on Circuits, Systems and Control. 112--117.Google ScholarGoogle Scholar
  36. James L. Flanagan. 1972. Speech Analysis, Synthesis and Perception. Springer, Berlin.Google ScholarGoogle Scholar
  37. Corinne Fredouille, Gilles Pouchoulin, Jean-Franois Bonastre, Marion Azzarello, Antoine Giovanni, and Alain Ghio. 2005. Application of automatic speaker recognition techniques to pathological voice assessment (dysphonia). In Proceedings of the Interspeech Conference. ISCA, 149--152.Google ScholarGoogle ScholarCross RefCross Ref
  38. Marius Vasile Ghiurcau, Corneliu Rusu, and Jaakko Astola. 2011. A study of the effect of emotional state upon text-independent speaker identification. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’11). IEEE. 4944--4947.Google ScholarGoogle ScholarCross RefCross Ref
  39. Herbert Gish and Michael Schmidt. 1994. Text-independent speaker identification. IEEE Signal Process. Mag. 11, 4 (Oct. 1994), 18--32. Retrieved from http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp&equal;8arnumber&equal;317924.Google ScholarGoogle ScholarCross RefCross Ref
  40. Christer Gobl and Ailbhe Ní Chasaide. 2003. The role of voice quality in communicating emotion, mood and attitude. Speech Commun. 40, 1--2 (April 2003), 189--212. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Rosa González-Hautamäki, Tomi Kinnunen, Ville Hautamäki, and Anne-Maria Laukkanen. 2015. Automatic versus human speaker verification: The case of voice mimicry. Speech Commun. 72 (May 2015), 13--31.Google ScholarGoogle Scholar
  42. Rosa González-Hautamäki, Tomi Kinnunen, Ville Hautamäki, Timo Leino, and Anne-Maria Laukkanen. 2013. I-vectors meet imitators: On vulnerability of speaker verification systems against voice mimicry. In Proceedings of the Interspeech Conference. ISCA. 930--934.Google ScholarGoogle Scholar
  43. Rosa González-Hautamäki, Md Sahidullah, Ville Hautamäki, and Tomi Kinnunen. 2017. Acoustical and perceptual study of voice disguise by age modification in speaker verification. Speech Commun. 95 (2017), 1--15. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Nate Halloran. 2003. The Acquisition of a Stage Dialect. Master’s thesis. Portland State University, Portland, OR.Google ScholarGoogle Scholar
  45. David E. Hartman. 1979. The perceptual identity and characteristics of aging in normal male adult speakers. J. Commun. Disord. 12, 1 (Feb. 1979), 53--61.Google ScholarGoogle ScholarCross RefCross Ref
  46. Hynek Hermansky. 1990. Perceptual linear predictive (PLP) analysis of speech. J. Acoustic. Soc. Amer. 87, 4 (Aug. 1990), 1738--1752.Google ScholarGoogle ScholarCross RefCross Ref
  47. Harry Hollien, Gea DeJong, Camilo A. Martin, R. Schwartz, and Kristen Liljegren. 2001a. Effects of ethanol intoxication on speech suprasegmentals. J. Acoustic. Soc. Amer. 110, 6 (Dec. 2001), 3198--206.Google ScholarGoogle ScholarCross RefCross Ref
  48. Harry Hollien, Kristen Liljegren, Camilo A. Martin, and Gea DeJong. 2001b. Production of intoxication states by actorsacoustic and temporal characteristics. J. Forensic Sci. 46, 1 (Feb. 2001), 68--73.Google ScholarGoogle ScholarCross RefCross Ref
  49. John Paul Hosom, Alexander B. Kain, Taniya Mishra, Jan P. H. Van Santen, Melanie Fried-Oken, and Janice Staehely. 2003. Intelligibility of modifications to dysarthric speech. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’03), Vol. 1. IEEE. 924--927.Google ScholarGoogle ScholarCross RefCross Ref
  50. Mark Huckvale and Anne-Linn Kristiansen. 2012. Effectiveness of electronic voice disguise between friends. In Proceedings of the 46th International Conference: Audio Forensics. Retrieved from http://www.aes.org/e-lib/browse.cfm?elib&equal;16337.Google ScholarGoogle Scholar
  51. Tom Johnstone. 2001. The Effect of Emotion on Voice Production and Speech Acoustics. PhD Dissertation. University of Western Australia, Psychology Department, Perth, Australia.Google ScholarGoogle Scholar
  52. Tom Johnstone and Klaus R. Scherer. 1999. The effects of emotions on voice quality. In Proceedings of the 14th International Conference of Phonetic Sciences. 2029--2032. Retrieved from http://www.keck.waisman.wisc.edu/∼tjohnstone/0602.pdf.Google ScholarGoogle Scholar
  53. Sachin S. Kajarekar, Harry Bratt, Elizabeth Shriberg, and Rafael De León. 2006. A study of intentional voice modifications for evading automatic speaker recognition. In Proceedings of the the Speaker and Language Recognition Workshop (ODYSSEY’06). ISCA. 1--6.Google ScholarGoogle ScholarCross RefCross Ref
  54. Ahilan Kanagasundaram, Robbie Vogt, David Dean, and Michael Mason. 2011. i-vector based speaker recognition on short utterances. In Proceedings of the Interspeech Conference. ISCA. 2341--2344.Google ScholarGoogle ScholarCross RefCross Ref
  55. Harleen Kaur. 2017. Speaker Identification of Disguised Voices Using MFCC Statistical Moment And SVM Classifier. Ph.D. Dissertation. Thapar Institute of Engineering 8 Technology, Patiala, India.Google ScholarGoogle Scholar
  56. Finnian Kelly, Rahim Saeidi, Naomi Harte, and David van Leeuwen. 2014. Effect of long-term ageing on i-vector speaker verification. In Proceedings of the Interspeech Conference. International Speech Communication Association. 86--90. Retrieved from http://www.mee.tcd.ie/∼sigmedia/pmwiki/uploads/Main.Publications/finnian_interspeech14.pdf.Google ScholarGoogle ScholarCross RefCross Ref
  57. Patrick Kenny, Pierre Ouellet, Najim Dehak, Vishwa Gupta, and Pierre Dumouchel. 2008. A study of inter-speaker variability in speaker verification. IEEE Trans. Audio, Speech, Lang. Process. 16, 5 (2008), 980--988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Lawrence G. Kersta. 1962. Voiceprint identification. Nature 4861 (Dec. 1962), 1253--1257.Google ScholarGoogle Scholar
  59. Tomi Kinnunen and Paavo Alku. 2009. On separating glottal source and vocal tract information in telephony speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE. 4545--4548. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Tatsuya Kitamura. 2008. Acoustic analysis of imitated voice produced by a professional impersonator. In Proceedings of the Interspeech Conference. ISCA. 813--816.Google ScholarGoogle ScholarCross RefCross Ref
  61. Fritz Klingholz, R. Penning, and E. Liebhardt. 1988. Recognition of low-level alcohol intoxication from speech signal. J. Acoustic. Soc. Amer. 84, 3 (Sept. 1988), 929--935.Google ScholarGoogle ScholarCross RefCross Ref
  62. Hisayoshi Kojima, Wilbur J. Gould, Anthony Lambiase, and Nobuhiko Isshiki. 1982. Computer analysis of hoarseness. Acta Oto-laryngologica 89, 3--6 (Jan. 1982), 547--554.Google ScholarGoogle Scholar
  63. Jody Kreiman and Bruce R. Gerratt. 2005. Perception of aperiodicity in pathological voice. J. Acoustic. Soc. Amer. 117, 4 (May 2005), 2201--2211. http://www.ncbi.nlm.nih.gov/pubmed/15898661Google ScholarGoogle ScholarCross RefCross Ref
  64. Hermann J. Künzel. 2000. Effects of voice disguise on speaking fundamental frequency. Forensic Linguist. 7, 2 (Dec. 2000), 149--179.Google ScholarGoogle Scholar
  65. Hermann J. Künzel, Joaquín González-Rodríguez, and Javier Ortega-García. 2004. Effect of voice disguise on the performance of a forensic automatic speaker recognition system. In Proceedings of the the Speaker and Language Recognition Workshop (ODYSSEY’04). ISCA. 153--156. Retrieved from http://www.isca-speech.org/archive_open/odyssey_04/ody4_153.html.Google ScholarGoogle Scholar
  66. Yee W. Lau, Dat Tran, and Michael Wagner. 2004. Vulnerability of speaker verification to voice mimicking. In Proceedings of the International Symposium on Intelligent Multimedia, Video and Speech Processing. 145--148.Google ScholarGoogle Scholar
  67. Yee W. Lau, Dat Tran, and Michael Wagner. 2005. Testing voice mimicry with the YOHO speaker verification corpus. In Proceedings of the International Conference on Knowledge-Based Intelligent Information and Engineering Systems (Lecture Notes in Computer Science), Vol. 3684. Springer. 15--20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. John Laver. 1994. Principles of Phonetics. Cambridge University Press, Cambridge.Google ScholarGoogle Scholar
  69. Xi Li, Jidong Tao, Michael T. Johnson, Joseph Soltis, Anne Savage, Kirsten M. Leong, and John D. Newman. 2005. Stress and emotion classification using jitter and shimmer features. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’05), Vol. 4. 1081--1084.Google ScholarGoogle Scholar
  70. Johan Lindberg and Mats Blomberg. 1999. Vulnerability in speaker verification. A study of technical impostor techniques. In Proceedings of the Eurospeech Conference. 1211--1214.Google ScholarGoogle Scholar
  71. Sue Ellen Linville. 2001. Vocal Aging. Singular Publishing Group, San Diego.Google ScholarGoogle Scholar
  72. Robert C. Lummis and Aaron E. Rosenberg. 1972. Test of an automatic speaker verification method with intensively trained professional mimics. J. Acoustic. Soc. Amer. 51, 131 (Jan. 1972).Google ScholarGoogle ScholarCross RefCross Ref
  73. Evangeline Machlin. 1975. Dialects for the Stage. Routledge/Theater Arts, New York.Google ScholarGoogle Scholar
  74. John Makhoul. 1975. Linear prediction: A tutorial review. Proc. IEEE 53, 4 (April 1975), 561--580.Google ScholarGoogle Scholar
  75. Duncan Markham. 1997. Phonetic Imitation, Accent, and the Learner. PhD Dissertation. Lund University, Lund, Sweden.Google ScholarGoogle Scholar
  76. Judith A. Markowitz. 1996. Using Speech Recognition. Prentice Hall PTR, Upper Saddle River, N.J.Google ScholarGoogle Scholar
  77. Judith A. Markowitz. 2007. The many roles of speaker classification in speaker verification and identification. In Speaker Classification I, Christian Mueller (Ed.). Springer, Berlin. 218--225. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Mikiko Mashimo, Tomoki Toda, Hiromichi Kawanami, Kiyohiro Shikano, and Nick Campbell. 2002. Evaluation of cross-language voice conversion using bilingual and non-bilingual databases. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’02). 293--296.Google ScholarGoogle Scholar
  79. Mikiko Mashimo, Tomoki Toda, Kiyohiro Shikano, and Nick Campbell. 2001. Evaluation of cross-language voice conversion based on GMM and STRAIGHT. In Proceedings of the Eurospeech Conference. 361--364.Google ScholarGoogle Scholar
  80. Takashi Masuko, Keiichi Tokuda, and Takao Kobayashi. 2000. Imposture using synthetic speech against speaker verification based on spectrum and pitch. In Proceedings of the International Conference on Spoken Language Processing (ICSLP’00), Vol. 2. 302--305.Google ScholarGoogle Scholar
  81. Driss Matrouf, Jean-François Bonastre, and Corinne Fredouille. 2006. Effect of speech transformation on impostor acceptance. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’06), Vol. 1. 933--936.Google ScholarGoogle ScholarCross RefCross Ref
  82. Yuri Matveev. 2013. The problem of voice template aging in speaker recognition systems. In Proceedings of the 15th International Conference on Speech and Computer (SPECOM’13), Miloš Železný, Ivan Habernal, and Andrey Ronzhin (Eds.). Lecture Notes in Computer Science, Vol. 8113. Springer International Publishing. 345--353. Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. Florian Metze, Jitendra Ajmera, Roman Englert, Udo Bub, Felix Burkhardt, Joachim Stegmann, Christian Muller, Richard Huber, Bernt Andrassy, Josef G. Bauer, and Bernhard Littel. 2007. Comparison of four approaches to age and gender recognition for telephone applications. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’07), Vol. 4. IEEE, 1089--1092. Retrieved from http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?arnumber&equal;4218294.Google ScholarGoogle ScholarCross RefCross Ref
  84. Dirk Michaelis, Matthias Frohlich, Hans Werner Strube, Eberhard Kruse, Brad Story, and Ingo R. Titze. 1998. Some simulations concerning jitter and shimmer measurement. In Proceedings of the International Workshop on Advances in Quantitative Laryngoscopy. 744--754. Retrieved from http://www.dpi.physik.uni-goettingen.de/∼micha/aachen98/aachen98.html.Google ScholarGoogle Scholar
  85. Seyed Hamidreza Mohammadi and Alexander Kain. 2014. Voice conversion using deep neural networks with speaker-independent pre-training. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT’14). IEEE, 19--23.Google ScholarGoogle ScholarCross RefCross Ref
  86. Iain R. Murray and John L. Arnott. 1993. Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. J. Acoustic. Soc. Amer. 93, 2 (Feb. 1993), 1097--1108.Google ScholarGoogle ScholarCross RefCross Ref
  87. Toru Nakashika, Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki. 2013. Voice conversion in high-order eigen space using deep belief nets. In Proceedings of the Interspeech Conference. 369--372.Google ScholarGoogle ScholarCross RefCross Ref
  88. M. Laxmi Narayana and Sunil Kumar Kopparapu. 2009a. Effect of noise-in-speech on MFCC parameters. In Proceedings of the 9th WSEAS International Conference on Signal, Speech and Image Processing, and 9th WSEAS International Conference on Multimedia, Internet and Video Technologies. ACM. 39--43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. M. Laxmi Narayana and Sunil Kumar Kopparapu. 2009b. On the use of stress information in speech for speaker recognition. In Proceedings of the IEEE Region 10 Conference (TENCON’09). IEEE. 1--4.Google ScholarGoogle ScholarCross RefCross Ref
  90. Barbara Peskin, Jiri Navrátil, Joy Abramson, Doug Jones, David Klusáček, Douglas A. Reynolds, and Bing Xiang. 2003. Using prosodic and conversational features for high-performance speaker recognition: report from JHU WS02. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’03), Vol. 4. IEEE. 792--795.Google ScholarGoogle ScholarCross RefCross Ref
  91. Jeff Pittam. 1994. Voice in Social Interaction; An Interdisciplinary Approach. SAGE Publications, Thousand Oaks.Google ScholarGoogle Scholar
  92. Manfred Putzer and Jacques Koreman. 1997. A german database of patterns for vocal fold vibration. Phonus 3, Institute of Phonetics, University of Saarland. 143--153.Google ScholarGoogle Scholar
  93. Lawrence Rabiner and Biing-Hwang Juang. 1993. Fundamentals of Speech Recognition. Prentice Hall, Inc., Englewood Cliffs, NJ. Google ScholarGoogle ScholarDigital LibraryDigital Library
  94. Alan R. Reich. 1981. Detecting the presence of vocal disguise in the male voice. J. Acoustic. Soc. Amer. 69, 5 (July 1981), 1458--1460.Google ScholarGoogle ScholarCross RefCross Ref
  95. Douglas A. Reynolds, Walter D. Andrews, Joseph Campbell, Jiri Navrátil, Barbara Peskin, André Adami, Qin Jin, David Klusáček, Joy Abramson, Radu Mihaescu, Jack Godfrey, Doug Jones, and Bing Xiang. 2002. Exploiting High-level Information for High-performance Speaker Recognition. SuperSID Project Final Report. MIT Lincoln Laboratory, US Department of Defense, IBM, International Computer Science Institute, Oregon Graduate Institute, Carnegie Mellon University, Charles University, York University, Princeton University, Cornell University, Baltimore, MD.Google ScholarGoogle Scholar
  96. Douglas A. Reynolds, Walter D. Andrews, Joseph Campbell, Jiri Navrátil, Barbara Peskin, André Adami, Qin Jin, David Klusáček, Joy Abramson, Radu Mihaescu, Jack Godfrey, Doug Jones, and Bing Xiang. 2003. The superSID project: Exploiting high-level information for high-accuracy speaker recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’03), Vol. 4. IEEE. 784--787.Google ScholarGoogle ScholarCross RefCross Ref
  97. Douglas A. Reynolds, Marc A. Zissman, Thomas F. Quatieri, and Gerald C. OLeary. 1995. The effects of telephone transmission degradations on speaker recognition performance. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’95). IEEE. 329--332.Google ScholarGoogle Scholar
  98. Robert D. Rodman. 1998. Speaker recognition of disguised voices: A program for research. In Proceedings of the Consortium on Speech Technology Conference on Speaker Recognition by Man and Machine: Directions for Forensic Applications. 9--22.Google ScholarGoogle Scholar
  99. Robert D. Rodman and Michael S. Powell. 2000. Computer recognition of speakers who disguise their voice. In Proceedings of the International Conference on Signal Processing Applications 8 Technology.Google ScholarGoogle Scholar
  100. William J. Ryan and Kenneth W. Burk. 1974. Perceptual and acoustic correlates of aging in the speech of males. J. Commun. Disord. 7, 2 (June 1974), 181--192.Google ScholarGoogle ScholarCross RefCross Ref
  101. Nicolas Scheffer, jean François Bonastre, Alain Ghio, and Bernard Teston. 2001. Gémellité et reconnaissance automatique du locuteur. In Proceedings of the 25th Journées d’Etude sur la Parole (Lecture Notes in Computer Science). 445--448. Retrieved from https://hal.archives-ouvertes.fr/hal-00134198.Google ScholarGoogle Scholar
  102. Klaus R. Scherer. 1986. Vocal affect expression: A review and a model for future research. Psychol. Bull. 99, 2 (March 1986), 143--65. Retrieved from http://www.affective-sciences.org/system/files/biblio/1986_Scherer_PsyBull.pdf.Google ScholarGoogle ScholarCross RefCross Ref
  103. Klaus R. Scherer, Robert D. Ladd, and Kim E. A. Silverman. 1984. Vocal cues to speaker affect: Testing two models. J. Acoustic. Soc. Amer. 76, 5 (June 1984), 1346--1356. Retrieved from http://www.affective-sciences.org/system/files/biblio/1984_Scherer_JASA.pdf.Google ScholarGoogle ScholarCross RefCross Ref
  104. Astrid Schmidt-Nielsen and Thomas H. Crystal. 2000. Speaker verification by human listeners: Experiments comparing human and machine performance using the NIST 1998 speaker evaluation data. Digital Signal Process. 10, 1--3 (Jan. 2000), 249--266. Google ScholarGoogle ScholarDigital LibraryDigital Library
  105. Susanne Schoetz. 2007. Acoustic analysis of adult speaker age. In Speaker Classification I, Christian Mueller (Ed.). Vol. 4343. Springer, Berlin. 88--107. Google ScholarGoogle ScholarDigital LibraryDigital Library
  106. Stephen Shum, Najim Dehak, Reda Dehak, and James R. Glass. 2010. Unsupervised speaker adaptation based on the cosine similarity for text-independent speaker verification. In Proceedings of the the Speaker and Language Recognition Workshop (ODYSSEY’10). ISCA. 76--82.Google ScholarGoogle Scholar
  107. Roger W. Shuy. 1990. Dialect as evidence in law cases. J. English Linguist. 23, 1 (April 1990), 195--208.Google ScholarGoogle ScholarCross RefCross Ref
  108. Milan Sigmund. 2008. Automatic speaker recognition by speech signal. In Frontiers in Robotics, Automation and Control, Alexander Zemliak (Ed.). InTech.Google ScholarGoogle Scholar
  109. Kemal Sonmez, Elizabeth Shriberg, Larry P. Heck, and Elizabeth Weintraub. 1998. Modeling dynamic prosodic variation for speaker verification. In Proceedings of the 5th International Conference on Spoken Language Processing, Vol. 7. 3189--3192.Google ScholarGoogle Scholar
  110. Kenneth N. Stevens, Carl E. Williams, Jaime R. Carbonell, and Barbara Woods. 1968. Speaker authentication and identification: A comparison of spectrographic and auditory presentations of speech material. J. Acoustic. Soc. Amer. 44, 6 (Dec. 1968), 1596--1607.Google ScholarGoogle ScholarCross RefCross Ref
  111. Lucian Sulica. 2011. Hoarseness. Arch. Otolaryngol. Head Neck Surg. 137, 6 (June 2011), 616--619.Google ScholarGoogle ScholarCross RefCross Ref
  112. Kirk P. H. Sullivan and Jason Pelecanos. 2001. Revisiting carl bildts impostor: Would a speaker verification system foil him? In Proceedings of the International Conference on Audio- and Video-Based Biometric Person Authentication (Lecture Notes in Computer Science), Vol. 2091. Springer. 144--149. Google ScholarGoogle ScholarDigital LibraryDigital Library
  113. Bradford L. Swartz. 1992. Resistance of voice onset time variability to intoxication. Percept. Motor Skills 75, 2 (Oct. 1992), 415--424.Google ScholarGoogle ScholarCross RefCross Ref
  114. Tiejun Tan. 2010. The effect of voice disguise on automatic speaker recognition. In Proceedings of the 3rd International Congress on Image and Signal Processing (CISP’10). IEEE. 3538--3541.Google ScholarGoogle ScholarCross RefCross Ref
  115. Shahrukh K. Taseer. 2005. Speaker identification for speakers with deliberately disguised voices using glottal pulse information. In Proceedings of the Pakistan Section Multitopic Conference. IEEE. 1--5.Google ScholarGoogle ScholarCross RefCross Ref
  116. Oscar Tosi, Herbert Oyer, William Lashbrook, Charles Pedrey, Julie Nicol, and Ernest Nash. 1972. Experiment on voice identification. J. Acoustic. Soc. Amer. 51, 6B (June 1972), 2030--2043.Google ScholarGoogle Scholar
  117. Renetta Garrison Tull and Janet C. Rutledge. 1996. Automatic speaker recognition based on pitch contours. Proceedings of the Acoustical Society of America 131st Meeting—Lay Language Papers.Google ScholarGoogle Scholar
  118. Lior Uzan and Lior Wolf. 2015. I know that voice: Identifying the voice actor behind the voice. In Proceedings of the International Conference on Biometrics (ICB’15). IEEE, 46--51.Google ScholarGoogle ScholarCross RefCross Ref
  119. Ratree Wayland, Scott Gargash, and Allard Longman. 1995. Acoustic and perceptual investigation of breathy voice. J. Acoustic. Soc. Amer. 97, 5 (May 1995), 3364.Google ScholarGoogle ScholarCross RefCross Ref
  120. Frederik Weber, Linda Manganaro, Barbara Peskin, and Elizabeth Shriberg. 2002. Using prosodic and lexical information for speaker identification. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP’02), Vol. 1. IEEE. 141--144.
  121. Carl E. Williams and Kenneth N. Stevens. 1972. Emotions and speech: Some acoustical correlates. J. Acoustic. Soc. Amer. 52, 4B (March 1972), 1238--1250. http://www.ohio.edu/people/leec1/documents/sociophobia/williams_stevens_1972.pdf.
  122. Frank Wittig and Christian Mueller. 2003. Implicit feedback for user-adaptive systems by analyzing the user’s speech. In Proceedings of the Workshop on Adaptivität und Benutzermodellierung in interaktiven Softwaresystemen (ABIS’03).
  123. Tian Wu, Yingchun Yang, Zhaohui Wu, and Dongdong Li. 2006. MASC: A speech corpus in mandarin for emotion analysis and affective speaker recognition. In Proceedings of the Speaker and Language Recognition Workshop (ODYSSEY’06). IEEE. 1--5.
  124. Zhizheng Wu and Haizhou Li. 2013. Voice conversion and spoofing attack on speaker verification systems. In Proceedings of the Signal and Information Processing Association Annual Summit and Conference (APSIPA’13). IEEE. 1--9.
  125. Naoaki Yanagihara. 1967. Significance of harmonic changes and noise components in hoarseness. J. Amer. Speech-Lang.-Hear. Assoc. 10 (Sept. 1967), 531--541.
  126. Eiji Yumoto. 1988. Quantitative assessment of the degree of hoarseness. J. Voice 1, 4 (Jan. 1988), 310--313.
  127. Eiji Yumoto, Wilbur J. Gould, and Thomas Baer. 1982. Harmonics-to-noise ratio as an index of the degree of hoarseness. J. Acoustic. Soc. Amer. 71, 6 (June 1982), 1544--1549. http://www.ncbi.nlm.nih.gov/pubmed/7108029
  128. Elisabeth Zetterholm. 2003. Voice Imitation: A Phonetic Study of Perceptual Illusions and Acoustic Success. PhD Dissertation. Lund University, Lund, Sweden.
  129. Elisabeth Zetterholm. 2006. Same speaker—Different voices. A study of one impersonator and some of his different imitations. In Proceedings of the 11th Australian International Conference on Speech Science and Technology. 70--75.
  130. Elisabeth Zetterholm, Daniel Elenius, and Mats Blomberg. 2004. A comparison between human perception and a speaker verification system score of a voice imitation. In Proceedings of the 10th Australian International Conference on Speech Science and Technology. Australian Speech Science and Technology Association, Sydney, Australia. 393--397. Retrieved from https://lup.lub.lu.se/search/publication/52907e52-0553-4228-a120-addc5e1f9d24.
  131. Cuiling Zhang and Bin Lin. 2017. Acoustic analysis of whispery voice disguise in Chinese. J. Acoustic. Soc. Amer. 141, 5 (2017), 3982--3982.
  132. Cuiling Zhang and Tiejun Tan. 2008. Voice disguise and automatic speaker recognition. Forensic Sci. Int. 175, 2--3 (April 2008), 118--122.
  133. Sue Anne Zollinger and Henrik Brumm. 2011. The Lombard effect. Curr. Biol. 21, 16 (Aug. 2011), R614--R615.

    • Published in

      ACM Computing Surveys  Volume 51, Issue 4
      July 2019
      765 pages
      ISSN:0360-0300
      EISSN:1557-7341
      DOI:10.1145/3236632
      • Editor:
      • Sartaj Sahni

      Copyright © 2018 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 6 July 2018
      • Revised: 1 March 2018
      • Accepted: 1 March 2018
      • Received: 1 September 2016

      Qualifiers

      • survey
      • Research
      • Refereed
