Abstract
Humans can identify other people’s voices even under voice disguise conditions. However, we are not immune to all voice changes when trying to identify people by voice. Likewise, automatic speaker recognition systems can be deceived by voice imitation and other types of disguise. Classifying voice disguise along two dimensions (deliberate/non-deliberate and electronic/non-electronic), this survey reviews the literature on the influence of voice disguise on the automatic speaker recognition task and on the robustness of these systems to such voice changes. Additionally, the survey addresses existing applications dealing with voice disguise and analyzes some issues for future research.