DOI: 10.1145/3290605.3300828
research-article

Voice Presentation Attack Detection through Text-Converted Voice Command Analysis

Published: 02 May 2019

ABSTRACT

Voice assistants are quickly being upgraded to support advanced, security-critical commands such as unlocking devices, checking emails, and making payments. In this paper, we explore the feasibility of using users' text-converted voice command utterances as classification features to help identify users' genuine commands, and detect suspicious commands. To maintain high detection accuracy, our approach starts with a globally trained attack detection model (immediately available for new users), and gradually switches to a user-specific model tailored to the utterance patterns of a target user. To evaluate accuracy, we used a real-world voice assistant dataset consisting of about 34.6 million voice commands collected from 2.6 million users. Our evaluation results show that this approach is capable of achieving about 3.4% equal error rate (EER), detecting 95.7% of attacks when an optimal threshold value is used. As for those who frequently use security-critical (attack-like) commands, we still achieve EER below 5%.
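
To make the equal error rate (EER) figure above concrete, the following minimal sketch shows one common way to estimate EER from classifier scores. It is not the authors' evaluation code: the score convention (a higher score means a more attack-like command), the function names, and the toy scores are all assumptions made for illustration.

```python
from typing import List, Tuple


def error_rates(genuine: List[float], attack: List[float],
                threshold: float) -> Tuple[float, float]:
    """Return (FAR, FRR) for one threshold.

    Assumed convention: a command is flagged as an attack when its
    score is >= threshold.
    FAR = fraction of attack commands that are NOT flagged (accepted).
    FRR = fraction of genuine commands that ARE flagged (rejected).
    """
    if not genuine or not attack:
        raise ValueError("both score lists must be non-empty")
    far = sum(1 for s in attack if s < threshold) / len(attack)
    frr = sum(1 for s in genuine if s >= threshold) / len(genuine)
    return far, frr


def equal_error_rate(genuine: List[float],
                     attack: List[float]) -> Tuple[float, float]:
    """Sweep every observed score as a threshold and return (EER, threshold).

    The EER is approximated at the threshold where FAR and FRR are
    closest, as the mean of the two rates at that point.
    """
    candidates = sorted(set(genuine) | set(attack)) + [float("inf")]
    best_gap = None
    best_eer = 0.0
    best_threshold = candidates[0]
    for t in candidates:
        far, frr = error_rates(genuine, attack, t)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2.0, t
    return best_eer, best_threshold


if __name__ == "__main__":
    # Hypothetical scores, not data from the paper.
    genuine_scores = [0.1, 0.2, 0.3, 0.6]
    attack_scores = [0.4, 0.7, 0.8, 0.9]
    eer, thr = equal_error_rate(genuine_scores, attack_scores)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

On a finite sample FAR and FRR rarely coincide exactly, so the sketch reports the mean of the two rates at the threshold where they are closest; interpolating between adjacent thresholds is a common alternative.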

Published in

CHI '19: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems
May 2019
9077 pages
ISBN: 9781450359702
DOI: 10.1145/3290605

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

CHI '19 Paper Acceptance Rate: 703 of 2,958 submissions, 24%
Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%
