Voice Presentation Attack Detection through Text-Converted Voice Command Analysis

Authors:
Il-Youp Kwak, Samsung Research, Seoul, Republic of Korea
Jun Ho Huh, Samsung Research, Seoul, Republic of Korea
Seung Taek Han, Samsung Research, Seoul, Republic of Korea
Iljoo Kim, Samsung Research, Seoul, Republic of Korea
Jiwon Yoon, Korea University, Seoul, Republic of Korea

CHI '19: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, May 2019, Paper No.: 598, Pages 1–12, https://doi.org/10.1145/3290605.3300828

Published: 02 May 2019

ABSTRACT

Voice assistants are quickly being upgraded to support advanced, security-critical commands such as unlocking devices, checking emails, and making payments. In this paper, we explore the feasibility of using users' text-converted voice command utterances as classification features to help identify users' genuine commands, and detect suspicious commands. To maintain high detection accuracy, our approach starts with a globally trained attack detection model (immediately available for new users), and gradually switches to a user-specific model tailored to the utterance patterns of a target user. To evaluate accuracy, we used a real-world voice assistant dataset consisting of about 34.6 million voice commands collected from 2.6 million users. Our evaluation results show that this approach is capable of achieving about 3.4% equal error rate (EER), detecting 95.7% of attacks when an optimal threshold value is used. As for those who frequently use security-critical (attack-like) commands, we still achieve EER below 5%.
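
The two ideas in the abstract, the equal error rate (EER) and the gradual move from a global model to a user-specific one, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the switch is made, so the linear ramp, the function name `blended_score`, and the parameter `ramp_commands` are purely illustrative assumptions, as is the convention that a higher score means "more attack-like".

```python
import numpy as np


def equal_error_rate(genuine_scores, attack_scores):
    """Return (eer, threshold) for scores where higher means more attack-like.

    A command is flagged as an attack when its score >= threshold.
    The EER is taken at the threshold where the false alarm rate
    (genuine commands flagged) is closest to the miss rate
    (attacks not flagged).
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    attack = np.asarray(attack_scores, dtype=float)
    # Every observed score is a candidate threshold.
    thresholds = np.unique(np.concatenate([genuine, attack]))

    best_gap, best_eer, best_thr = np.inf, None, None
    for thr in thresholds:
        false_alarm = np.mean(genuine >= thr)  # genuine flagged as attack
        miss = np.mean(attack < thr)           # attack accepted as genuine
        gap = abs(false_alarm - miss)
        if gap < best_gap:
            best_gap = gap
            best_eer = (false_alarm + miss) / 2.0
            best_thr = float(thr)
    return float(best_eer), best_thr


def blended_score(global_score, user_score, n_user_commands, ramp_commands=50):
    """Hypothetical blend of a global and a user-specific model score.

    Assumption (not from the paper): the weight on the user-specific
    model grows linearly with the number of commands observed for the
    user and reaches 1.0 after `ramp_commands` commands.
    """
    weight = min(n_user_commands / ramp_commands, 1.0)
    return (1.0 - weight) * global_score + weight * user_score


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    genuine = [0.1, 0.2, 0.3, 0.4]
    attack = [0.35, 0.6, 0.7, 0.8]
    eer, thr = equal_error_rate(genuine, attack)
    print(f"EER = {eer:.2f} at threshold {thr:.2f}")
    # A new user relies on the global model; a long-time user on their own.
    print(blended_score(0.2, 0.8, n_user_commands=0))
    print(blended_score(0.2, 0.8, n_user_commands=50))
```

In this sketch a new user with no history is scored entirely by the global model, which matches the abstract's point that the global model is immediately available; any other schedule (for example a step change or a confidence-based weight) would fit the abstract equally well.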

Index Terms

1. Security and privacy
  1. Human and societal aspects of security and privacy
    1. Usability in security and privacy
  2. Intrusion/anomaly detection and malware mitigation

Published in
CHI '19: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems
May 2019
9077 pages
ISBN: 9781450359702
DOI: 10.1145/3290605
General Chairs:
Stephen Brewster, University of Glasgow, Scotland, UK
Geraldine Fitzpatrick, TU Wien, Austria
Program Chairs:
Anna Cox, University College London, UK
Vassilis Kostakos, University of Melbourne, Australia
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
attack detection
voice assistant security
voice command analysis
Qualifiers
- research-article
Acceptance Rates
CHI '19 Paper Acceptance Rate: 703 of 2,958 submissions, 24%
Overall Acceptance Rate: 6,199 of 26,314 submissions, 24%