ABSTRACT
Voice assistants are quickly being upgraded to support advanced, security-critical commands such as unlocking devices, checking emails, and making payments. In this paper, we explore the feasibility of using users' text-converted voice command utterances as classification features to help identify users' genuine commands and detect suspicious commands. To maintain high detection accuracy, our approach starts with a globally trained attack detection model (immediately available for new users), and gradually switches to a user-specific model tailored to the utterance patterns of a target user. To evaluate accuracy, we used a real-world voice assistant dataset consisting of about 34.6 million voice commands collected from 2.6 million users. Our evaluation results show that this approach achieves an equal error rate (EER) of about 3.4%, detecting 95.7% of attacks when an optimal threshold value is used. Even for users who frequently issue security-critical (attack-like) commands, we still achieve an EER below 5%.
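To make the reported metric concrete, the following is a minimal sketch of how an equal error rate can be estimated from detector scores. It is not the authors' evaluation code. It assumes each command receives a suspicion score where higher means more attack-like, and that a command is flagged when its score meets a threshold. The function names (`error_rates`, `equal_error_rate`) and the toy scores are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def error_rates(genuine: List[float], attack: List[float],
                threshold: float) -> Tuple[float, float]:
    """Return (FAR, FRR) for one threshold.

    Assumption: a command is flagged as an attack when score >= threshold.
    FAR = fraction of attack commands that are NOT flagged (accepted).
    FRR = fraction of genuine commands that ARE flagged (rejected).
    """
    far = sum(1 for s in attack if s < threshold) / len(attack)
    frr = sum(1 for s in genuine if s >= threshold) / len(genuine)
    return far, frr


def equal_error_rate(genuine: List[float],
                     attack: List[float]) -> Tuple[float, float]:
    """Sweep every observed score as a candidate threshold and return
    (EER estimate, threshold) at the point where FAR and FRR are closest.
    The EER estimate is the mean of FAR and FRR at that point.
    """
    best_gap = None
    best_eer = 0.0
    best_threshold = 0.0
    for t in sorted(set(genuine) | set(attack)):
        far, frr = error_rates(genuine, attack, t)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2, t
    return best_eer, best_threshold


if __name__ == "__main__":
    # Made-up scores for illustration only; not data from the paper.
    genuine_scores = [0.1, 0.2, 0.3, 0.6]
    attack_scores = [0.4, 0.7, 0.8, 0.9]
    eer, threshold = equal_error_rate(genuine_scores, attack_scores)
    print(f"EER ~ {eer:.2f} at threshold {threshold}")
```

In this sketch the attack detection rate at a given threshold is simply 1 minus the FAR. With real data, the threshold sweep is usually done over a finer grid or with interpolation between adjacent thresholds.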
Index Terms
- Voice Presentation Attack Detection through Text-Converted Voice Command Analysis