Abstract
Audiovisual speech synchrony detection is an important liveness check for talking-face verification systems, because it ensures that the input biometric samples are actually acquired from the same source. In prior work, the visual speech features have mainly described facial appearance or mouth shape frame by frame, thus ignoring the lip motion between consecutive frames. Because visual speech dynamics are also important, we take spatiotemporal information into account and propose using space-time auto-correlation of gradients (STACOG) to measure audiovisual synchrony. To evaluate the effectiveness of the proposed approach, we design a set of challenging and realistic attack scenarios by augmenting the publicly available BANCA and XM2VTS datasets with synthetic replay attacks. Our experimental analysis shows that STACOG features outperform state-of-the-art features, e.g. those based on the discrete cosine transform, in measuring audiovisual synchrony.
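The abstract does not state how a synchrony score is computed from the audio and visual features. As an illustration only, the sketch below scores synchrony as the largest canonical correlation between two per-frame feature matrices. Canonical correlation is an assumed stand-in for the correlation measure, the random matrices are placeholders for real acoustic and STACOG features, and the function name `synchrony_score` is hypothetical.

```python
import numpy as np


def synchrony_score(audio_feats, visual_feats):
    """Largest canonical correlation between two (frames x dims) matrices.

    Hypothetical helper: an assumed correlation measure, not necessarily
    the one used in the paper.
    """
    # Center each modality so the correlation is not biased by the mean.
    a = audio_feats - audio_feats.mean(axis=0)
    v = visual_feats - visual_feats.mean(axis=0)
    # Orthonormal bases of the two column spaces (assumes full column rank).
    qa, _ = np.linalg.qr(a)
    qv, _ = np.linalg.qr(v)
    # The singular values of qa^T qv are the canonical correlations.
    corrs = np.linalg.svd(qa.T @ qv, compute_uv=False)
    return float(np.clip(corrs[0], 0.0, 1.0))


# Synthetic placeholder data, NOT real speech or STACOG features.
rng = np.random.default_rng(0)
n_frames = 500
mixing = rng.standard_normal((4, 3))  # assumed linear audio-to-visual coupling

audio = rng.standard_normal((n_frames, 4))
# Genuine sample: the visual stream is driven by the same audio stream.
visual_genuine = audio @ mixing + 0.3 * rng.standard_normal((n_frames, 3))
# Replay-style attack: the visual stream comes from an unrelated recording.
other_audio = rng.standard_normal((n_frames, 4))
visual_attack = other_audio @ mixing + 0.3 * rng.standard_normal((n_frames, 3))

genuine_score = synchrony_score(audio, visual_genuine)
attack_score = synchrony_score(audio, visual_attack)
print(f"genuine: {genuine_score:.3f}  attack: {attack_score:.3f}")
```

In this toy setting the genuine pair scores well above the mismatched pair, so a threshold on the score would separate them. Real features are temporally correlated and noisier, so the margin on actual recordings may be much smaller.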
Acknowledgments
E. Boutellaa acknowledges the financial support of the Algerian MESRS and CDTA under grant number 060/PNE/ENS/FINLANDE/2014-2015. The support of the Academy of Finland and the Infotech Oulu Doctoral Program is also acknowledged.
Cite this article
Boutellaa, E., Boulkenafet, Z., Komulainen, J. et al. Audiovisual synchrony assessment for replay attack detection in talking face biometrics. Multimed Tools Appl 75, 5329–5343 (2016). https://doi.org/10.1007/s11042-015-2848-2