ABSTRACT
We propose detecting deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality leads to disharmony between the two, e.g., loss of lip-sync or unnatural facial and lip movements. MDS is computed as the mean of the dissimilarity scores between audio and visual segments in a video. Discriminative features are learned for the audio and visual channels in a chunk-wise manner, employing a cross-entropy loss for the individual modalities and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.
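The MDS computation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, embedding dimensionality, and the specific margin-based contrastive loss are illustrative assumptions, and per-chunk audio/visual embeddings are assumed to be already extracted by the respective sub-networks.

```python
# Minimal sketch (assumptions, not the paper's code): MDS as the mean of
# per-chunk audio-visual dissimilarities, plus a margin-based contrastive loss.
import numpy as np

def chunk_dissimilarity(a, v):
    """Euclidean distance between one audio and one visual chunk embedding."""
    return float(np.linalg.norm(a - v))

def mds(audio_chunks, visual_chunks):
    """MDS = mean of per-chunk audio-visual dissimilarity scores."""
    return float(np.mean([chunk_dissimilarity(a, v)
                          for a, v in zip(audio_chunks, visual_chunks)]))

def contrastive_loss(a, v, is_real, margin=1.0):
    """Pull the two modalities together for real videos; push them at
    least `margin` apart for fakes (margin value is an assumption)."""
    d = chunk_dissimilarity(a, v)
    return d ** 2 if is_real else max(margin - d, 0.0) ** 2

# Toy check: a video whose modalities agree scores lower than one whose
# visual chunks have been perturbed (standing in for manipulation).
rng = np.random.default_rng(0)
audio = [rng.standard_normal(128) for _ in range(10)]
visual_fake = [e + rng.standard_normal(128) for e in audio]
assert mds(audio, audio) < mds(audio, visual_fake)
```

A video would then be flagged as fake when its MDS exceeds a threshold tuned on a validation set, and the per-chunk dissimilarities directly support the temporal localization of manipulated segments mentioned above.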
Not Made for Each Other: Audio-Visual Dissonance-based Deepfake Detection and Localization