ABSTRACT
We propose detecting deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality leads to disharmony between the two, e.g., loss of lip-sync or unnatural facial and lip movements. MDS is computed as the mean of the dissimilarity scores between audio and visual segments in a video. Discriminative features are learned for the audio and visual channels in a chunk-wise manner, employing a cross-entropy loss for the individual modalities and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.
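The MDS computation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, embedding dimensionality, and the specific margin-based contrastive loss are illustrative assumptions, and per-chunk audio/visual embeddings are assumed to be already extracted by the respective sub-networks.

```python
# Minimal sketch (assumptions, not the paper's code): MDS as the mean of
# per-chunk audio-visual dissimilarities, plus a margin-based contrastive loss.
import numpy as np

def chunk_dissimilarity(a, v):
    """Euclidean distance between one audio and one visual chunk embedding."""
    return float(np.linalg.norm(a - v))

def mds(audio_chunks, visual_chunks):
    """MDS = mean of per-chunk audio-visual dissimilarity scores."""
    return float(np.mean([chunk_dissimilarity(a, v)
                          for a, v in zip(audio_chunks, visual_chunks)]))

def contrastive_loss(a, v, is_real, margin=1.0):
    """Pull the two modalities together for real videos; push them at
    least `margin` apart for fakes (margin value is an assumption)."""
    d = chunk_dissimilarity(a, v)
    return d ** 2 if is_real else max(margin - d, 0.0) ** 2

# Toy check: a video whose modalities agree scores lower than one whose
# visual chunks have been perturbed (standing in for manipulation).
rng = np.random.default_rng(0)
audio = [rng.standard_normal(128) for _ in range(10)]
visual_fake = [e + rng.standard_normal(128) for e in audio]
assert mds(audio, audio) < mds(audio, visual_fake)
```

A video would then be flagged as fake when its MDS exceeds a threshold tuned on a validation set, and the per-chunk dissimilarities directly support the temporal localization of manipulated segments mentioned above.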
Not Made for Each Other: Audio-Visual Dissonance-based Deepfake Detection and Localization