DOI: 10.1145/3394171.3413700
Research article, MM '20 Conference Proceedings

Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization

Published: 12 October 2020

ABSTRACT

We propose detecting deepfake videos from the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality leads to disharmony between the two, e.g., loss of lip-sync and unnatural facial and lip movements. MDS is computed as the mean aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing a cross-entropy loss for the individual modalities and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state of the art by up to 7%. We also demonstrate temporal forgery localization, showing how our technique identifies the manipulated video segments.
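The abstract specifies the two ingredients precisely enough to sketch them: a per-chunk contrastive term over audio-visual distances combined with per-modality cross-entropy losses, and a video-level MDS obtained by averaging chunk distances. The PyTorch sketch below is our illustration under those assumptions, not the authors' released code; the margin, the cross-entropy weight w_ce, and the decision threshold are hypothetical placeholders, and the audio/visual encoders producing the embeddings are assumed to exist elsewhere.

    # Minimal sketch of the MDS idea -- an illustration, not the authors' code.
    import torch
    import torch.nn.functional as F

    def contrastive_chunk_loss(audio_feat, visual_feat, label, margin=1.0):
        """Contrastive loss over per-chunk audio/visual embeddings.

        audio_feat, visual_feat: (B, D) chunk embeddings.
        label: (B,) float tensor, 0 = real chunk, 1 = fake chunk.
        Real chunks pull the modalities together; fake chunks push
        them at least `margin` apart (margin value is an assumption).
        """
        d = F.pairwise_distance(audio_feat, visual_feat)            # (B,)
        loss_real = (1.0 - label) * d.pow(2)
        loss_fake = label * torch.clamp(margin - d, min=0).pow(2)
        return (loss_real + loss_fake).mean()

    def total_loss(audio_feat, visual_feat, audio_logits, visual_logits,
                   label, margin=1.0, w_ce=1.0):
        """Contrastive term plus per-modality cross-entropy terms, as the
        abstract describes; the weighting w_ce is an assumption."""
        ce = F.cross_entropy(audio_logits, label.long()) \
           + F.cross_entropy(visual_logits, label.long())
        return contrastive_chunk_loss(audio_feat, visual_feat, label,
                                      margin) + w_ce * ce

    def modality_dissonance_score(audio_feats, visual_feats):
        """Video-level MDS: mean of chunk-wise audio-visual distances.
        Also returns the per-chunk distances, since chunks whose distance
        exceeds a tuned threshold localize the manipulated segments."""
        d = F.pairwise_distance(audio_feats, visual_feats)          # (T,)
        return d.mean(), d

At test time, a video whose MDS exceeds a threshold tuned on validation data would be flagged as fake, and the individual chunk distances indicate which segments were manipulated, matching the temporal localization described in the abstract.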


Supplemental Material

3394171.3413700.mp4 (mp4, 94.3 MB)


Published in

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020, 4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
Copyright © 2020 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions (24%)

