ABSTRACT
Real-world multimedia data is often composed of multiple modalities, such as an image or a video with associated text (e.g., captions, user comments, etc.) and metadata. Such multimodal data packages are prone to manipulation, where a subset of these modalities can be altered to misrepresent or repurpose data packages, possibly with malicious intent. It is therefore important to develop methods to assess or verify the integrity of these multimedia packages. Using computer vision and natural language processing methods to directly compare the image (or video) and the associated caption to verify the integrity of a media package is only possible for a limited set of objects and scenes. In this paper, we present a novel deep-learning-based approach that uses a reference set of multimedia packages to assess the semantic integrity of multimedia packages containing images and captions. We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset, in a framework that also provides image-caption consistency scores (ICCSs). The integrity of a query media package is then assessed as the inlierness of its ICCS with respect to the reference dataset. We present the MultimodAl Information Manipulation dataset (MAIM), a new dataset of media packages from Flickr, which we are making available to the research community. We use the newly created dataset as well as the Flickr30K and MS COCO datasets to quantitatively evaluate our proposed approach. Notably, the reference dataset does not contain unmanipulated versions of the tampered query packages. Our method achieves F1 scores of 0.75, 0.89 and 0.94 on MAIM, Flickr30K and MS COCO, respectively, for detecting semantically incoherent media packages.
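The pipeline sketched in the abstract (joint embedding → image-caption consistency score → inlierness test against the reference set) can be illustrated in miniature. This is not the authors' implementation: here the joint embedding is stood in for by precomputed vectors, the ICCS by cosine similarity, and the novelty detector by a simple percentile threshold over reference scores; all names and numbers are hypothetical.

```python
import numpy as np

def iccs(image_emb, caption_emb):
    """Image-caption consistency score: cosine similarity between the
    two modality vectors in a (hypothetical) joint embedding space."""
    return float(np.dot(image_emb, caption_emb) /
                 (np.linalg.norm(image_emb) * np.linalg.norm(caption_emb)))

def fit_inlier_threshold(reference_scores, quantile=5.0):
    """Stand-in for a one-class novelty detector: treat the lowest
    `quantile` percent of reference ICCSs as the outlier cut-off."""
    return float(np.percentile(reference_scores, quantile))

def is_consistent(query_score, threshold):
    """Flag a query package as tampered when its ICCS falls below
    the threshold learned from the untampered reference set."""
    return query_score >= threshold

# Toy example: reference packages have well-aligned image/caption
# embeddings; a repurposed caption points in a different direction.
rng = np.random.default_rng(0)
ref_scores = [iccs(v, v + 0.05 * rng.standard_normal(8))
              for v in rng.standard_normal((100, 8))]
thr = fit_inlier_threshold(ref_scores)

aligned = np.ones(8)
consistent = iccs(aligned, aligned + 0.01)  # near-parallel pair
mismatched = iccs(aligned, -aligned)        # opposite direction
print(is_consistent(consistent, thr), is_consistent(mismatched, thr))
```

In the paper's setting, the threshold step would be replaced by a learned one-class model over reference ICCSs, but the decision rule — inlier means consistent, outlier means possibly tampered — is the same.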
Index Terms: Multimedia Semantic Integrity Assessment Using Joint Embedding of Images and Text