ABSTRACT
Real-world multimedia data is often composed of multiple modalities, such as an image or a video with associated text (e.g., captions, user comments, etc.) and metadata. Such multimodal data packages are prone to manipulation, where a subset of these modalities can be altered to misrepresent or repurpose data packages, possibly with malicious intent. It is therefore important to develop methods to assess or verify the integrity of these multimedia packages. Using computer vision and natural language processing methods to directly compare the image (or video) and the associated caption to verify the integrity of a media package is only possible for a limited set of objects and scenes. In this paper, we present a novel deep-learning-based approach that uses a reference set of multimedia packages to assess the semantic integrity of multimedia packages containing images and captions. We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset, in a framework that also provides image-caption consistency scores (ICCSs). The integrity of a query media package is then assessed as the inlierness of its ICCS with respect to the reference dataset. We present the MultimodAl Information Manipulation dataset (MAIM), a new dataset of media packages from Flickr, which we are making available to the research community. We use the newly created dataset as well as the Flickr30K and MS COCO datasets to quantitatively evaluate our proposed approach. Notably, the reference dataset does not contain unmanipulated versions of the tampered query packages. Our method achieves F1 scores of 0.75, 0.89 and 0.94 on MAIM, Flickr30K and MS COCO, respectively, for detecting semantically incoherent media packages.
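The pipeline sketched in the abstract (joint embedding → image-caption consistency score → inlierness test against the reference set) can be illustrated in miniature. This is not the authors' implementation: here the joint embedding is stood in for by precomputed vectors, the ICCS by cosine similarity, and the novelty detector by a simple percentile threshold over reference scores; all names and numbers are hypothetical.

```python
import numpy as np

def iccs(image_emb, caption_emb):
    """Image-caption consistency score: cosine similarity between the
    two modality vectors in a (hypothetical) joint embedding space."""
    return float(np.dot(image_emb, caption_emb) /
                 (np.linalg.norm(image_emb) * np.linalg.norm(caption_emb)))

def fit_inlier_threshold(reference_scores, quantile=5.0):
    """Stand-in for a one-class novelty detector: treat the lowest
    `quantile` percent of reference ICCSs as the outlier cut-off."""
    return float(np.percentile(reference_scores, quantile))

def is_consistent(query_score, threshold):
    """Flag a query package as tampered when its ICCS falls below
    the threshold learned from the untampered reference set."""
    return query_score >= threshold

# Toy example: reference packages have well-aligned image/caption
# embeddings; a repurposed caption points in a different direction.
rng = np.random.default_rng(0)
ref_scores = [iccs(v, v + 0.05 * rng.standard_normal(8))
              for v in rng.standard_normal((100, 8))]
thr = fit_inlier_threshold(ref_scores)

aligned = np.ones(8)
consistent = iccs(aligned, aligned + 0.01)  # near-parallel pair
mismatched = iccs(aligned, -aligned)        # opposite direction
print(is_consistent(consistent, thr), is_consistent(mismatched, thr))
```

In the paper's setting, the threshold step would be replaced by a learned one-class model over reference ICCSs, but the decision rule — inlier means consistent, outlier means possibly tampered — is the same.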
Index Terms: Multimedia Semantic Integrity Assessment Using Joint Embedding of Images and Text