Multimedia Semantic Integrity Assessment Using Joint Embedding Of Images And Text

Research article (Public Access) · DOI: 10.1145/3123266.3123385 · Published: 19 October 2017

ABSTRACT

Real-world multimedia data is often composed of multiple modalities such as an image or a video with associated text (e.g., captions, user comments, etc.) and metadata. Such multimodal data packages are prone to manipulations, where a subset of these modalities can be altered to misrepresent or repurpose data packages, with possible malicious intent. It is therefore important to develop methods to assess or verify the integrity of these multimedia packages. Using computer vision and natural language processing methods to directly compare the image (or video) and the associated caption to verify the integrity of a media package is only possible for a limited set of objects and scenes. In this paper we present a novel deep-learning-based approach that uses a reference set of multimedia packages to assess the semantic integrity of multimedia packages containing images and captions. We construct a joint embedding of images and captions with deep multimodal representation learning on the reference dataset in a framework that also provides image-caption consistency scores (ICCSs). The integrity of query media packages is assessed as the inlierness of the query ICCSs with respect to the reference dataset. We present the MultimodAl Information Manipulation dataset (MAIM), a new dataset of media packages from Flickr, which we are making available to the research community. We use both the newly created dataset as well as Flickr30K and MS COCO datasets to quantitatively evaluate our proposed approach. The reference dataset does not contain unmanipulated versions of tampered query packages. Our method is able to achieve F1 scores of 0.75, 0.89, and 0.94 on MAIM, Flickr30K, and MS COCO, respectively, for detecting semantically incoherent media packages.
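The abstract describes a two-stage pipeline: a learned joint embedding that yields an image-caption consistency score (ICCS) for each package, followed by outlier detection of query ICCSs against the reference set. The Python sketch below illustrates only the second stage, under loud assumptions: the cosine-similarity ICCS is a stand-in for the score produced by the paper's deep multimodal model, IsolationForest is one plausible off-the-shelf outlier detector (not necessarily the authors' choice), and all function names and parameters here are hypothetical.

```python
# Illustrative sketch only: cosine similarity stands in for the paper's
# learned ICCS, and IsolationForest is one plausible outlier detector.
import numpy as np
from sklearn.ensemble import IsolationForest

def iccs(image_vec: np.ndarray, caption_vec: np.ndarray) -> float:
    """Image-caption consistency score: cosine similarity between the two
    vectors in an (assumed) joint embedding space."""
    return float(np.dot(image_vec, caption_vec)
                 / (np.linalg.norm(image_vec) * np.linalg.norm(caption_vec)))

def fit_reference_detector(reference_scores: np.ndarray) -> IsolationForest:
    """Fit an outlier detector on ICCSs computed over the reference set
    of (presumed coherent) media packages."""
    detector = IsolationForest(contamination=0.05, random_state=0)
    detector.fit(reference_scores.reshape(-1, 1))
    return detector

def is_coherent(detector: IsolationForest, query_score: float) -> bool:
    """A query package passes the integrity check if its ICCS is an
    inlier with respect to the reference distribution."""
    return detector.predict(np.array([[query_score]]))[0] == 1

# Toy usage with synthetic scores (coherent packages tend to score high):
rng = np.random.default_rng(0)
reference = rng.normal(0.7, 0.1, size=500)
detector = fit_reference_detector(reference)
print(is_coherent(detector, 0.72))  # expected: True  (inlier)
print(is_coherent(detector, 0.05))  # expected: False (likely manipulated)
```

Note the design point the abstract emphasizes: the detector never sees unmanipulated versions of tampered query packages; manipulation is flagged purely by how atypical a query's consistency score is relative to the reference set.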


Published in

MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017 · 2028 pages
ISBN: 978-1-4503-4906-2
DOI: 10.1145/3123266
Copyright © 2017 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

MM '17 paper acceptance rate: 189 of 684 submissions (28%). Overall acceptance rate: 995 of 4,171 submissions (24%).
