
Robust Visual-Textual Sentiment Analysis: When Attention meets Tree-structured Recursive Neural Networks

Published: 01 October 2016
DOI: 10.1145/2964284.2964288

ABSTRACT

Sentiment analysis is crucial for extracting social signals from social media content. Due to the huge variation in social media content, the performance of sentiment classifiers that use a single modality (visual or textual) remains unsatisfactory. In this paper, we propose a new framework that integrates textual and visual information for robust sentiment analysis. Unlike previous work, we believe visual and textual information should be treated jointly and in a structured fashion. Our system first builds a semantic tree structure based on sentence parsing, aimed at aligning textual words with image regions for accurate analysis. Next, our system learns a robust joint visual-textual semantic representation by incorporating 1) an attention mechanism with LSTM (long short-term memory) and 2) an auxiliary semantic learning task. Extensive experimental results on several well-known datasets show that our method outperforms existing state-of-the-art joint models in sentiment analysis. We also investigate different tree-structured LSTM (T-LSTM) variants and analyze the effect of the attention mechanism, in order to provide deeper insight into how attention helps the learning of the joint visual-textual sentiment classifier.
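
The core mechanism the abstract describes, composing a parse tree bottom-up with a tree-structured LSTM while attending over image-region features, can be illustrated with a short sketch. The following is a minimal hypothetical example in PyTorch, not the authors' implementation: the binary child composition, the additive attention form, the fusion by concatenation, and all names and dimensions (AttentiveTreeLSTMCell, regions, hidden_dim, region_dim) are assumptions made for illustration based only on the abstract.

    # Hypothetical sketch, not the authors' released code: a binary tree-LSTM
    # cell that attends over image-region features while composing its two
    # children. Names, dimensions, and the attention/fusion forms are assumed.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentiveTreeLSTMCell(nn.Module):
        def __init__(self, hidden_dim, region_dim):
            super().__init__()
            in_dim = 2 * hidden_dim + region_dim          # two children + visual context
            self.iou = nn.Linear(in_dim, 3 * hidden_dim)  # input, output, update gates
            self.f = nn.Linear(in_dim, 2 * hidden_dim)    # one forget gate per child
            self.att = nn.Linear(hidden_dim + region_dim, 1)  # additive attention score

        def attend(self, h_query, regions):
            # Score each image region against the current composition state,
            # then return the softmax-weighted average region feature.
            q = h_query.expand(regions.size(0), -1)
            scores = self.att(torch.cat([q, regions], dim=1)).squeeze(1)
            alpha = F.softmax(scores, dim=0)
            return alpha @ regions

        def forward(self, left, right, regions):
            (h_l, c_l), (h_r, c_r) = left, right
            v = self.attend(h_l + h_r, regions)           # attended visual context
            x = torch.cat([h_l, h_r, v])
            i, o, u = self.iou(x).chunk(3)
            f_l, f_r = torch.sigmoid(self.f(x)).chunk(2)
            c = torch.sigmoid(i) * torch.tanh(u) + f_l * c_l + f_r * c_r
            h = torch.sigmoid(o) * torch.tanh(c)
            return h, c

    # Toy usage: combine two leaf states over a bank of five region features.
    cell = AttentiveTreeLSTMCell(hidden_dim=64, region_dim=512)
    leaf = lambda: (torch.zeros(64), torch.zeros(64))
    h, c = cell(leaf(), leaf(), torch.randn(5, 512))

In a full system of this kind, leaf states would presumably be initialized from word embeddings, and the root's hidden state would feed both the sentiment classifier and the auxiliary semantic learning task mentioned in the abstract.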

Published in

MM '16: Proceedings of the 24th ACM International Conference on Multimedia
Association for Computing Machinery, New York, NY, United States
October 2016, 1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284
Copyright © 2016 ACM