
Robust Visual-Textual Sentiment Analysis: When Attention meets Tree-structured Recursive Neural Networks

Published: 01 October 2016
DOI: 10.1145/2964284.2964288

ABSTRACT

Sentiment analysis is crucial for extracting social signals from social media content. Due to the huge variation in social media content, the performance of sentiment classifiers that use a single modality (visual or textual) remains unsatisfactory. In this paper, we propose a new framework that integrates textual and visual information for robust sentiment analysis. Unlike previous work, we believe visual and textual information should be treated jointly and in a structured fashion. Our system first builds a semantic tree structure based on sentence parsing, aimed at aligning textual words with image regions for accurate analysis. Next, our system learns a robust joint visual-textual semantic representation by incorporating 1) an attention mechanism with LSTM (long short-term memory) and 2) an auxiliary semantic learning task. Extensive experimental results on several well-known datasets show that our method outperforms existing state-of-the-art joint models in sentiment analysis. We also investigate different tree-structured LSTM (T-LSTM) variants and analyze the effect of the attention mechanism, in order to provide deeper insight into how attention helps the learning of the joint visual-textual sentiment classifier.
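
The core mechanism the abstract describes, composing a parse tree bottom-up with a tree-structured LSTM while attending over image-region features, can be illustrated with a short sketch. The following is a minimal hypothetical example in PyTorch, not the authors' implementation: the binary child composition, the additive attention form, the fusion by concatenation, and all names and dimensions (AttentiveTreeLSTMCell, regions, hidden_dim, region_dim) are assumptions made for illustration based only on the abstract.

    # Hypothetical sketch, not the authors' released code: a binary tree-LSTM
    # cell that attends over image-region features while composing its two
    # children. Names, dimensions, and the attention/fusion forms are assumed.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentiveTreeLSTMCell(nn.Module):
        def __init__(self, hidden_dim, region_dim):
            super().__init__()
            in_dim = 2 * hidden_dim + region_dim          # two children + visual context
            self.iou = nn.Linear(in_dim, 3 * hidden_dim)  # input, output, update gates
            self.f = nn.Linear(in_dim, 2 * hidden_dim)    # one forget gate per child
            self.att = nn.Linear(hidden_dim + region_dim, 1)  # additive attention score

        def attend(self, h_query, regions):
            # Score each image region against the current composition state,
            # then return the softmax-weighted average region feature.
            q = h_query.expand(regions.size(0), -1)
            scores = self.att(torch.cat([q, regions], dim=1)).squeeze(1)
            alpha = F.softmax(scores, dim=0)
            return alpha @ regions

        def forward(self, left, right, regions):
            (h_l, c_l), (h_r, c_r) = left, right
            v = self.attend(h_l + h_r, regions)           # attended visual context
            x = torch.cat([h_l, h_r, v])
            i, o, u = self.iou(x).chunk(3)
            f_l, f_r = torch.sigmoid(self.f(x)).chunk(2)
            c = torch.sigmoid(i) * torch.tanh(u) + f_l * c_l + f_r * c_r
            h = torch.sigmoid(o) * torch.tanh(c)
            return h, c

    # Toy usage: combine two leaf states over a bank of five region features.
    cell = AttentiveTreeLSTMCell(hidden_dim=64, region_dim=512)
    leaf = lambda: (torch.zeros(64), torch.zeros(64))
    h, c = cell(leaf(), leaf(), torch.randn(5, 512))

In a full system of this kind, leaf states would presumably be initialized from word embeddings, and the root's hidden state would feed both the sentiment classifier and the auxiliary semantic learning task mentioned in the abstract.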

Published in

MM '16: Proceedings of the 24th ACM International Conference on Multimedia
Association for Computing Machinery, New York, NY, United States
October 2016, 1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284
Copyright © 2016 ACM