Robust Visual-Textual Sentiment Analysis: When Attention Meets Tree-structured Recursive Neural Networks

ABSTRACT
Sentiment analysis is crucial for extracting social signals from social media content. Owing to the huge variation in social media content, the performance of sentiment classifiers that rely on a single modality (visual or textual) remains unsatisfactory. In this paper, we propose a new framework that integrates textual and visual information for robust sentiment analysis. Unlike previous work, we treat visual and textual information jointly in a structural fashion. Our system first builds a semantic tree structure based on sentence parsing, aimed at aligning textual words and image regions for accurate analysis. Next, it learns a robust joint visual-textual semantic representation by incorporating 1) an attention mechanism with LSTM (long short-term memory) and 2) an auxiliary semantic learning task. Extensive experimental results on several well-known data sets show that our method outperforms existing state-of-the-art joint models in sentiment analysis. We also investigate different tree-structured LSTM (T-LSTM) variants and analyze the effect of the attention mechanism, in order to provide deeper insight into how attention helps the learning of the joint visual-textual sentiment classifier.
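To make the architecture sketched above concrete, the following is a minimal, purely illustrative sketch (not the authors' implementation): a child-sum Tree-LSTM node in the style of Tai et al., whose input fuses a word embedding with an attention-weighted sum of image region features. The function names (`attend`, `tree_lstm_node`), the scalar shared gate weights, and the additive fusion are all simplifying assumptions for clarity.

```python
# Illustrative sketch of a tree-structured LSTM with soft attention over
# image regions. Names, dimensions, and the scalar gate weights W, U are
# assumptions for brevity, not the paper's actual parameterization.
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attend(query, regions):
    """Soft attention: weight image-region vectors by similarity to the query."""
    weights = softmax([dot(query, r) for r in regions])
    dim = len(regions[0])
    return [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]


def tree_lstm_node(x, children, W=0.5, U=0.5):
    """Child-sum Tree-LSTM node with shared scalar gate weights.

    x: fused input vector for this tree node.
    children: list of (h, c) state pairs from child nodes (empty for leaves).
    Returns the (h, c) state of this node.
    """
    dim = len(x)
    h_sum = ([sum(h[k] for h, _ in children) for k in range(dim)]
             if children else [0.0] * dim)
    i_gate = [sigmoid(W * x[k] + U * h_sum[k]) for k in range(dim)]
    o_gate = [sigmoid(W * x[k] + U * h_sum[k]) for k in range(dim)]
    u = [math.tanh(W * x[k] + U * h_sum[k]) for k in range(dim)]
    c = [i_gate[k] * u[k] for k in range(dim)]
    # One forget gate per child, conditioned on that child's hidden state.
    for h_child, c_child in children:
        f = [sigmoid(W * x[k] + U * h_child[k]) for k in range(dim)]
        c = [c[k] + f[k] * c_child[k] for k in range(dim)]
    h = [o_gate[k] * math.tanh(c[k]) for k in range(dim)]
    return h, c
```

In use, each word node would attend over the region features, fuse the attended context with its word embedding (here, by simple addition), and feed the result through the Tree-LSTM, composing states bottom-up along the parse tree.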