Published in: International Journal of Computer Vision, Issue 2-4/2018

13.09.2017

Transferring Deep Object and Scene Representations for Event Recognition in Still Images

Authors: Limin Wang, Zhe Wang, Yu Qiao, Luc Van Gool


Abstract

This paper addresses the problem of image-based event recognition by transferring deep representations learned from object and scene datasets. First, we empirically investigate the correlation among the concepts of object, scene, and event, which motivates our representation transfer methods. Based on this empirical study, we propose an iterative selection method to identify the subset of object and scene classes most relevant for representation transfer. We then develop three transfer techniques: (1) initialization-based transfer, (2) knowledge-based transfer, and (3) data-based transfer. These transfer techniques exploit multitask learning frameworks to incorporate extra knowledge from other networks or additional datasets into the fine-tuning procedure of event CNNs, and they prove effective in reducing over-fitting and improving the generalization ability of the learned CNNs. We perform experiments on four event recognition benchmarks: the ChaLearn LAP Cultural Event Recognition dataset, the Web Image Dataset for Event Recognition, the UIUC Sports Event dataset, and the Photo Event Collection dataset. The experimental results show that our proposed algorithm successfully transfers object and scene representations to the event datasets and achieves state-of-the-art performance on all of them.
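The multitask fine-tuning idea in the abstract can be illustrated with a small sketch: a hard-label cross-entropy on the event classes combined with a soft-target cross-entropy that keeps an auxiliary head close to the output distribution of a pretrained object/scene network. This NumPy sketch is purely illustrative; the function names, the two-head setup, and the weighting factor `lam` are assumptions for exposition, not the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multitask_transfer_loss(event_logits, event_labels,
                            aux_logits, teacher_probs, lam=0.5):
    """Hard-label cross-entropy on event classes plus a soft-target
    cross-entropy matching an auxiliary head to the object/scene
    teacher's predicted distribution, weighted by lam."""
    n = event_logits.shape[0]
    p_event = softmax(event_logits)
    # Standard cross-entropy against the ground-truth event labels.
    hard = -np.log(p_event[np.arange(n), event_labels] + 1e-12).mean()
    # Soft cross-entropy against the teacher's output distribution.
    p_aux = softmax(aux_logits)
    soft = -(teacher_probs * np.log(p_aux + 1e-12)).sum(axis=1).mean()
    return hard + lam * soft
```

With `lam = 0` this reduces to plain fine-tuning on event labels; increasing `lam` trades event-label fit against staying close to the object/scene teacher, which is the regularizing effect the abstract attributes to the multitask frameworks.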


Metadata
Title
Transferring Deep Object and Scene Representations for Event Recognition in Still Images
Authors
Limin Wang
Zhe Wang
Yu Qiao
Luc Van Gool
Publication date
13.09.2017
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 2-4/2018
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-017-1043-5
