Published in: International Journal of Computer Vision 4/2019

19.08.2018

Second-order Temporal Pooling for Action Recognition

Authors: Anoop Cherian, Stephen Gould

Abstract

Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated into video-level representations by computing statistics on these features. Typically, zeroth-order (max) or first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling, that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolutions of clip-level CNN features computed across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than its first-order counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking Activities and JHMDB, as well as the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes, which, when combined with hand-crafted features (as is standard practice), achieve state-of-the-art accuracy.
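
To make the pooling step concrete, the following NumPy sketch aggregates a sequence of clip-level CNN features into a second-order descriptor in the spirit of temporal correlation pooling. It is a minimal illustration, not the authors' exact pipeline: the regularizer eps, the log-Euclidean vectorization, and the function name are assumptions chosen so the output can feed a standard linear classifier; the paper's higher-order variant would additionally pass the features through a kernel feature map before the correlation step.

import numpy as np

def temporal_correlation_pooling(X, eps=1e-6):
    # X: (T, d) array with one d-dimensional CNN feature vector per clip.
    # C[i, j] is the inner product of the temporal trajectories of
    # features i and j, i.e., a co-activation (second-order) statistic.
    T, d = X.shape
    C = X.T @ X / T
    C += eps * np.eye(d)  # regularize so C stays symmetric positive definite
    # Log-Euclidean map: compute log(C) via the eigendecomposition so the
    # SPD descriptor can be compared with ordinary Euclidean machinery.
    w, V = np.linalg.eigh(C)
    logC = (V * np.log(w)) @ V.T
    # The matrix is symmetric; vectorizing its upper triangle suffices.
    return logC[np.triu_indices(d)]

# Example usage: 12 clips, each described by a 512-dimensional feature,
# pooled into a single vector of length d(d+1)/2.
X = np.random.randn(12, 512)
descriptor = temporal_correlation_pooling(X)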

Footnotes

1. As we fine-tune the VGG network from a pre-trained ImageNet model, we use \(\beta = 3\) for SMAID in our implementation.

2. With a slight abuse of previously introduced notation, we assume T to be the raw feature trajectories without any scaling or normalization.
Metadata
Title
Second-order Temporal Pooling for Action Recognition
Authors
Anoop Cherian
Stephen Gould
Publication date
19.08.2018
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 4/2019
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-018-1111-5
