
18-07-2018 | Original Article

End-to-end temporal attention extraction and human action recognition

Authors: Hong Zhang, Miao Xin, Shuhang Wang, Yifan Yang, Lei Zhang, Helong Wang

Published in: Machine Vision and Applications | Issue 7/2018


Abstract

Visual context is fundamental to understanding human actions in videos. However, the discriminative temporal information in videos is usually sparse: most frames are redundant and mixed with a large amount of interfering information, which can lead to wasted computation and recognition failures. Hence, an important question is how to employ temporal context information efficiently. In this paper, we propose a learnable temporal attention mechanism that automatically selects important time points from action sequences. We design an unsupervised Recurrent Temporal Sparse Autoencoder (RTSAE) network, which learns to extract sparse keyframes that sharpen discriminative power while retaining descriptive capability, and that shield the model from interfering information. By applying this technique to a dual-stream convolutional neural network, we significantly improve performance in both accuracy and efficiency. Experiments demonstrate that, with the help of the RTSAE, our method achieves results competitive with the state of the art on the UCF101 and HMDB51 datasets.
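
The abstract describes the mechanism only at a high level. As a rough illustration of the general idea it names (an unsupervised recurrent autoencoder whose per-frame attention weights are driven toward sparsity, so that only keyframes survive), here is a minimal PyTorch sketch. Everything concrete below, including the class name RTSAESketch, the layer sizes, the sigmoid attention head, and the L1 sparsity penalty, is an illustrative assumption, not the authors' actual RTSAE.

import torch
import torch.nn as nn

class RTSAESketch(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Encoder LSTM summarizes the per-frame CNN feature sequence.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # One attention score per frame, squashed to [0, 1] below.
        self.attn = nn.Linear(hidden_dim, 1)
        # Decoder LSTM reconstructs the full sequence from attended frames.
        self.decoder = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, x):
        # x: (batch, time, feat_dim) frame features from a CNN backbone.
        h, _ = self.encoder(x)
        a = torch.sigmoid(self.attn(h))  # (batch, time, 1) attention weights
        attended = a * x                 # non-key frames are suppressed
        recon, _ = self.decoder(attended)
        return recon, a

def rtsae_loss(x, recon, a, sparsity_weight=0.1):
    # Reconstruction keeps the selected frames descriptive of the whole
    # sequence; the L1 term pushes most attention weights toward zero,
    # leaving a sparse set of keyframes.
    return nn.functional.mse_loss(recon, x) + sparsity_weight * a.abs().mean()

Trained with rtsae_loss alone, such a model needs no action labels; at inference, frames whose attention weight exceeds a threshold would be kept as keyframes, and only those would be passed to the dual-stream classifier, which is where the efficiency gain would come from.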


Footnotes
1
Statistically, S-Net performs slightly better than T-Net in most cases; however, T-Net is irreplaceable in certain special cases.
 
Metadata
Title
End-to-end temporal attention extraction and human action recognition
Authors
Hong Zhang
Miao Xin
Shuhang Wang
Yifan Yang
Lei Zhang
Helong Wang
Publication date
18-07-2018
Publisher
Springer Berlin Heidelberg
Published in
Machine Vision and Applications / Issue 7/2018
Print ISSN: 0932-8092
Electronic ISSN: 1432-1769
DOI
https://doi.org/10.1007/s00138-018-0956-5
