Published in: Cognitive Computation 3/2022

09-11-2021

Action Recognition with a Multi-View Temporal Attention Network

Authors: Dengdi Sun, Zhixiang Su, Zhuanlian Ding, Bin Luo

Abstract

Action recognition is a fundamental and challenging task in computer vision. In recent years, optical flow, as auxiliary information complementing the frames of a video, has been widely applied to action recognition because it captures the motion information in video data. However, existing methods only fuse the classification probability scores of the two streams; they do not consider the interaction between image frames and optical flow. Another important challenge lies in capturing the significant motion information needed to recognize an action. To overcome these problems, this paper proposes an action recognition model based on a multi-view temporal attention mechanism. Specifically, global temporal attention pooling is first designed to fuse the features of multiple frames, giving more attention to discriminative frames. Second, exploiting the complementarity of image frames and optical flow, feature-level multi-view fusion methods are proposed. Experiments on three widely used action recognition benchmark datasets show that our method outperforms existing state-of-the-art methods. The effectiveness of the proposed method is further demonstrated under different factors, such as the temporal attention pooling strategy, multi-view feature fusion, and network architecture. The promising experimental results show that introducing the temporal attention layer and feature-level multi-view fusion is highly effective and overcomes the shortcomings of classical two-stream networks to some extent. In particular, the proposed method has the following advantages. First, the temporal attention layer accurately captures the key frames that are most conducive to recognizing actions. Second, the two kinds of features, from image frames and optical flow, are combined to make full use of their complementarity. Finally, a variety of fusion methods are employed for feature-level fusion instead of straightforward score fusion.
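To make the abstract's two key ideas concrete, the following minimal PyTorch-style sketch shows temporal attention pooling over per-frame features followed by feature-level fusion of the RGB and optical-flow streams. The module names, the single linear scoring layer, and concatenation as the fusion operator are illustrative assumptions; the paper's exact attention formulation is not given in the abstract, and it evaluates several fusion strategies rather than only concatenation.

```python
import torch
import torch.nn as nn


class TemporalAttentionPooling(nn.Module):
    """Fuses per-frame features into one clip-level feature by weighting
    each frame with a learned attention score (hypothetical sketch: a
    single linear scoring layer is assumed here)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # one scalar score per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, feat_dim), per-frame backbone features
        weights = torch.softmax(self.score(x), dim=1)  # (B, T, 1), sums to 1 over frames
        return (weights * x).sum(dim=1)                # (B, feat_dim), attention-weighted sum


class TwoStreamFeatureFusion(nn.Module):
    """Feature-level fusion of the RGB and optical-flow streams (here by
    concatenation), in place of classical two-stream score fusion."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.pool_rgb = TemporalAttentionPooling(feat_dim)
        self.pool_flow = TemporalAttentionPooling(feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb_feats: torch.Tensor, flow_feats: torch.Tensor) -> torch.Tensor:
        # Each stream is pooled over time, then the two clip-level
        # features are concatenated and classified jointly.
        fused = torch.cat([self.pool_rgb(rgb_feats),
                           self.pool_flow(flow_feats)], dim=-1)
        return self.classifier(fused)


# Usage sketch: 8 sampled frames, 2048-dim backbone features,
# 101 classes (e.g., UCF101); all sizes are illustrative.
model = TwoStreamFeatureFusion(feat_dim=2048, num_classes=101)
logits = model(torch.randn(4, 8, 2048), torch.randn(4, 8, 2048))
print(logits.shape)  # torch.Size([4, 101])
```

Because the softmax weights are learned end-to-end with the classifier, discriminative frames receive larger weights than redundant ones, which is the intuition behind the temporal attention pooling described above.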

Metadata
Title
Action Recognition with a Multi-View Temporal Attention Network
Authors
Dengdi Sun
Zhixiang Su
Zhuanlian Ding
Bin Luo
Publication date
09-11-2021
Publisher
Springer US
Published in
Cognitive Computation / Issue 3/2022
Print ISSN: 1866-9956
Electronic ISSN: 1866-9964
DOI
https://doi.org/10.1007/s12559-021-09951-5
