Published in: Multimedia Systems 6/2022

14-07-2021 | Special Issue Paper

Improved SSD using deep multi-scale attention spatial–temporal features for action recognition

Authors: Shuren Zhou, Jia Qiu, Arun Solanki

Abstract

The biggest difference between video-based and image-based action recognition is that the former carries an additional time dimension. Most deep-learning methods for action recognition either (1) use 3D convolutions to model temporal features, or (2) introduce an auxiliary temporal signal such as optical flow. However, 3D convolutional networks usually consume huge computational resources, while optical flow extraction requires a tedious extra process and extra storage, and typically models only short-range temporal features. To construct temporal features better, in this paper we propose a multi-scale attention spatial–temporal feature network based on SSD. We sparsely sample the video by dividing the whole sequence into segments over its full length, and use a self-attention mechanism to capture the relation between each frame and the sequence of frames sampled across the entire video, so that the network attends to the representative frames in the sequence. Moreover, the attention mechanism assigns different weights to inter-frame relations at different time scales, so as to reason about the contextual relations of actions in the time dimension. Our proposed method achieves competitive performance on two commonly used datasets: UCF101 and HMDB51.
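The two core ideas of the abstract, sparse segment-wise sampling over the full video length and self-attention over the sampled frame features, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function names, feature dimensions, and the use of shared queries/keys (plain dot-product attention without learned projections) are illustrative assumptions.

```python
import numpy as np

def sparse_sample_indices(num_frames, num_segments, rng=None):
    """Divide the video into equal-length segments and draw one frame
    index from each, giving sparse coverage of the whole temporal range
    (the piecewise long-range sampling described in the abstract)."""
    if rng is None:
        rng = np.random.default_rng(0)
    edges = np.linspace(0, num_frames, num_segments + 1).astype(int)
    return np.array([int(rng.integers(lo, hi))
                     for lo, hi in zip(edges[:-1], edges[1:])])

def self_attention(x):
    """Scaled dot-product self-attention over a (T, d) sequence of frame
    features: each frame is related to every sampled frame, and the
    softmax weights emphasize the most representative frames."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                  # (T, T) frame-to-frame relations
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # rows sum to 1
    return weights @ x                             # attention-weighted features

# Sample 8 frames from a 250-frame clip and attend over their features.
idx = sparse_sample_indices(num_frames=250, num_segments=8)
features = np.random.default_rng(1).normal(size=(8, 32))  # stand-in CNN features
attended = self_attention(features)
```

Because each index is drawn from its own disjoint segment, the sampled frames are guaranteed to span the entire video rather than cluster in one short range, which is what distinguishes this scheme from short-range temporal modeling.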


Literature
1. Fusier, F., Valentin, V., Bremond, F.: Video understanding for complex activity recognition. Mach. Vis. Appl. 18(3–4), 167–188 (2007)
2. Qin, J., Li, H., Xiang, X., Tan, Y., Pan, W., Ma, W., Xiong, N.N.: An encrypted image retrieval method based on Harris corner optimization and LSH in cloud computing. IEEE Access 7(1), 24626–24633 (2019)
3. Gu, K., Jia, W., Wang, G., et al.: Efficient and secure attribute-based signature for monotone predicates. Acta Inform. 54, 521–541 (2017)
5. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space–time shapes. TPAMI 29(12), 2247–2253 (2007)
6. Jia, K., Yeung, D.-Y.: Human action recognition using local spatio-temporal discriminant embedding. In: CVPR, p. 1 (2008)
7. Klaeser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC, p. 1 (2008)
8. Wang, H., Schmid, C.: Action recognition with improved trajectories. ICCV 1(5), 8 (2013)
9. Laptev, I.: On space–time interest points. IJCV 64(2–3), 5 (2005)
11. He, S., Li, Z., Tang, Y., Liao, Z., Li, F., Lim, S.-J.: Parameters compressing in deep learning. CMC 62(1), 321–336 (2020)
12. Tang, Q., Xie, M.Z., Yang, K., Yuansheng, L., Dongdai, Z., Yun, S.: A decision function based smart charging and discharging strategy for electric vehicle in smart grid. Mob. Netw. Appl. 24, 1722–1731 (2019)
13. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012)
15. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R.: Large-scale video classification with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
16. Long, M., Peng, F., Li, H.: Separable reversible data hiding and encryption for HEVC video. J. Real Time Image Proc. 14, 171–182 (2018)
17. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009)
18. Zhang, J., Jin, X., Sun, J., Wang, J., Arun, K.S.: Spatial and semantic convolutional features for robust visual object tracking. In: Multimedia Tools and Applications, pp. 15095–15115 (2020)
19. Gui, Y., Zeng, G.: Joint learning of visual and spatial features for edit propagation from a single image. Vis. Comput. 36, 469–482 (2019)
20. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
21. Wang, L.M., Xiong, Y.J., Wang, Z., Qiao, Y., Lin, D.H., et al.: Temporal segment networks: towards good practices for deep action recognition. In: European Conference on Computer Vision, pp. 20–36 (2016)
22. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., Berg, A.C.: SSD: single shot multibox detector. In: European Conference on Computer Vision, pp. 21–37 (2016)
23. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 886–893 (2005)
24. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, pp. 1–8 (2008)
25. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: IEEE International Conference on Computer Vision (ICCV), Santiago, pp. 4489–4497 (2015)
26. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459 (2018)
28. Li, C., Zhong, Q., Xie, D., et al.: Collaborative spatio-temporal feature learning for video action recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
29. Peng, Y.X., Zhao, Y.Z., Zhang, J.C.: Two-stream collaborative learning with spatial-temporal attention for video classification. IEEE Trans. Circuits Syst. Video Technol. 29(3), 773–786 (2018)
30. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. CVPR 2(4), 5 (2017)
31. Sun, S., Kuang, Z., Ouyang, W., Sheng, L., Zhang, W.: Optical flow guided feature: a fast and robust motion representation for video action recognition (2017). arXiv:1711.11152
32. Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: learning optical flow with convolutional networks. In: International Conference on Computer Vision (ICCV) (2015)
33. Zhu, Y., Lan, Z.Z., Newsam, S., Hauptmann, A.: Hidden two-stream convolutional networks for action recognition. In: Asian Conference on Computer Vision, pp. 363–378 (2018)
35. Mnih, V., Heess, N.: Recurrent models of visual attention. In: Advances in Neural Information Processing Systems (NIPS) (2014)
36. Qu, Z.W., Cao, B.Y., Wang, X.R., Li, F., Xu, P.R., et al.: Feedback LSTM network based on attention for image description generator. Comput. Mater. Contin. 59(2), 575–589 (2019)
37. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Neural Information Processing Systems (NIPS) (2017)
38. Zhao, Z., Elgammal, A.M.: Information theoretic key frame selection for action recognition. In: British Machine Vision Conference (BMVC), pp. 1–10 (2008)
39. Liu, L., Shao, L., Rockett, P.: Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition. Pattern Recogn. 46(7), 1810–1818 (2013)
40. Santoro, A., Raposo, D., Barrett, D.G., Malinowski, M., Pascanu, R., Battaglia, P., Lillicrap, T.: A simple neural network module for relational reasoning (2017). arXiv:1706.01427
44. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR (2018)
45. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild (2012). arXiv:1212.0402
46. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: International Conference on Computer Vision, pp. 2556–2563 (2011)
47. Cai, Z.W., Wang, L.M., Peng, X.J., Qiao, Y.: Multi-view super vector for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 596–603 (2014)
48. Kantorov, V., Laptev, I.: Efficient feature extraction, encoding and classification for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2593–2600 (2014)
49. Zhang, B.W., Wang, L.M., Wang, Z., Qiao, Y., Wang, H.L.: Real-time action recognition with enhanced motion vector CNNs. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2718–2726 (2016)
50. Diba, A., et al.: Spatio-temporal channel correlation networks for action classification. In: Computer Vision—ECCV 2018, vol. 11208, pp. 299–315 (2018)
51. Jiang, B., Wang, M., Gan, W., Wu, W.: STM: spatio-temporal and motion encoding for action recognition. In: ICCV (2019)
Metadata
Title
Improved SSD using deep multi-scale attention spatial–temporal features for action recognition
Authors
Shuren Zhou
Jia Qiu
Arun Solanki
Publication date
14-07-2021
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 6/2022
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-021-00831-4
