Multimedia Systems

14.07.2021 | Special Issue Paper

Improved SSD using deep multi-scale attention spatial–temporal features for action recognition

Authors: Shuren Zhou, Jia Qiu, Arun Solanki

Published in: Multimedia Systems | Issue 6/2022

Abstract

The biggest difference between video-based and image-based action recognition is that video has an additional time dimension. Most deep-learning methods for action recognition take one of two approaches: (1) using 3D convolution to model temporal features; or (2) introducing an auxiliary temporal feature, such as optical flow. However, 3D convolutional networks usually consume huge computational resources. Extracting optical flow requires an extra, tedious processing step and additional storage space, and it usually models only short-range temporal features. To construct temporal features better, in this paper we propose a multi-scale attention spatial–temporal feature network based on SSD. The network divides the whole video sequence into segments and sparsely samples frames over its full range, then uses a self-attention mechanism to capture the relation between each frame and the sequence of frames sampled across the entire video, so that the network attends to the representative frames in the sequence. Moreover, an attention mechanism assigns different weights to the inter-frame relations representing different time scales, so as to reason about the contextual relations of actions in the time dimension. Our proposed method achieves competitive performance on two commonly used datasets: UCF101 and HMDB51.
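The three ideas named in the abstract (segment-wise sparse sampling, self-attention across the sampled frames, and attention weights over time scales) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the choice of one frame per equal-length segment, the plain scaled dot-product attention without learned projections, and the softmax fusion over scales are all assumptions made for illustration, and the per-frame feature vectors are random stand-ins for what a backbone network would produce.

```python
import numpy as np


def sparse_sample_indices(num_frames, num_segments, rng=None):
    """Split [0, num_frames) into equal segments and pick one frame per segment.

    Hypothetical helper: with rng=None the segment centre is chosen
    (deterministic); otherwise a random frame inside each segment is chosen.
    """
    edges = np.linspace(0, num_frames, num_segments + 1).astype(int)
    indices = []
    for start, end in zip(edges[:-1], edges[1:]):
        end = max(end, start + 1)  # guard against empty segments in short videos
        if rng is None:
            indices.append((start + end - 1) // 2)
        else:
            indices.append(int(rng.integers(start, end)))
    return np.array(indices)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frame_self_attention(features):
    """Scaled dot-product self-attention over per-frame features of shape (T, D).

    Assumption: queries, keys, and values are the features themselves;
    a trained model would normally apply learned projections first.
    """
    d = features.shape[1]
    scores = features @ features.T / np.sqrt(d)  # (T, T) frame-to-frame relations
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ features, weights


def fuse_time_scales(scale_features, scale_logits):
    """Weighted sum of per-scale relation features (S, D) with softmax weights."""
    w = softmax(np.asarray(scale_logits, dtype=float))
    return w @ scale_features, w


if __name__ == "__main__":
    idx = sparse_sample_indices(num_frames=100, num_segments=4)
    print(idx)  # [12 37 62 87]

    rng = np.random.default_rng(0)
    frame_features = rng.standard_normal((len(idx), 8))  # stand-in backbone output
    attended, attn = frame_self_attention(frame_features)
    print(attended.shape, attn.sum(axis=1))
```

In this sketch the attention weight matrix plays the role of "noticing representative frames": a frame that is similar to many other sampled frames receives larger weights in their rows. The scale fusion step only shows the weighting mechanics; how the per-scale relation features are built is not specified in the abstract and is left out here.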

Title: Improved SSD using deep multi-scale attention spatial–temporal features for action recognition
Authors: Shuren Zhou
Jia Qiu
Arun Solanki
Publication date: 14.07.2021
Publisher: Springer Berlin Heidelberg
Published in: Multimedia Systems / Issue 6/2022
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI: https://doi.org/10.1007/s00530-021-00831-4
