Published in: Multimedia Systems 3/2021

22.01.2021 | Regular Paper

Video-based driver action recognition via hybrid spatial–temporal deep learning framework

Authors: Yaocong Hu, Mingqi Lu, Chao Xie, Xiaobo Lu


Abstract

Driver action recognition aims to distinguish normal driving from abnormal driver actions such as taking the hands off the wheel, talking on the phone, and smoking while driving. For the purpose of traffic safety, research on computer vision techniques for driver action recognition is especially valuable. However, the problem is far from solved, mainly because of the subtle variations between different driver action classes. In this paper, we present a new video-based driver action recognition approach built on a hybrid spatial–temporal deep learning framework. Specifically, we first design an encoder–decoder spatial–temporal convolutional neural network (EDSTCNN) that captures short-term spatial–temporal representations of driver actions jointly with optical flow prediction. Second, we exploit a feature refinement network (FRN) to refine the short-term driver action features. A convolutional long short-term memory network (ConvLSTM) is then employed for long-term spatial–temporal fusion, and a fully connected neural network (FCNN) produces the final driver action classification. In our experiments, we validate the proposed framework on our self-created datasets, comprising a simulated driving dataset and a real driving dataset. Extensive experimental results show that the proposed hybrid spatial–temporal deep learning framework achieves the highest accuracy on multiple driver action recognition datasets (98.9% on the SEU-DAR-V1 dataset and 97.0% on the SEU-DAR-V2 dataset).
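The four-stage pipeline described in the abstract (EDSTCNN → FRN → ConvLSTM → FCNN) can be sketched as a shape-level data flow. The stand-in computations below (temporal averaging, a residual tweak, mean fusion, a fixed linear head) are illustrative placeholders, not the authors' learned networks; all function names, dimensions, and the clip length are assumptions.

```python
import numpy as np

def edstcnn(clip):
    """Stand-in for the encoder-decoder spatial-temporal CNN:
    collapse a short clip (T, H, W, C) into one downsampled
    short-term feature map (H/4, W/4, C)."""
    return clip.mean(axis=0)[::4, ::4, :]

def frn(feat):
    """Stand-in for the feature refinement network: a small
    residual correction on top of the short-term feature."""
    return feat + 0.1 * np.tanh(feat)

def convlstm_fuse(feats):
    """Stand-in for ConvLSTM long-term fusion: aggregate the
    per-clip features over time while keeping the spatial layout."""
    return np.mean(np.stack(feats, axis=0), axis=0)

def fcnn(feat, n_classes):
    """Stand-in for the fully connected classifier head."""
    logits = feat.reshape(n_classes, -1).sum(axis=1)  # fixed "weights"
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax over driver action classes

def pipeline(video, clip_len=4, n_classes=6):
    # Split the video (T, H, W, C) into short clips, extract and
    # refine short-term features per clip, fuse long-term, classify.
    clips = [video[i:i + clip_len]
             for i in range(0, len(video) - clip_len + 1, clip_len)]
    feats = [frn(edstcnn(c)) for c in clips]
    return fcnn(convlstm_fuse(feats), n_classes)

probs = pipeline(np.random.rand(16, 48, 48, 3))
print(probs.shape)  # (6,)
```

The point of the sketch is only the division of labor: short-term spatial–temporal encoding per clip, feature refinement, long-term temporal fusion, and a classification head, matching the order in which the abstract introduces the four modules.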


Metadata
Title
Video-based driver action recognition via hybrid spatial–temporal deep learning framework
Authors
Yaocong Hu
Mingqi Lu
Chao Xie
Xiaobo Lu
Publication date
22.01.2021
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 3/2021
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-020-00724-y
