Published in: Multimedia Systems 3/2021

22.01.2021 | Regular Paper

Video-based driver action recognition via hybrid spatial–temporal deep learning framework

Authors: Yaocong Hu, Mingqi Lu, Chao Xie, Xiaobo Lu


Abstract

Driver action recognition aims to distinguish normal driving from abnormal driver actions such as taking the hands off the wheel, talking on the phone, and smoking while driving. For the purpose of traffic safety, research on computer vision techniques for driver action recognition is especially valuable. However, the problem is far from solved, mainly because of the subtle variations between different driver action classes. In this paper, we present a new video-based driver action recognition approach built on a hybrid spatial–temporal deep learning framework. Specifically, we first design an encoder–decoder spatial–temporal convolutional neural network (EDSTCNN) that captures short-term spatial–temporal representations of driver actions jointly with optical flow prediction. Second, we exploit a feature refinement network (FRN) to refine the short-term driver action features. A convolutional long short-term memory network (ConvLSTM) is then employed for long-term spatial–temporal fusion, and a fully connected neural network (FCNN) produces the final driver action classification. In our experiments, we validate the proposed framework on our self-created datasets, comprising a simulated driving dataset and a real driving dataset. Extensive experimental results show that the proposed hybrid spatial–temporal deep learning framework achieves the highest accuracy on multiple driver action recognition datasets (98.9% on the SEU-DAR-V1 dataset and 97.0% on the SEU-DAR-V2 dataset).
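The four-stage pipeline described in the abstract (EDSTCNN → FRN → ConvLSTM → FCNN) can be sketched as a shape-level data flow. The stand-in computations below (temporal averaging, a residual tweak, mean fusion, a fixed linear head) are illustrative placeholders, not the authors' learned networks; all function names, dimensions, and the clip length are assumptions.

```python
import numpy as np

def edstcnn(clip):
    """Stand-in for the encoder-decoder spatial-temporal CNN:
    collapse a short clip (T, H, W, C) into one downsampled
    short-term feature map (H/4, W/4, C)."""
    return clip.mean(axis=0)[::4, ::4, :]

def frn(feat):
    """Stand-in for the feature refinement network: a small
    residual correction on top of the short-term feature."""
    return feat + 0.1 * np.tanh(feat)

def convlstm_fuse(feats):
    """Stand-in for ConvLSTM long-term fusion: aggregate the
    per-clip features over time while keeping the spatial layout."""
    return np.mean(np.stack(feats, axis=0), axis=0)

def fcnn(feat, n_classes):
    """Stand-in for the fully connected classifier head."""
    logits = feat.reshape(n_classes, -1).sum(axis=1)  # fixed "weights"
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax over driver action classes

def pipeline(video, clip_len=4, n_classes=6):
    # Split the video (T, H, W, C) into short clips, extract and
    # refine short-term features per clip, fuse long-term, classify.
    clips = [video[i:i + clip_len]
             for i in range(0, len(video) - clip_len + 1, clip_len)]
    feats = [frn(edstcnn(c)) for c in clips]
    return fcnn(convlstm_fuse(feats), n_classes)

probs = pipeline(np.random.rand(16, 48, 48, 3))
print(probs.shape)  # (6,)
```

The point of the sketch is only the division of labor: short-term spatial–temporal encoding per clip, feature refinement, long-term temporal fusion, and a classification head, matching the order in which the abstract introduces the four modules.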


Metadata
Title
Video-based driver action recognition via hybrid spatial–temporal deep learning framework
Authors
Yaocong Hu
Mingqi Lu
Chao Xie
Xiaobo Lu
Publication date
22.01.2021
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 3/2021
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-020-00724-y
