
2024 | OriginalPaper | Chapter

Action Segmentation Based on Encoder-Decoder and Global Timing Information

Authors : Yichao Liu, Yiyang Sun, Zhide Chen, Chen Feng, Kexin Zhu

Published in: Parallel and Distributed Computing, Applications and Technologies

Publisher: Springer Nature Singapore


Abstract

Action segmentation has made significant progress, but segmenting and recognizing actions in untrimmed long videos remains a challenging problem. Most state-of-the-art (SOTA) methods build models on temporal convolutions; however, the difficulty of modeling long-range temporal dependencies and the inflexibility of temporal convolutions limit the potential of these models. To address the over-segmentation produced by existing action segmentation algorithms, which causes prediction errors and degrades segmentation quality, this paper proposes an action segmentation algorithm based on an Encoder-Decoder structure and global timing information. The proposed algorithm uses the global timing information captured by an LSTM to help the Encoder-Decoder structure locate action segmentation points more accurately, while at the same time suppressing the over-segmentation that the Encoder-Decoder structure tends to produce. The algorithm achieves 93% frame-level accuracy on a real Taiji (Tai Chi) action dataset constructed for this work. The experimental results show that the model can complete the long-video action segmentation task accurately and efficiently.
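The pipeline the abstract describes can be sketched at a high level: an encoder compresses per-frame features along the temporal axis, a recurrent pass over the encoded sequence supplies global timing context, and a decoder upsamples the fused representation back to per-frame class probabilities. The toy sketch below illustrates only this data flow; all layer sizes, the additive fusion, and the simplified recurrence (a single tanh cell standing in for a full LSTM) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w):
    """Toy temporal encoder: linear projection + ReLU, then
    stride-2 average pooling to halve the temporal length."""
    h = np.maximum(x @ w, 0.0)                   # (T, d) -> (T, hdim)
    return h.reshape(-1, 2, h.shape[1]).mean(1)  # (T, hdim) -> (T/2, hdim)

def global_context(h, wx, wh):
    """Simplified recurrence standing in for the LSTM: carries a
    running state across the whole sequence, so each step sees
    global timing information accumulated so far."""
    state = np.zeros(wh.shape[0])
    outs = []
    for t in range(h.shape[0]):
        state = np.tanh(h[t] @ wx + state @ wh)
        outs.append(state)
    return np.stack(outs)                        # (T/2, hdim)

def decoder(h, ctx, w_out):
    """Fuse encoder features with global context (additive fusion,
    an assumption), upsample to T frames, and classify each frame."""
    fused = h + ctx
    up = np.repeat(fused, 2, axis=0)             # (T/2, hdim) -> (T, hdim)
    logits = up @ w_out                          # (T, n_classes)
    e = np.exp(logits - logits.max(1, keepdims=True))
    return e / e.sum(1, keepdims=True)           # per-frame probabilities

# Illustrative sizes: 8 frames, 16-dim features, 4 action classes.
T, d, hdim, n_classes = 8, 16, 8, 4
x = rng.standard_normal((T, d))
w_in = rng.standard_normal((d, hdim)) * 0.1
wx = rng.standard_normal((hdim, hdim)) * 0.1
wh = rng.standard_normal((hdim, hdim)) * 0.1
w_out = rng.standard_normal((hdim, n_classes)) * 0.1

h = encoder(x, w_in)
ctx = global_context(h, wx, wh)
probs = decoder(h, ctx, w_out)
print(probs.shape)  # (8, 4): one class distribution per frame
```

Because the recurrent state is threaded through the entire sequence before decoding, every frame's prediction can draw on context from far outside the receptive field of a local temporal convolution, which is the intuition behind using an LSTM to suppress over-segmentation.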


Metadata
Title
Action Segmentation Based on Encoder-Decoder and Global Timing Information
Authors
Yichao Liu
Yiyang Sun
Zhide Chen
Chen Feng
Kexin Zhu
Copyright Year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-99-8211-0_26
