Published in: Pattern Analysis and Applications 3/2023

15.05.2023 | Theoretical Advances

TSRN: two-stage refinement network for temporal action segmentation

Authors: Xiaoyan Tian, Ye Jin, Xianglong Tang


Abstract

In high-level video semantic understanding, continuous action segmentation is a challenging task that aims to partition an untrimmed video over time and label each segment with one of a set of predefined action classes. However, the accuracy of segment predictions is limited by confusing information in video sequences, such as ambiguous frames near action boundaries and over-segmentation errors caused by missing semantic relations. In this work, we present a two-stage refinement network (TSRN) to improve temporal action segmentation. We first capture global relations over the entire video sequence using a multi-head self-attention mechanism in a novel transformer temporal convolutional network, and model temporal relations within each action segment. Then, we introduce a dual-attention spatial pyramid pooling network that fuses features from macroscale and microscale perspectives, refining the initial prediction into more accurate classification results. In addition, a joint loss function mitigates over-segmentation. Compared with state-of-the-art methods, the proposed TSRN substantially improves temporal action segmentation on three challenging datasets (i.e., 50Salads, Georgia Tech Egocentric Activities, and Breakfast).
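The first stage described above captures global relations across all frames with multi-head self-attention. Below is a minimal NumPy sketch of that mechanism applied to a sequence of per-frame features; the random projection weights, function names, and shapes are illustrative assumptions for exposition, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, n_heads, rng):
    """One multi-head self-attention layer over per-frame features.

    x: (T, D) array of frame features; n_heads must divide D.
    Random weights stand in for learned Q/K/V projections.
    """
    T, D = x.shape
    assert D % n_heads == 0
    d_h = D // n_heads
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Split into heads: (n_heads, T, d_h).
    split = lambda m: m.reshape(T, n_heads, d_h).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    # Scaled dot-product attention: every frame attends to every other
    # frame, which is how global relations over the video are captured.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_h), axis=-1)
    out = attn @ v                               # (n_heads, T, d_h)
    return out.transpose(1, 0, 2).reshape(T, D)  # concat heads -> (T, D)

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 64))  # 100 frames, 64-dim features
refined = multi_head_self_attention(frames, n_heads=4, rng=rng)
print(refined.shape)  # (100, 64)
```

Each output frame feature is a weighted mixture of all input frames, so ambiguous boundary frames can borrow context from temporally distant but semantically related frames.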

Metadata
Title
TSRN: two-stage refinement network for temporal action segmentation
Authors
Xiaoyan Tian
Ye Jin
Xianglong Tang
Publication date
15.05.2023
Publisher
Springer London
Published in
Pattern Analysis and Applications / Issue 3/2023
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-023-01166-8
