
24.09.2022

Action-Aware Network with Upper and Lower Limit Loss for Weakly-Supervised Temporal Action Localization

Authors: Mingwen Bi, Jiaqi Li, Xinliang Liu, Qingchuan Zhang, Zhenghong Yang

Published in: Neural Processing Letters


Abstract

Weakly-supervised temporal action localization aims to detect the temporal boundaries of action instances in untrimmed videos using only video-level action labels. The central challenge is to accurately separate actions from the background in the absence of frame-level labels. Previous methods treat the action-related context in the background as the main factor limiting segmentation performance: most take the action labels as pseudo-labels for the context and suppress context frames in the class activation sequences with an attention mechanism. However, this strategy only holds for fixed shots or videos with a single theme. For videos with frequent scene changes and complicated themes, such as casual footage of unexpected events or covert recordings, the strong randomness and weak continuity of the actions invalidate that assumption; moreover, incorrect pseudo-labels increase the weights of context frames and further degrade segmentation. To address these issues, we define a new three-way division of video frames (action instance, action-related context, action-free background) and propose an Action-aware network with an Upper and Lower limit loss (AUL-Net), which restricts the activation of context to a reasonable range through a two-branch weight-sharing framework with a three-branch attention mechanism, giving the model wider applicability while accurately suppressing context and background. Extensive experiments on our self-built food safety video dataset, FS-VA, show that our method outperforms state-of-the-art models.
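To make the mechanism described in the abstract concrete, below is a minimal PyTorch-style sketch of a three-branch attention head that scores each snippet as action, context, or background, together with an upper-and-lower-limit penalty that keeps the video-level context activation inside a fixed band. All names (ThreeBranchAttention, upper_lower_limit_loss), shapes, and the concrete bounds are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ThreeBranchAttention(nn.Module):
    """Scores every snippet as action / context / background.

    Input:  snippet features of shape (B, T, D)
    Output: branch weights of shape (B, T, 3), softmax-normalized
            over the three branches for each frame.
    """

    def __init__(self, dim: int):
        super().__init__()
        # One 1x1 temporal convolution producing three logits per snippet.
        self.score = nn.Conv1d(dim, 3, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        logits = self.score(feats.transpose(1, 2)).transpose(1, 2)  # (B, T, 3)
        return logits.softmax(dim=-1)


def upper_lower_limit_loss(context_att: torch.Tensor,
                           lower: float = 0.1,
                           upper: float = 0.4) -> torch.Tensor:
    """Penalizes the video-level mean context activation when it leaves
    [lower, upper]; the bounds here are illustrative hyperparameters."""
    mean_act = context_att.mean(dim=1)   # (B,) average activation over time
    over = F.relu(mean_act - upper)      # amount above the upper limit
    under = F.relu(lower - mean_act)     # amount below the lower limit
    return (over + under).mean()


# Toy usage: 2 videos, 750 snippets, 2048-d two-stream features.
att = ThreeBranchAttention(dim=2048)
weights = att(torch.randn(2, 750, 2048))        # (2, 750, 3)
loss = upper_lower_limit_loss(weights[..., 1])  # branch 1 = context
```

Bounding the context activation from below as well as above is what would distinguish such a term from a plain suppression loss: the context branch is pushed down, but never driven to zero, so it stays separable from the action-free background.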
Metadata
Title
Action-Aware Network with Upper and Lower Limit Loss for Weakly-Supervised Temporal Action Localization
Authors
Mingwen Bi
Jiaqi Li
Xinliang Liu
Qingchuan Zhang
Zhenghong Yang
Publication date
24.09.2022
Publisher
Springer US
Published in
Neural Processing Letters
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-022-11042-x