24-09-2022

Action-Aware Network with Upper and Lower Limit Loss for Weakly-Supervised Temporal Action Localization

Authors: Mingwen Bi, Jiaqi Li, Xinliang Liu, Qingchuan Zhang, Zhenghong Yang

Published in: Neural Processing Letters


Abstract

Weakly-supervised temporal action localization aims to detect the temporal boundaries of action instances in untrimmed videos using only video-level action labels. The central challenge is to accurately separate actions from the background in the absence of frame-level labels. Previous methods treat the action-related context in the background as the main factor limiting segmentation performance. Most of them take action labels as pseudo-labels for context and use an attention mechanism to suppress context frames in the class activation sequences. However, this only works for fixed shots or videos with a single theme. For videos with frequent scene switching and complicated themes, such as casual shots of unexpected events and covert recordings, the strong randomness and weak continuity of the action invalidate this assumption. In addition, incorrect pseudo-labels increase the weight of context frames, which degrades segmentation performance. To address these issues, we define a new video-frame division standard (action instance, action-related context, no-action background) and propose an Action-aware Network with an Upper and Lower limit Loss (AUL-Net), which constrains the activation of context to a reasonable range through a two-branch weight-sharing framework with a three-branch attention mechanism, giving the model wider applicability while accurately suppressing context and background. We conducted extensive experiments on the self-built food-safety video dataset FS-VA, and the results show that our method outperforms the state-of-the-art model.
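The abstract does not give implementation details, but the two core ideas can be illustrated concretely: a per-frame three-branch attention that splits each frame's weight among action, context, and background, and a penalty that keeps the total context activation inside a target interval. The sketch below is a minimal NumPy illustration under our own assumptions; all function names and the bounds `lower`/`upper` are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def three_branch_attention(logits):
    """Turn per-frame logits of shape (T, 3) into attention weights
    over (action, context, background) that sum to 1 per frame."""
    return softmax(logits, axis=-1)

def upper_lower_limit_loss(context_att, lower=0.1, upper=0.5):
    """Hinge-style penalty (hypothetical form): zero when each frame's
    context activation lies in [lower, upper], linear outside it."""
    below = np.maximum(lower - context_att, 0.0)
    above = np.maximum(context_att - upper, 0.0)
    return float(np.mean(below + above))

# Toy example: T = 6 frames, random branch logits.
rng = np.random.default_rng(0)
att = three_branch_attention(rng.normal(size=(6, 3)))
loss = upper_lower_limit_loss(att[:, 1])  # column 1 = context branch
```

In a full model, such a loss term would be added to the video-level classification loss so that the context branch can absorb context frames without its activation collapsing to zero or dominating the action branch.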
Metadata
Title
Action-Aware Network with Upper and Lower Limit Loss for Weakly-Supervised Temporal Action Localization
Authors
Mingwen Bi
Jiaqi Li
Xinliang Liu
Qingchuan Zhang
Zhenghong Yang
Publication date
24-09-2022
Publisher
Springer US
Published in
Neural Processing Letters
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-022-11042-x