Published in: International Journal of Computer Vision, Issue 1/2020

28 August 2019

Temporal Action Detection with Structured Segment Networks

Authors: Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, Dahua Lin



Abstract

This paper addresses an important and challenging task: detecting the temporal intervals of actions in untrimmed videos. Specifically, we present a framework called structured segment network (SSN), built on temporal proposals of actions. SSN models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, for classifying actions and determining completeness respectively. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, leading to both accurate recognition and precise localization. These components are integrated into a unified network that can be efficiently trained end-to-end. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping, is devised to generate high-quality action proposals. We further study the importance of the decomposed discriminative model and discover a way to achieve similar accuracy with a single classifier, which is also complementary to the original SSN design. On two challenging benchmarks, THUMOS'14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
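To make the two core ideas of the abstract concrete, here is a minimal NumPy sketch. It is an illustrative simplification under our own assumptions, not the paper's exact implementation: function names, the extension ratio, and the pyramid granularity are hypothetical. A proposal is augmented with flanking starting and ending stages, per-snippet features are pooled stage by stage into a structured representation, and the final detection score decomposes into an activity probability times a class-wise completeness probability.

```python
import numpy as np

def stage_pooled_feature(features, start, end, extension=0.5):
    """Sketch of structured temporal pyramid pooling (simplified).
    `features` is a (T, D) array of per-snippet features; the proposal
    [start, end) is extended on both sides by `extension` * length to
    form starting and ending stages around the course stage."""
    T, D = features.shape
    length = end - start
    ext = max(1, int(round(extension * length)))
    s0, e0 = max(0, start - ext), start          # starting stage
    s2, e2 = end, min(T, end + ext)              # ending stage

    def pool(s, e):
        # Average-pool snippets in [s, e); empty stages give zeros.
        return features[s:e].mean(axis=0) if e > s else np.zeros(D)

    # Course stage uses a two-level temporal pyramid: whole + two halves.
    mid = start + length // 2
    course = np.concatenate([pool(start, end), pool(start, mid), pool(mid, end)])
    return np.concatenate([pool(s0, e0), course, pool(s2, e2)])

def detection_score(activity_logits, completeness_logits):
    """Decomposed scoring: the confidence of a proposal is the product of
    the activity probability (softmax over K classes + background) and a
    per-class completeness probability (sigmoid)."""
    p_act = np.exp(activity_logits) / np.exp(activity_logits).sum()
    p_comp = 1.0 / (1.0 + np.exp(-completeness_logits))
    return p_act[1:] * p_comp  # drop the background entry of the activity scores
```

The product form captures why the decomposition helps: a proposal is kept only if it both looks like the action *and* covers it completely, which is what suppresses incomplete proposals that a single classifier tends to accept.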


Footnotes
4. To be specific, the 10 classes are: “Clean and Jerk”, “Hammer Throw”, “High Jump”, “Javelin Throw”, “Long Jump”, “Pole Vault”, “Shotput”, “Tennis Swing”, “Throw Discus”, “Volleyball Spiking”. Note that THUMOS'14 has two classes named “Cricket Bowling” and “Cricket Shot”, while ActivityNet v1.2 also has one called “Cricket”. However, we categorize these two classes into the unseen part, since the single label in ActivityNet cannot distinguish these two specific actions.
 
Literatur
Zurück zum Zitat Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: People detection and articulated pose estimation. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1014–1021). IEEE. Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: People detection and articulated pose estimation. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1014–1021). IEEE.
Zurück zum Zitat Buch, S., Escorcia, V., Ghanem, B., Fei-Fei, L., & Niebles, J. C. (2017a). End-to-end, single-stream temporal action detection in untrimmed videos. In The British machine vision conference (BMVC) (Vol. 2, p. 7). Buch, S., Escorcia, V., Ghanem, B., Fei-Fei, L., & Niebles, J. C. (2017a). End-to-end, single-stream temporal action detection in untrimmed videos. In The British machine vision conference (BMVC) (Vol. 2, p. 7).
Zurück zum Zitat Buch, S., Escorcia, V., Shen, C., Ghanem, B., & Niebles, J. C. (2017b). SST: Single-stream temporal action proposals. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6373–6382). IEEE. Buch, S., Escorcia, V., Shen, C., Ghanem, B., & Niebles, J. C. (2017b). SST: Single-stream temporal action proposals. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6373–6382). IEEE.
Zurück zum Zitat Caba Heilbron, F., Escorcia, V., Ghanem, B., & Niebles, J. C. (2015). Activitynet: A large-scale video benchmark for human activity understanding. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 961–970). Caba Heilbron, F., Escorcia, V., Ghanem, B., & Niebles, J. C. (2015). Activitynet: A large-scale video benchmark for human activity understanding. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 961–970).
Zurück zum Zitat Caba Heilbron, F., Niebles, J. C., & Ghanem, B. (2016). Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1914–1923). Caba Heilbron, F., Niebles, J. C., & Ghanem, B. (2016). Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1914–1923).
Zurück zum Zitat Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4724–4733). IEEE. Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4724–4733). IEEE.
Zurück zum Zitat Chao, Y. W., Vijayanarasimhan, S., Seybold, B., Ross, D. A., Deng, J., & Sukthankar, R. (2018). Rethinking the faster R-CNN architecture for temporal action localization. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1130–1139). Chao, Y. W., Vijayanarasimhan, S., Seybold, B., Ross, D. A., Deng, J., & Sukthankar, R. (2018). Rethinking the faster R-CNN architecture for temporal action localization. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1130–1139).
Zurück zum Zitat Dai, X., Singh, B., Zhang, G., Davis, L. S., & Chen, Y. Q. (2017). Temporal context network for activity localization in videos. In The IEEE international conference on computer vision (ICCV) (pp. 5727–5736). Dai, X., Singh, B., Zhang, G., Davis, L. S., & Chen, Y. Q. (2017). Temporal context network for activity localization in videos. In The IEEE international conference on computer vision (ICCV) (pp. 5727–5736).
Zurück zum Zitat De Geest, R., Gavves, E., Ghodrati, A., Li, Z., Snoek, C., & Tuytelaars, T. (2016). Online action detection. In European conference on computer vision (ECCV) (pp. 269–284). Springer. De Geest, R., Gavves, E., Ghodrati, A., Li, Z., Snoek, C., & Tuytelaars, T. (2016). Online action detection. In European conference on computer vision (ECCV) (pp. 269–284). Springer.
Zurück zum Zitat Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Li, F. (2009). ImageNet: A large-scale hierarchical image database. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 248–255). Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Li, F. (2009). ImageNet: A large-scale hierarchical image database. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 248–255).
Zurück zum Zitat Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2625–2634). Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2625–2634).
Zurück zum Zitat Escorcia, V., Caba Heilbron, F., Niebles, J. C., & Ghanem, B. (2016). Daps: Deep action proposals for action understanding”. In European conference on computer vision (ECCV) (pp. 768–784). Escorcia, V., Caba Heilbron, F., Niebles, J. C., & Ghanem, B. (2016). Daps: Deep action proposals for action understanding”. In European conference on computer vision (ECCV) (pp. 768–784).
Zurück zum Zitat Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1), 98–136.CrossRef Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1), 98–136.CrossRef
Zurück zum Zitat Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.CrossRef Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.CrossRef
Zurück zum Zitat Fernando, B., Gavves, E., Jo, M., Ghodrati, A., & Tuytelaars, T. (2015). Modeling video evolution for action recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5378–5387). Fernando, B., Gavves, E., Jo, M., Ghodrati, A., & Tuytelaars, T. (2015). Modeling video evolution for action recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5378–5387).
Zurück zum Zitat Gaidon, A., Harchaoui, Z., & Schmid, C. (2013). Temporal localization of actions with actoms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11), 2782–2795.CrossRef Gaidon, A., Harchaoui, Z., & Schmid, C. (2013). Temporal localization of actions with actoms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11), 2782–2795.CrossRef
Zurück zum Zitat Gao, J., Chen, K., & Nevatia, R. (2018). Ctap: Complementary temporal action proposal generation. In The European conference on computer vision (ECCV) (pp. 68–83). Gao, J., Chen, K., & Nevatia, R. (2018). Ctap: Complementary temporal action proposal generation. In The European conference on computer vision (ECCV) (pp. 68–83).
Zurück zum Zitat Gao, J., Yang, Z., & Nevatia, R. (2017). Cascaded boundary regression for temporal action detection. In The British machine vision conference (BMVC). Gao, J., Yang, Z., & Nevatia, R. (2017). Cascaded boundary regression for temporal action detection. In The British machine vision conference (BMVC).
Zurück zum Zitat Girshick, R. (2015). Fast R-CNN. In The IEEE international conference on computer vision (ICCV) (pp. 1440–1448). Girshick, R. (2015). Fast R-CNN. In The IEEE international conference on computer vision (ICCV) (pp. 1440–1448).
Zurück zum Zitat Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 580–587). Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 580–587).
Zurück zum Zitat Gkioxari, G., & Malik, J. (2015). Finding action tubes. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 759–768). Gkioxari, G., & Malik, J. (2015). Finding action tubes. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 759–768).
Zurück zum Zitat Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D. A., Toderici, G., Li, Y., Ricco, S., Sukthankar, R., Schmid, C., et al. (2018). AVA: A video dataset of spatio-temporally localized atomic visual actions. In The IEEE conference on computer vision and pattern recognition (CVPR). Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D. A., Toderici, G., Li, Y., Ricco, S., Sukthankar, R., Schmid, C., et al. (2018). AVA: A video dataset of spatio-temporally localized atomic visual actions. In The IEEE conference on computer vision and pattern recognition (CVPR).
Zurück zum Zitat He, K., Zhang, X., Ren, S., & Sun, J. (2014), Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision (ECCV) (pp. 346–361). Springer. He, K., Zhang, X., Ren, S., & Sun, J. (2014), Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision (ECCV) (pp. 346–361). Springer.
Zurück zum Zitat He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778). He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778).
Zurück zum Zitat Hoai, M., Lan, Z. Z., & De la Torre, F. (2011). Joint segmentation and classification of human actions in video. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3265–3272). IEEE. Hoai, M., Lan, Z. Z., & De la Torre, F. (2011). Joint segmentation and classification of human actions in video. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3265–3272). IEEE.
Zurück zum Zitat Hoiem, D., Efros, A. A., & Hebert, M. (2008). Putting objects in perspective. International Journal of Computer Vision (IJCV), 80(1), 3–15.CrossRef Hoiem, D., Efros, A. A., & Hebert, M. (2008). Putting objects in perspective. International Journal of Computer Vision (IJCV), 80(1), 3–15.CrossRef
Zurück zum Zitat Hosang, J., Benenson, R., Dollár, P., & Schiele, B. (2016). What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4), 814–830.CrossRef Hosang, J., Benenson, R., Dollár, P., & Schiele, B. (2016). What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4), 814–830.CrossRef
Zurück zum Zitat Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML) (pp. 448–456). Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML) (pp. 448–456).
Zurück zum Zitat Jain, M., van Gemert, J. C., Jégou, H., Bouthemy, P., & Snoek, C. G. M. (2014). Action localization by tubelets from motion. In The IEEE conference on computer vision and pattern recognition (CVPR). Jain, M., van Gemert, J. C., Jégou, H., Bouthemy, P., & Snoek, C. G. M. (2014). Action localization by tubelets from motion. In The IEEE conference on computer vision and pattern recognition (CVPR).
Zurück zum Zitat Jiang, Y. G., Liu, J., Roshan Zamir, A., Toderici, G., Laptev, I., Shah, M., & Sukthankar, R. (2014). THUMOS challenge: Action recognition with a large number of classes. Retrieved April 7, 2019 from http://crcv.ucf.edu/THUMOS14/. Jiang, Y. G., Liu, J., Roshan Zamir, A., Toderici, G., Laptev, I., Shah, M., & Sukthankar, R. (2014). THUMOS challenge: Action recognition with a large number of classes. Retrieved April 7, 2019 from http://​crcv.​ucf.​edu/​THUMOS14/​.
Zurück zum Zitat Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1725–1732). Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1725–1732).
Zurück zum Zitat Lafferty, J., McCallum, A., Pereira, F., et al. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning (ICML), 1, 282–289. Lafferty, J., McCallum, A., Pereira, F., et al. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning (ICML), 1, 282–289.
Zurück zum Zitat Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision (IJCV), 64(2–3), 107–123. CrossRef Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision (IJCV), 64(2–3), 107–123. CrossRef
Zurück zum Zitat Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In The IEEE conference on computer vision and pattern recognition (CVPR) (Vol. 2, pp. 2169–2178). IEEE. Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In The IEEE conference on computer vision and pattern recognition (CVPR) (Vol. 2, pp. 2169–2178). IEEE.
Zurück zum Zitat Li, X., & Loy, C. C. (2018). Video object segmentation with joint re-identification and attention-aware mask propagation. In The European conference on computer vision (ECCV) (pp. 90–105). Li, X., & Loy, C. C. (2018). Video object segmentation with joint re-identification and attention-aware mask propagation. In The European conference on computer vision (ECCV) (pp. 90–105).
Zurück zum Zitat Li, Y., He, K., Sun, J., et al. (2016). R-FCN: Object detection via region-based fully convolutional networks. In Neural information processing systems (NIPS) (pp. 379–387). Li, Y., He, K., Sun, J., et al. (2016). R-FCN: Object detection via region-based fully convolutional networks. In Neural information processing systems (NIPS) (pp. 379–387).
Zurück zum Zitat Lin, T., Zhao, X., & Shou, Z. (2017). Single shot temporal action detection. In Proceedings of the 25th ACM international conference on Multimedia (pp. 988–996). ACM. Lin, T., Zhao, X., & Shou, Z. (2017). Single shot temporal action detection. In Proceedings of the 25th ACM international conference on Multimedia (pp. 988–996). ACM.
Zurück zum Zitat Lin, T., Zhao, X., Su, H., Wang, C., & Yang, M. (2018). BSN: Boundary sensitive network for temporal action proposal generation. In The European conference on computer vision (ECCV) (pp. 3–19). Lin, T., Zhao, X., Su, H., Wang, C., & Yang, M. (2018). BSN: Boundary sensitive network for temporal action proposal generation. In The European conference on computer vision (ECCV) (pp. 3–19).
Zurück zum Zitat Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European conference on computer vision (ECCV) (pp. 21–37). Springer. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European conference on computer vision (ECCV) (pp. 21–37). Springer.
Zurück zum Zitat Mettes, P., van Gemert, J. C., & Snoek, C. G. (2016). Spot on: Action localization from pointly-supervised proposals. In European conference on computer vision (ECCV) (pp. 437–453). Springer. Mettes, P., van Gemert, J. C., & Snoek, C. G. (2016). Spot on: Action localization from pointly-supervised proposals. In European conference on computer vision (ECCV) (pp. 437–453). Springer.
Zurück zum Zitat Mettes, P., van Gemert, J. C., Cappallo, S., Mensink, T., & Snoek, C. G. (2015). Bag-of-fragments: Selecting and encoding video fragments for event detection and recounting. In ACM international conference on multimedia retrieval (ICMR) (pp. 427–434). Mettes, P., van Gemert, J. C., Cappallo, S., Mensink, T., & Snoek, C. G. (2015). Bag-of-fragments: Selecting and encoding video fragments for event detection and recounting. In ACM international conference on multimedia retrieval (ICMR) (pp. 427–434).
Zurück zum Zitat Montes, A., Salvador, A., Pascual, S., & Giro-i Nieto, X. (2016). Temporal activity detection in untrimmed videos with recurrent neural networks. In NIPS workshop on large scale computer vision systems. Montes, A., Salvador, A., Pascual, S., & Giro-i Nieto, X. (2016). Temporal activity detection in untrimmed videos with recurrent neural networks. In NIPS workshop on large scale computer vision systems.
Zurück zum Zitat Ng, J. Y. H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4694–4702). Ng, J. Y. H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4694–4702).
Zurück zum Zitat Nguyen, P., Liu, T., Prasad, G., & Han, B. (2018) Weakly supervised action localization by sparse temporal pooling network. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6752–6761). Nguyen, P., Liu, T., Prasad, G., & Han, B. (2018) Weakly supervised action localization by sparse temporal pooling network. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6752–6761).
Zurück zum Zitat Niebles, J. C., Chen, C. W., & Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In European conference on computer vision (ECCV) (pp. 392–405). Springer. Niebles, J. C., Chen, C. W., & Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In European conference on computer vision (ECCV) (pp. 392–405). Springer.
Zurück zum Zitat Oneata, D., Verbeek, J., & Schmid, C. (2013). Action and event recognition with fisher vectors on a compact feature set. In The IEEE international conference on computer vision (ICCV) (pp. 1817–1824). Oneata, D., Verbeek, J., & Schmid, C. (2013). Action and event recognition with fisher vectors on a compact feature set. In The IEEE international conference on computer vision (ICCV) (pp. 1817–1824).
Zurück zum Zitat Oneata, D., Verbeek, J., & Schmid, C. (2014). The lear submission at thumos 2014. In THUMOS action recognition challenge. Oneata, D., Verbeek, J., & Schmid, C. (2014). The lear submission at thumos 2014. In THUMOS action recognition challenge.
Zurück zum Zitat Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.MathSciNetMATH Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.MathSciNetMATH
Zurück zum Zitat Peng, X., & Schmid, C. (2016). Multi-region two-stream R-CNN for action detection. In European conference on computer vision (ECCV). Springer. Peng, X., & Schmid, C. (2016). Multi-region two-stream R-CNN for action detection. In European conference on computer vision (ECCV). Springer.
Zurück zum Zitat Pirsiavash, H., & Ramanan, D. (2014). Parsing videos of actions with segmental grammars. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 612–619). Pirsiavash, H., & Ramanan, D. (2014). Parsing videos of actions with segmental grammars. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 612–619).
Zurück zum Zitat Pont-Tuset, J., Arbelaez, P., Barron, J. T., Marques, F., & Malik, J. (2017). Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 128–140.CrossRef Pont-Tuset, J., Arbelaez, P., Barron, J. T., Marques, F., & Malik, J. (2017). Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 128–140.CrossRef
Zurück zum Zitat Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural information processing systems (NIPS) (pp. 91–99). Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural information processing systems (NIPS) (pp. 91–99).
Zurück zum Zitat Richard, A., & Gall, J. (2016). Temporal action detection using a statistical language model. In The IEEE conference on computer vision and pattern recognition (CVPR)( pp. 3131–3140). Richard, A., & Gall, J. (2016). Temporal action detection using a statistical language model. In The IEEE conference on computer vision and pattern recognition (CVPR)( pp. 3131–3140).
Zurück zum Zitat Roerdink, J. B., & Meijster, A. (2000). The watershed transform: Definitions, algorithms and parallelization strategies. Fundamenta Informaticae, 41(1,2), 187–228.MathSciNetCrossRef Roerdink, J. B., & Meijster, A. (2000). The watershed transform: Definitions, algorithms and parallelization strategies. Fundamenta Informaticae, 41(1,2), 187–228.MathSciNetCrossRef
Zurück zum Zitat Schindler, K., & Van Gool, L. (2008). Action snippets: How many frames does human action recognition require? In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–8). IEEE. Schindler, K., & Van Gool, L. (2008). Action snippets: How many frames does human action recognition require? In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–8). IEEE.
Zurück zum Zitat Shou, Z., Chan, J., Zareian, A., Miyazawa, K., & Chang, S. F. (2017). CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1417–1426). Shou, Z., Chan, J., Zareian, A., Miyazawa, K., & Chang, S. F. (2017). CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1417–1426).
Zurück zum Zitat Shou, Z., Gao, H., Zhang, L., Miyazawa, K., & Chang, S. F. (2018). AutoLoc: Weakly-supervised temporal action localization in untrimmed videos. In European conference on computer vision (ECCV) (pp. 154–171). Shou, Z., Gao, H., Zhang, L., Miyazawa, K., & Chang, S. F. (2018). AutoLoc: Weakly-supervised temporal action localization in untrimmed videos. In European conference on computer vision (ECCV) (pp. 154–171).
Zurück zum Zitat Shou, Z., Wang, D., & Chang, S. F. (2016). Temporal action localization in untrimmed videos via multi-stage CNNs. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1049–1058). Shou, Z., Wang, D., & Chang, S. F. (2016). Temporal action localization in untrimmed videos via multi-stage CNNs. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1049–1058).
Zurück zum Zitat Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 761–769). Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 761–769).
Zurück zum Zitat Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Neural information processing systems (NIPS) (pp. 568–576). Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Neural information processing systems (NIPS) (pp. 568–576).
Zurück zum Zitat Singh, G., & Cuzzolin, F. (2016). Untrimmed video classification for activity detection: Submission to activitynet challenge. CoRR abs/1607.01979 Singh, G., & Cuzzolin, F. (2016). Untrimmed video classification for activity detection: Submission to activitynet challenge. CoRR abs/1607.01979
Zurück zum Zitat Singh, B., Marks, T. K., Jones, M., Tuzel, O., & Shao, M. (2016). A multi-stream bi-directional recurrent neural network for fine-grained action detection. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1961–1970). Singh, B., Marks, T. K., Jones, M., Tuzel, O., & Shao, M. (2016). A multi-stream bi-directional recurrent neural network for fine-grained action detection. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1961–1970).
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2818–2826).
Tang, K., Yao, B., Fei-Fei, L., & Koller, D. (2013). Combining the right features for complex event recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2696–2703).
Tran, D., Bourdev, L. D., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In The IEEE international conference on computer vision (ICCV) (pp. 4489–4497).
Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011). Segmentation as selective search for object recognition. In The IEEE international conference on computer vision (ICCV) (pp. 1879–1886).
Van Gemert, J. C., Jain, M., Gati, E., Snoek, C. G., et al. (2015). APT: Action localization proposals from dense trajectories. In The British machine vision conference (BMVC) (Vol. 2, p. 4).
Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In The IEEE international conference on computer vision (ICCV) (pp. 3551–3558).
Wang, R., & Tao, D. (2016). UTS at ActivityNet 2016. In ActivityNet large scale activity recognition challenge 2016.
Wang, L., Qiao, Y., & Tang, X. (2014a). Action recognition and detection by combining motion and appearance features. In THUMOS action recognition challenge.
Wang, L., Qiao, Y., & Tang, X. (2014b). Latent hierarchical model of temporal structure for complex activity classification. IEEE Transactions on Image Processing, 23(2), 810–822.
Wang, L., Qiao, Y., & Tang, X. (2015). Action recognition with trajectory-pooled deep-convolutional descriptors. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4305–4314).
Wang, L., Qiao, Y., Tang, X., & Van Gool, L. (2016a). Actionness estimation using hybrid fully convolutional networks. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2708–2717).
Wang, L., Xiong, Y., Lin, D., & Van Gool, L. (2017). Untrimmednets for weakly supervised action recognition and detection. In The IEEE conference on computer vision and pattern recognition (CVPR).
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016b). Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision (ECCV) (pp. 20–36).
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., et al. (2018). Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Wang, P., Cao, Y., Shen, C., Liu, L., & Shen, H. T. (2016c). Temporal pyramid pooling based convolutional neural network for action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 27, 2613–2622.
Weinzaepfel, P., Harchaoui, Z., & Schmid, C. (2015). Learning to track for spatio-temporal action localization. In The IEEE international conference on computer vision (ICCV) (pp. 3164–3172).
Xu, H., Das, A., & Saenko, K. (2017). R-C3D: Region convolutional 3D network for temporal activity detection. In The IEEE international conference on computer vision (ICCV) (Vol. 6, p. 8).
Yeung, S., Russakovsky, O., Mori, G., & Fei-Fei, L. (2016). End-to-end learning of action detection from frame glimpses in videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2678–2687).
Yuan, J., Ni, B., Yang, X., & Kassim, A. A. (2016). Temporal action localization with pyramid of score distribution features. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3093–3102).
Zach, C., Pock, T., & Bischof, H. (2007). A duality based approach for realtime TV-\(L^1\) optical flow. In 29th DAGM symposium on pattern recognition (pp. 214–223).
Zhang, D., Dai, X., Wang, X., & Wang, Y. F. (2018). \(\rm S^3D\): Single shot multi-span detector via fully 3D convolutional network. In The British machine vision conference (BMVC).
Zhang, B., Wang, L., Wang, Z., Qiao, Y., & Wang, H. (2016). Real-time action recognition with enhanced motion vector CNNs. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2718–2726).
Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., & Lin, D. (2017a). Temporal action detection with structured segment networks. In The IEEE international conference on computer vision (ICCV) (pp. 2914–2923).
Zhao, Y., Zhang, B., Wu, Z., Yang, S., Zhou, L., Yan, S., Wang, L., Xiong, Y., Lin, D., & Qiao, Y. (2017b). CUHK & ETHZ & SIAT submission to ActivityNet Challenge 2017. arXiv:1710.08011.
Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In European conference on computer vision (ECCV) (pp. 391–405).
Metadata
Title
Temporal Action Detection with Structured Segment Networks
Authors
Yue Zhao
Yuanjun Xiong
Limin Wang
Zhirong Wu
Xiaoou Tang
Dahua Lin
Publication date
28.08.2019
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 1/2020
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-019-01211-2