Published in: International Journal of Computer Vision, Issue 1/2020

28 August 2019

Temporal Action Detection with Structured Segment Networks

Authors: Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, Dahua Lin



Abstract

This paper addresses an important and challenging task: detecting the temporal intervals of actions in untrimmed videos. Specifically, we present a framework called structured segment network (SSN), built on temporal proposals of actions. SSN models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, for classifying actions and determining completeness respectively. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, leading to both accurate recognition and precise localization. These components are integrated into a unified network that can be efficiently trained end-to-end. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping, is devised to generate high-quality action proposals. We further study the importance of the decomposed discriminative model and discover a way to achieve similar accuracy with a single classifier, which is also complementary to the original SSN design. On two challenging benchmarks, THUMOS'14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
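To make the two core ideas of the abstract concrete, here is a minimal NumPy sketch. It is an illustrative simplification under our own assumptions, not the paper's exact implementation: function names, the extension ratio, and the pyramid granularity are hypothetical. A proposal is augmented with flanking starting and ending stages, per-snippet features are pooled stage by stage into a structured representation, and the final detection score decomposes into an activity probability times a class-wise completeness probability.

```python
import numpy as np

def stage_pooled_feature(features, start, end, extension=0.5):
    """Sketch of structured temporal pyramid pooling (simplified).
    `features` is a (T, D) array of per-snippet features; the proposal
    [start, end) is extended on both sides by `extension` * length to
    form starting and ending stages around the course stage."""
    T, D = features.shape
    length = end - start
    ext = max(1, int(round(extension * length)))
    s0, e0 = max(0, start - ext), start          # starting stage
    s2, e2 = end, min(T, end + ext)              # ending stage

    def pool(s, e):
        # Average-pool snippets in [s, e); empty stages give zeros.
        return features[s:e].mean(axis=0) if e > s else np.zeros(D)

    # Course stage uses a two-level temporal pyramid: whole + two halves.
    mid = start + length // 2
    course = np.concatenate([pool(start, end), pool(start, mid), pool(mid, end)])
    return np.concatenate([pool(s0, e0), course, pool(s2, e2)])

def detection_score(activity_logits, completeness_logits):
    """Decomposed scoring: the confidence of a proposal is the product of
    the activity probability (softmax over K classes + background) and a
    per-class completeness probability (sigmoid)."""
    p_act = np.exp(activity_logits) / np.exp(activity_logits).sum()
    p_comp = 1.0 / (1.0 + np.exp(-completeness_logits))
    return p_act[1:] * p_comp  # drop the background entry of the activity scores
```

The product form captures why the decomposition helps: a proposal is kept only if it both looks like the action *and* covers it completely, which is what suppresses incomplete proposals that a single classifier tends to accept.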


Footnotes
4. To be specific, the 10 classes are: “Clean and Jerk”, “Hammer Throw”, “High Jump”, “Javelin Throw”, “Long Jump”, “Pole Vault”, “Shotput”, “Tennis Swing”, “Throw Discus”, “Volleyball Spiking”. Note that THUMOS'14 has two classes named “Cricket Bowling” and “Cricket Shot”, while ActivityNet v1.2 also has one called “Cricket”. However, we categorize these two classes into the unseen part, since the single label in ActivityNet cannot distinguish these two specific actions.
 
Literatur
Zurück zum Zitat Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: People detection and articulated pose estimation. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1014–1021). IEEE. Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: People detection and articulated pose estimation. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1014–1021). IEEE.
Zurück zum Zitat Buch, S., Escorcia, V., Ghanem, B., Fei-Fei, L., & Niebles, J. C. (2017a). End-to-end, single-stream temporal action detection in untrimmed videos. In The British machine vision conference (BMVC) (Vol. 2, p. 7). Buch, S., Escorcia, V., Ghanem, B., Fei-Fei, L., & Niebles, J. C. (2017a). End-to-end, single-stream temporal action detection in untrimmed videos. In The British machine vision conference (BMVC) (Vol. 2, p. 7).
Zurück zum Zitat Buch, S., Escorcia, V., Shen, C., Ghanem, B., & Niebles, J. C. (2017b). SST: Single-stream temporal action proposals. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6373–6382). IEEE. Buch, S., Escorcia, V., Shen, C., Ghanem, B., & Niebles, J. C. (2017b). SST: Single-stream temporal action proposals. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6373–6382). IEEE.
Zurück zum Zitat Caba Heilbron, F., Escorcia, V., Ghanem, B., & Niebles, J. C. (2015). Activitynet: A large-scale video benchmark for human activity understanding. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 961–970). Caba Heilbron, F., Escorcia, V., Ghanem, B., & Niebles, J. C. (2015). Activitynet: A large-scale video benchmark for human activity understanding. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 961–970).
Zurück zum Zitat Caba Heilbron, F., Niebles, J. C., & Ghanem, B. (2016). Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1914–1923). Caba Heilbron, F., Niebles, J. C., & Ghanem, B. (2016). Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1914–1923).
Zurück zum Zitat Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4724–4733). IEEE. Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4724–4733). IEEE.
Zurück zum Zitat Chao, Y. W., Vijayanarasimhan, S., Seybold, B., Ross, D. A., Deng, J., & Sukthankar, R. (2018). Rethinking the faster R-CNN architecture for temporal action localization. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1130–1139). Chao, Y. W., Vijayanarasimhan, S., Seybold, B., Ross, D. A., Deng, J., & Sukthankar, R. (2018). Rethinking the faster R-CNN architecture for temporal action localization. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1130–1139).
Zurück zum Zitat Dai, X., Singh, B., Zhang, G., Davis, L. S., & Chen, Y. Q. (2017). Temporal context network for activity localization in videos. In The IEEE international conference on computer vision (ICCV) (pp. 5727–5736). Dai, X., Singh, B., Zhang, G., Davis, L. S., & Chen, Y. Q. (2017). Temporal context network for activity localization in videos. In The IEEE international conference on computer vision (ICCV) (pp. 5727–5736).
Zurück zum Zitat De Geest, R., Gavves, E., Ghodrati, A., Li, Z., Snoek, C., & Tuytelaars, T. (2016). Online action detection. In European conference on computer vision (ECCV) (pp. 269–284). Springer. De Geest, R., Gavves, E., Ghodrati, A., Li, Z., Snoek, C., & Tuytelaars, T. (2016). Online action detection. In European conference on computer vision (ECCV) (pp. 269–284). Springer.
Zurück zum Zitat Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Li, F. (2009). ImageNet: A large-scale hierarchical image database. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 248–255). Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Li, F. (2009). ImageNet: A large-scale hierarchical image database. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 248–255).
Zurück zum Zitat Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2625–2634). Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2625–2634).
Zurück zum Zitat Escorcia, V., Caba Heilbron, F., Niebles, J. C., & Ghanem, B. (2016). Daps: Deep action proposals for action understanding”. In European conference on computer vision (ECCV) (pp. 768–784). Escorcia, V., Caba Heilbron, F., Niebles, J. C., & Ghanem, B. (2016). Daps: Deep action proposals for action understanding”. In European conference on computer vision (ECCV) (pp. 768–784).
Zurück zum Zitat Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1), 98–136.CrossRef Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1), 98–136.CrossRef
Zurück zum Zitat Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.CrossRef Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.CrossRef
Zurück zum Zitat Fernando, B., Gavves, E., Jo, M., Ghodrati, A., & Tuytelaars, T. (2015). Modeling video evolution for action recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5378–5387). Fernando, B., Gavves, E., Jo, M., Ghodrati, A., & Tuytelaars, T. (2015). Modeling video evolution for action recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 5378–5387).
Zurück zum Zitat Gaidon, A., Harchaoui, Z., & Schmid, C. (2013). Temporal localization of actions with actoms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11), 2782–2795.CrossRef Gaidon, A., Harchaoui, Z., & Schmid, C. (2013). Temporal localization of actions with actoms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11), 2782–2795.CrossRef
Zurück zum Zitat Gao, J., Chen, K., & Nevatia, R. (2018). Ctap: Complementary temporal action proposal generation. In The European conference on computer vision (ECCV) (pp. 68–83). Gao, J., Chen, K., & Nevatia, R. (2018). Ctap: Complementary temporal action proposal generation. In The European conference on computer vision (ECCV) (pp. 68–83).
Zurück zum Zitat Gao, J., Yang, Z., & Nevatia, R. (2017). Cascaded boundary regression for temporal action detection. In The British machine vision conference (BMVC). Gao, J., Yang, Z., & Nevatia, R. (2017). Cascaded boundary regression for temporal action detection. In The British machine vision conference (BMVC).
Zurück zum Zitat Girshick, R. (2015). Fast R-CNN. In The IEEE international conference on computer vision (ICCV) (pp. 1440–1448). Girshick, R. (2015). Fast R-CNN. In The IEEE international conference on computer vision (ICCV) (pp. 1440–1448).
Zurück zum Zitat Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 580–587). Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 580–587).
Zurück zum Zitat Gkioxari, G., & Malik, J. (2015). Finding action tubes. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 759–768). Gkioxari, G., & Malik, J. (2015). Finding action tubes. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 759–768).
Zurück zum Zitat Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D. A., Toderici, G., Li, Y., Ricco, S., Sukthankar, R., Schmid, C., et al. (2018). AVA: A video dataset of spatio-temporally localized atomic visual actions. In The IEEE conference on computer vision and pattern recognition (CVPR). Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D. A., Toderici, G., Li, Y., Ricco, S., Sukthankar, R., Schmid, C., et al. (2018). AVA: A video dataset of spatio-temporally localized atomic visual actions. In The IEEE conference on computer vision and pattern recognition (CVPR).
Zurück zum Zitat He, K., Zhang, X., Ren, S., & Sun, J. (2014), Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision (ECCV) (pp. 346–361). Springer. He, K., Zhang, X., Ren, S., & Sun, J. (2014), Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision (ECCV) (pp. 346–361). Springer.
Zurück zum Zitat He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778). He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778).
Zurück zum Zitat Hoai, M., Lan, Z. Z., & De la Torre, F. (2011). Joint segmentation and classification of human actions in video. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3265–3272). IEEE. Hoai, M., Lan, Z. Z., & De la Torre, F. (2011). Joint segmentation and classification of human actions in video. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3265–3272). IEEE.
Zurück zum Zitat Hoiem, D., Efros, A. A., & Hebert, M. (2008). Putting objects in perspective. International Journal of Computer Vision (IJCV), 80(1), 3–15.CrossRef Hoiem, D., Efros, A. A., & Hebert, M. (2008). Putting objects in perspective. International Journal of Computer Vision (IJCV), 80(1), 3–15.CrossRef
Zurück zum Zitat Hosang, J., Benenson, R., Dollár, P., & Schiele, B. (2016). What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4), 814–830.CrossRef Hosang, J., Benenson, R., Dollár, P., & Schiele, B. (2016). What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4), 814–830.CrossRef
Zurück zum Zitat Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML) (pp. 448–456). Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML) (pp. 448–456).
Zurück zum Zitat Jain, M., van Gemert, J. C., Jégou, H., Bouthemy, P., & Snoek, C. G. M. (2014). Action localization by tubelets from motion. In The IEEE conference on computer vision and pattern recognition (CVPR). Jain, M., van Gemert, J. C., Jégou, H., Bouthemy, P., & Snoek, C. G. M. (2014). Action localization by tubelets from motion. In The IEEE conference on computer vision and pattern recognition (CVPR).
Zurück zum Zitat Jiang, Y. G., Liu, J., Roshan Zamir, A., Toderici, G., Laptev, I., Shah, M., & Sukthankar, R. (2014). THUMOS challenge: Action recognition with a large number of classes. Retrieved April 7, 2019 from http://crcv.ucf.edu/THUMOS14/. Jiang, Y. G., Liu, J., Roshan Zamir, A., Toderici, G., Laptev, I., Shah, M., & Sukthankar, R. (2014). THUMOS challenge: Action recognition with a large number of classes. Retrieved April 7, 2019 from http://​crcv.​ucf.​edu/​THUMOS14/​.
Zurück zum Zitat Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1725–1732). Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1725–1732).
Zurück zum Zitat Lafferty, J., McCallum, A., Pereira, F., et al. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning (ICML), 1, 282–289. Lafferty, J., McCallum, A., Pereira, F., et al. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning (ICML), 1, 282–289.
Zurück zum Zitat Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision (IJCV), 64(2–3), 107–123. CrossRef Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision (IJCV), 64(2–3), 107–123. CrossRef
Zurück zum Zitat Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In The IEEE conference on computer vision and pattern recognition (CVPR) (Vol. 2, pp. 2169–2178). IEEE. Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In The IEEE conference on computer vision and pattern recognition (CVPR) (Vol. 2, pp. 2169–2178). IEEE.
Zurück zum Zitat Li, X., & Loy, C. C. (2018). Video object segmentation with joint re-identification and attention-aware mask propagation. In The European conference on computer vision (ECCV) (pp. 90–105). Li, X., & Loy, C. C. (2018). Video object segmentation with joint re-identification and attention-aware mask propagation. In The European conference on computer vision (ECCV) (pp. 90–105).
Zurück zum Zitat Li, Y., He, K., Sun, J., et al. (2016). R-FCN: Object detection via region-based fully convolutional networks. In Neural information processing systems (NIPS) (pp. 379–387). Li, Y., He, K., Sun, J., et al. (2016). R-FCN: Object detection via region-based fully convolutional networks. In Neural information processing systems (NIPS) (pp. 379–387).
Zurück zum Zitat Lin, T., Zhao, X., & Shou, Z. (2017). Single shot temporal action detection. In Proceedings of the 25th ACM international conference on Multimedia (pp. 988–996). ACM. Lin, T., Zhao, X., & Shou, Z. (2017). Single shot temporal action detection. In Proceedings of the 25th ACM international conference on Multimedia (pp. 988–996). ACM.
Zurück zum Zitat Lin, T., Zhao, X., Su, H., Wang, C., & Yang, M. (2018). BSN: Boundary sensitive network for temporal action proposal generation. In The European conference on computer vision (ECCV) (pp. 3–19). Lin, T., Zhao, X., Su, H., Wang, C., & Yang, M. (2018). BSN: Boundary sensitive network for temporal action proposal generation. In The European conference on computer vision (ECCV) (pp. 3–19).
Zurück zum Zitat Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European conference on computer vision (ECCV) (pp. 21–37). Springer. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European conference on computer vision (ECCV) (pp. 21–37). Springer.
Zurück zum Zitat Mettes, P., van Gemert, J. C., & Snoek, C. G. (2016). Spot on: Action localization from pointly-supervised proposals. In European conference on computer vision (ECCV) (pp. 437–453). Springer. Mettes, P., van Gemert, J. C., & Snoek, C. G. (2016). Spot on: Action localization from pointly-supervised proposals. In European conference on computer vision (ECCV) (pp. 437–453). Springer.
Zurück zum Zitat Mettes, P., van Gemert, J. C., Cappallo, S., Mensink, T., & Snoek, C. G. (2015). Bag-of-fragments: Selecting and encoding video fragments for event detection and recounting. In ACM international conference on multimedia retrieval (ICMR) (pp. 427–434). Mettes, P., van Gemert, J. C., Cappallo, S., Mensink, T., & Snoek, C. G. (2015). Bag-of-fragments: Selecting and encoding video fragments for event detection and recounting. In ACM international conference on multimedia retrieval (ICMR) (pp. 427–434).
Zurück zum Zitat Montes, A., Salvador, A., Pascual, S., & Giro-i Nieto, X. (2016). Temporal activity detection in untrimmed videos with recurrent neural networks. In NIPS workshop on large scale computer vision systems. Montes, A., Salvador, A., Pascual, S., & Giro-i Nieto, X. (2016). Temporal activity detection in untrimmed videos with recurrent neural networks. In NIPS workshop on large scale computer vision systems.
Zurück zum Zitat Ng, J. Y. H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4694–4702). Ng, J. Y. H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4694–4702).
Zurück zum Zitat Nguyen, P., Liu, T., Prasad, G., & Han, B. (2018) Weakly supervised action localization by sparse temporal pooling network. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6752–6761). Nguyen, P., Liu, T., Prasad, G., & Han, B. (2018) Weakly supervised action localization by sparse temporal pooling network. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 6752–6761).
Zurück zum Zitat Niebles, J. C., Chen, C. W., & Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In European conference on computer vision (ECCV) (pp. 392–405). Springer. Niebles, J. C., Chen, C. W., & Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In European conference on computer vision (ECCV) (pp. 392–405). Springer.
Zurück zum Zitat Oneata, D., Verbeek, J., & Schmid, C. (2013). Action and event recognition with fisher vectors on a compact feature set. In The IEEE international conference on computer vision (ICCV) (pp. 1817–1824). Oneata, D., Verbeek, J., & Schmid, C. (2013). Action and event recognition with fisher vectors on a compact feature set. In The IEEE international conference on computer vision (ICCV) (pp. 1817–1824).
Zurück zum Zitat Oneata, D., Verbeek, J., & Schmid, C. (2014). The lear submission at thumos 2014. In THUMOS action recognition challenge. Oneata, D., Verbeek, J., & Schmid, C. (2014). The lear submission at thumos 2014. In THUMOS action recognition challenge.
Zurück zum Zitat Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.MathSciNetMATH Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.MathSciNetMATH
Zurück zum Zitat Peng, X., & Schmid, C. (2016). Multi-region two-stream R-CNN for action detection. In European conference on computer vision (ECCV). Springer. Peng, X., & Schmid, C. (2016). Multi-region two-stream R-CNN for action detection. In European conference on computer vision (ECCV). Springer.
Zurück zum Zitat Pirsiavash, H., & Ramanan, D. (2014). Parsing videos of actions with segmental grammars. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 612–619). Pirsiavash, H., & Ramanan, D. (2014). Parsing videos of actions with segmental grammars. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 612–619).
Zurück zum Zitat Pont-Tuset, J., Arbelaez, P., Barron, J. T., Marques, F., & Malik, J. (2017). Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 128–140.CrossRef Pont-Tuset, J., Arbelaez, P., Barron, J. T., Marques, F., & Malik, J. (2017). Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 128–140.CrossRef
Zurück zum Zitat Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural information processing systems (NIPS) (pp. 91–99). Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural information processing systems (NIPS) (pp. 91–99).
Zurück zum Zitat Richard, A., & Gall, J. (2016). Temporal action detection using a statistical language model. In The IEEE conference on computer vision and pattern recognition (CVPR)( pp. 3131–3140). Richard, A., & Gall, J. (2016). Temporal action detection using a statistical language model. In The IEEE conference on computer vision and pattern recognition (CVPR)( pp. 3131–3140).
Zurück zum Zitat Roerdink, J. B., & Meijster, A. (2000). The watershed transform: Definitions, algorithms and parallelization strategies. Fundamenta Informaticae, 41(1,2), 187–228.MathSciNetCrossRef Roerdink, J. B., & Meijster, A. (2000). The watershed transform: Definitions, algorithms and parallelization strategies. Fundamenta Informaticae, 41(1,2), 187–228.MathSciNetCrossRef
Zurück zum Zitat Schindler, K., & Van Gool, L. (2008). Action snippets: How many frames does human action recognition require? In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–8). IEEE. Schindler, K., & Van Gool, L. (2008). Action snippets: How many frames does human action recognition require? In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–8). IEEE.
Zurück zum Zitat Shou, Z., Chan, J., Zareian, A., Miyazawa, K., & Chang, S. F. (2017). CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1417–1426). Shou, Z., Chan, J., Zareian, A., Miyazawa, K., & Chang, S. F. (2017). CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1417–1426).
Zurück zum Zitat Shou, Z., Gao, H., Zhang, L., Miyazawa, K., & Chang, S. F. (2018). AutoLoc: Weakly-supervised temporal action localization in untrimmed videos. In European conference on computer vision (ECCV) (pp. 154–171). Shou, Z., Gao, H., Zhang, L., Miyazawa, K., & Chang, S. F. (2018). AutoLoc: Weakly-supervised temporal action localization in untrimmed videos. In European conference on computer vision (ECCV) (pp. 154–171).
Zurück zum Zitat Shou, Z., Wang, D., & Chang, S. F. (2016). Temporal action localization in untrimmed videos via multi-stage CNNs. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1049–1058). Shou, Z., Wang, D., & Chang, S. F. (2016). Temporal action localization in untrimmed videos via multi-stage CNNs. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1049–1058).
Zurück zum Zitat Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 761–769). Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 761–769).
Zurück zum Zitat Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Neural information processing systems (NIPS) (pp. 568–576). Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Neural information processing systems (NIPS) (pp. 568–576).
Zurück zum Zitat Singh, G., & Cuzzolin, F. (2016). Untrimmed video classification for activity detection: Submission to activitynet challenge. CoRR abs/1607.01979 Singh, G., & Cuzzolin, F. (2016). Untrimmed video classification for activity detection: Submission to activitynet challenge. CoRR abs/1607.01979
Zurück zum Zitat Singh, B., Marks, T. K., Jones, M., Tuzel, O., & Shao, M. (2016). A multi-stream bi-directional recurrent neural network for fine-grained action detection. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1961–1970). Singh, B., Marks, T. K., Jones, M., Tuzel, O., & Shao, M. (2016). A multi-stream bi-directional recurrent neural network for fine-grained action detection. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1961–1970).
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2818–2826).
Tang, K., Yao, B., Fei-Fei, L., & Koller, D. (2013). Combining the right features for complex event recognition. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2696–2703).
Tran, D., Bourdev, L. D., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In The IEEE international conference on computer vision (ICCV) (pp. 4489–4497).
Van de Sande, K. E., Uijlings, J. R., Gevers, T., & Smeulders, A. W. (2011). Segmentation as selective search for object recognition. In The IEEE international conference on computer vision (ICCV) (pp. 1879–1886).
Van Gemert, J. C., Jain, M., Gati, E., Snoek, C. G., et al. (2015). APT: Action localization proposals from dense trajectories. In The British machine vision conference (BMVC) (Vol. 2, p. 4).
Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In The IEEE international conference on computer vision (ICCV) (pp. 3551–3558).
Wang, R., & Tao, D. (2016). UTS at ActivityNet 2016. In ActivityNet large scale activity recognition challenge 2016.
Wang, L., Qiao, Y., & Tang, X. (2014a). Action recognition and detection by combining motion and appearance features. In THUMOS action recognition challenge.
Wang, L., Qiao, Y., & Tang, X. (2014b). Latent hierarchical model of temporal structure for complex activity classification. IEEE Transactions on Image Processing, 23(2), 810–822.
Wang, L., Qiao, Y., & Tang, X. (2015). Action recognition with trajectory-pooled deep-convolutional descriptors. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 4305–4314).
Wang, L., Qiao, Y., Tang, X., & Van Gool, L. (2016a). Actionness estimation using hybrid fully convolutional networks. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2708–2717).
Wang, L., Xiong, Y., Lin, D., & Van Gool, L. (2017). Untrimmednets for weakly supervised action recognition and detection. In The IEEE conference on computer vision and pattern recognition (CVPR).
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016b). Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision (ECCV) (pp. 20–36).
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., et al. (2018). Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Wang, P., Cao, Y., Shen, C., Liu, L., & Shen, H. T. (2016c). Temporal pyramid pooling based convolutional neural network for action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 27, 2613–2622.
Weinzaepfel, P., Harchaoui, Z., & Schmid, C. (2015). Learning to track for spatio-temporal action localization. In The IEEE international conference on computer vision (ICCV) (pp. 3164–3172).
Xu, H., Das, A., & Saenko, K. (2017). R-C3D: Region convolutional 3D network for temporal activity detection. In The IEEE international conference on computer vision (ICCV) (Vol. 6, p. 8).
Yeung, S., Russakovsky, O., Mori, G., & Fei-Fei, L. (2016). End-to-end learning of action detection from frame glimpses in videos. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2678–2687).
Yuan, J., Ni, B., Yang, X., & Kassim, A. A. (2016). Temporal action localization with pyramid of score distribution features. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3093–3102).
Zach, C., Pock, T., & Bischof, H. (2007). A duality based approach for realtime TV-\(L^1\) optical flow. In 29th DAGM symposium on pattern recognition (pp. 214–223).
Zhang, D., Dai, X., Wang, X., & Wang, Y. F. (2018). \(\rm S^3D\): Single shot multi-span detector via fully 3D convolutional network. In The British machine vision conference (BMVC).
Zhang, B., Wang, L., Wang, Z., Qiao, Y., & Wang, H. (2016). Real-time action recognition with enhanced motion vector CNNs. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2718–2726).
Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., & Lin, D. (2017a). Temporal action detection with structured segment networks. In The IEEE international conference on computer vision (ICCV) (pp. 2914–2923).
Zhao, Y., Zhang, B., Wu, Z., Yang, S., Zhou, L., Yan, S., Wang, L., Xiong, Y., Lin, D., & Qiao, Y. (2017b). CUHK & ETHZ & SIAT submission to ActivityNet Challenge 2017. arXiv:1710.08011.
Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In European conference on computer vision (ECCV) (pp. 391–405).
Metadata
Title
Temporal Action Detection with Structured Segment Networks
Authors
Yue Zhao
Yuanjun Xiong
Limin Wang
Zhirong Wu
Xiaoou Tang
Dahua Lin
Publication date
28.08.2019
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 1/2020
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-019-01211-2