Published in: International Journal of Computer Vision 5/2020

28-10-2019

A Weakly Supervised Multi-task Ranking Framework for Actor–Action Semantic Segmentation

Authors: Yan Yan, Chenliang Xu, Dawen Cai, Jason J. Corso

Abstract

Modeling human behaviors and activity patterns has attracted significant research interest in recent years, and accurate behavior modeling requires fine-grained activity understanding in videos. Research focus has accordingly shifted from action classification to detailed actor and action understanding, which serves the perceptual needs of cutting-edge autonomous systems. However, current methods for detailed actor and action understanding have significant limitations: they require large amounts of finely labeled data, and they fail to capture the internal relationships among actors and actions. To address these issues, we propose a novel Schatten p-norm robust multi-task ranking model for weakly supervised actor–action segmentation, where only video-level tags are given for training. Our model shares useful information among different actors and actions while learning a ranking matrix that selects representative supervoxels for actors and actions respectively. Final segmentation results are generated by a conditional random field that combines the ranking scores of video parts. Extensive experiments on both the Actor–Action Dataset and the YouTube-Objects dataset demonstrate that the proposed approach outperforms state-of-the-art weakly supervised methods and performs on par with the top-performing fully supervised method.
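For reference, the Schatten p-norm regularizer named in the abstract is simply the \(\ell_p\) norm of a matrix's singular values, with p = 1 recovering the nuclear (trace) norm and p = 2 the Frobenius norm. The following is a minimal illustrative sketch of that definition (not the paper's optimization code):

```python
import numpy as np

def schatten_p_norm(W: np.ndarray, p: float) -> float:
    """Schatten p-norm of W: the l_p norm of its singular values."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return float((s ** p).sum() ** (1.0 / p))

# Sanity check on a diagonal matrix whose singular values are 3 and 4:
W = np.array([[3.0, 0.0],
              [0.0, 4.0]])
print(schatten_p_norm(W, 1))  # nuclear norm: 3 + 4 = 7.0
print(schatten_p_norm(W, 2))  # Frobenius norm: sqrt(9 + 16) = 5.0
```

Choosing 0 < p < 1 (as in robust low-rank multi-task formulations) gives a tighter, non-convex surrogate for matrix rank than the nuclear norm.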

Metadata
Title
A Weakly Supervised Multi-task Ranking Framework for Actor–Action Semantic Segmentation
Authors
Yan Yan
Chenliang Xu
Dawen Cai
Jason J. Corso
Publication date
28-10-2019
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 5/2020
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-019-01244-7
