Published in: International Journal of Computer Vision 5/2020

28-10-2019

A Weakly Supervised Multi-task Ranking Framework for Actor–Action Semantic Segmentation

Authors: Yan Yan, Chenliang Xu, Dawen Cai, Jason J. Corso

Abstract

Modeling human behaviors and activity patterns has attracted significant research interest in recent years, and accurate behavior modeling requires fine-grained activity understanding in videos. Research focus has accordingly shifted from action classification to detailed actor and action understanding, which serves the perceptual needs of cutting-edge autonomous systems. However, current methods for detailed actor and action understanding have significant limitations: they require large amounts of finely labeled data, and they fail to capture the internal relationships among actors and actions. To address these issues, we propose a novel Schatten p-norm robust multi-task ranking model for weakly supervised actor–action segmentation, where only video-level tags are given for training. Our model shares useful information among different actors and actions while learning a ranking matrix that selects representative supervoxels for actors and actions respectively. Final segmentation results are generated by a conditional random field that combines the ranking scores of video parts. Extensive experiments on both the Actor–Action Dataset and the YouTube-Objects dataset demonstrate that the proposed approach outperforms state-of-the-art weakly supervised methods and performs on par with the top-performing fully supervised method.
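For reference, the Schatten p-norm regularizer named in the abstract is simply the \(\ell_p\) norm of a matrix's singular values, with p = 1 recovering the nuclear (trace) norm and p = 2 the Frobenius norm. The following is a minimal illustrative sketch of that definition (not the paper's optimization code):

```python
import numpy as np

def schatten_p_norm(W: np.ndarray, p: float) -> float:
    """Schatten p-norm of W: the l_p norm of its singular values."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return float((s ** p).sum() ** (1.0 / p))

# Sanity check on a diagonal matrix whose singular values are 3 and 4:
W = np.array([[3.0, 0.0],
              [0.0, 4.0]])
print(schatten_p_norm(W, 1))  # nuclear norm: 3 + 4 = 7.0
print(schatten_p_norm(W, 2))  # Frobenius norm: sqrt(9 + 16) = 5.0
```

Choosing 0 < p < 1 (as in robust low-rank multi-task formulations) gives a tighter, non-convex surrogate for matrix rank than the nuclear norm.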

Metadata
Title
A Weakly Supervised Multi-task Ranking Framework for Actor–Action Semantic Segmentation
Authors
Yan Yan
Chenliang Xu
Dawen Cai
Jason J. Corso
Publication date
28-10-2019
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 5/2020
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-019-01244-7
