International Journal of Computer Vision, published 1 June 2016

Exploiting Privileged Information from Web Data for Action and Event Recognition

Authors: Li Niu, Wen Li, Dong Xu

Published in: International Journal of Computer Vision | Issue 2/2016

Abstract

Conventional approaches to action and event recognition generally require sufficient labelled training videos to learn robust classifiers that generalize well to new test videos. However, collecting labelled training videos is often time-consuming and expensive. In this work, we propose new learning frameworks to train robust classifiers for action and event recognition by using freely available web videos as training data. We aim to address three challenging issues: (1) the training web videos are generally associated with rich textual descriptions, which are not available in test videos; (2) the labels of training web videos are noisy and may be inaccurate; (3) the data distributions of training and test videos are often considerably different. To address the first two issues, we propose a new framework called multi-instance learning with privileged information (MIL-PI) together with three new MIL methods, in which we not only take advantage of the additional textual descriptions of training web videos as privileged information, but also explicitly cope with noise in the loose labels of training web videos. When the training and test videos come from different data distributions, we further extend MIL-PI into a new framework called domain adaptive MIL-PI. We also propose three additional domain adaptation methods, which can further reduce the data distribution mismatch between training and test videos. Comprehensive experiments for action and event recognition demonstrate the effectiveness of our proposed approaches.
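The setting described in the abstract can be illustrated with a minimal sketch. It is not the authors' MIL-PI method: it only shows the data layout (visual features everywhere, textual features as privileged information on training videos only, loose bag-level labels) and assumes the standard multi-instance rule that a bag is positive if any of its instances scores positive. The class names, the linear scorer and the hand-set weights `w` and `b` are hypothetical and chosen purely for illustration; no training is shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    visual: List[float]                 # visual features: present in training and test videos
    text: Optional[List[float]] = None  # privileged textual features: training web videos only


@dataclass
class Bag:
    instances: List[Instance]
    label: Optional[int] = None         # loose bag-level label (+1 / -1); may be noisy


def instance_score(w, b, inst):
    # Linear score on visual features only; text is never used at prediction time.
    return sum(wi * xi for wi, xi in zip(w, inst.visual)) + b


def predict_bag(w, b, bag):
    # Assumed standard MIL rule: a bag is positive if any instance scores positive.
    best = max(instance_score(w, b, inst) for inst in bag.instances)
    return 1 if best > 0 else -1


# Hypothetical hand-set weights, for illustration only (no training is shown).
w, b = [1.0, -1.0], 0.0

# A training bag of web videos: each instance carries visual and textual features.
train_bag = Bag(
    instances=[
        Instance(visual=[0.2, 0.9], text=[1.0, 0.0]),
        Instance(visual=[0.8, 0.1], text=[0.0, 1.0]),
    ],
    label=+1,
)

# A test bag: only visual features are available.
test_bag = Bag(instances=[Instance(visual=[0.1, 0.5]), Instance(visual=[0.3, 0.4])])
```

Calling `predict_bag(w, b, train_bag)` and `predict_bag(w, b, test_bag)` scores each bag from visual features alone, which mirrors the constraint that textual descriptions are unavailable for test videos.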

The work in Li et al. (2011) used both visual and textual features in the training process. However, it also requires the textual features in the testing process.

The bias term \(\hat{b}\) and the scalar terms \(\rho\) and \(\frac{1}{\|\mathbf{v}\|}\) will not change the trend of functions.

Aggarwal, J. K., & Ryoo, M. S. (2011). Human activity analysis: A review. ACM Computing Surveys (CSUR), 43(3), 16.

Andrews, S., Tsochantaridis, I., & Hofmann, T. (2003). Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems (NIPS) (pp. 561–568).

Baktashmotlagh, M., Harandi, M. T., Lovell, B. C., & Salzmann, M. (2013). Unsupervised domain adaptation by domain invariant projection. In IEEE International Conference on Computer Vision (ICCV) (pp. 769–776).

Bergamo, A., & Torresani, L. (2010). Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems (NIPS) (pp. 181–189).

Bobick, A. F. (1997). Movement, activity and action: The role of knowledge in the perception of motion. Philosophical Transactions of the Royal Society B: Biological Sciences, 352(1358), 1257–1265.

Bootkrajang, J., & Kabán, A. (2014). Learning kernel logistic regression in the presence of class label noise. Pattern Recognition, 47(11), 3641–3655.

Bruzzone, L., & Marconcini, M. (2010). Domain adaptation problems: A DASVM classification technique and a circular validation strategy. T-PAMI, 32(5), 770–787.

Bunescu, R. C., & Mooney, R. J. (2007). Multiple instance learning for sparse positive bags. In International Conference on Machine learning (ICML) (pp. 105–112).

Chang, S. F., Ellis, D., Jiang, W., Lee, K., Yanagawa, A., Loui, A. C., & Luo, J. (2007). Large-scale multimodal semantic concept detection for consumer video. In International Workshop on Multimedia Information Retrieval (pp. 255–264).

Chen, L., Duan, L., & Xu, D. (2013a). Event recognition in videos by learning from heterogeneous web sources. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2666–2673).

Chen, X., Shrivastava, A., & Gupta, A. (2013b). NEIL: Extracting visual knowledge from web data. In IEEE International Conference on Computer Vision (ICCV) (pp. 1409–1416).

Chen, Y., Bi, J., & Wang, J. Z. (2006). MILES: Multiple-instance learning via embedded instance selection. T-PAMI, 28(12), 1931–1947.

Chu, W. S., De la Torre, F., & Cohn, J. (2013). Selective transfer machine for personalized facial action unit detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3515–3522).

Duan, L., Li, W., Tsang, I. W., & Xu, D. (2011). Improving web image search by bag-based re-ranking. T-IP, 20(11), 3280–3290.

Duan, L., Tsang, I. W., & Xu, D. (2012a). Domain transfer multiple kernel learning. T-PAMI, 34(3), 465–479.

Duan, L., Xu, D., & Chang, S. F. (2012b). Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1338–1345).

Duan, L., Xu, D., & Tsang, I. W. (2012c). Domain adaptation from multiple sources: A domain-dependent regularization approach. T-NNLS, 23(3), 504–518.

Duan, L., Xu, D., Tsang, I. W., & Luo, J. (2012d). Visual event recognition in videos by learning from web data. T-PAMI, 34(9), 1667–1680.

Farhadi, A., Endres, I., Hoiem, D., & Forsyth, D. (2009). Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1778–1785).

Farquhar, J. D. R., Hardoon, D. R., Meng, H., Shawe-Taylor, J., & Szedmak, S. (2005). Two view learning: SVM-2K, theory and practice. In NIPS.

Fergus, R., Fei-Fei, L., Perona, P., & Zisserman, A. (2005). Learning object categories from Google’s image search. In ICCV.

Fernando, B., Habrard, A., Sebban, M., & Tuytelaars, T. (2013). Unsupervised visual domain adaptation using subspace alignment. In ICCV.

Ferrari, V., & Zisserman, A. (2007). Learning visual attributes. In Advances in Neural Information Processing Systems (NIPS) (pp. 433–440).

Fouad, S., Tino, P., Raychaudhury, S., & Schneider, P. (2013). Incorporating privileged information through metric learning. T-NNLS, 24(7), 1086–1098.

Gehler, P. V., & Nowozin, S. (2008). Infinite kernel learning. Tech. rep., Max Planck Institute for Biological Cybernetics. In NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels.

Gong, B., Shi, Y., Sha, F., & Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2066–2073).

Gopalan, R., Li, R., & Chellappa, R. (2011). Domain adaptation for object recognition: An unsupervised approach. In IEEE International Conference on Computer Vision (ICCV) (pp. 999–1006).

Gretton, A., Rasch, K. M., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. JMLR, 13, 723–773.

Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2639–2664.

Hu, Y., Cao, L., Lv, F., Yan, S., Gong, Y., & Huang, T. S. (2009). Action detection in complex scenes with spatial and temporal ambiguities. In IEEE International Conference on Computer Vision (ICCV) (pp. 128–135).

Huang, J., Smola, A., Gretton, A., Borgwardt, K., & Schölkopf, B. (2007). Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems (NIPS) (pp. 601–608).

Hwang, S. J., & Grauman, K. (2012). Learning the relative importance of objects from tagged images for retrieval and cross-modal search. IJCV, 100(2), 134–153.

Jiang, Y. G., Ye, G., Chang, S. F., Ellis, D., & Loui, A. C. (2011). Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In International Conference on Multimedia Retrieval (ICMR) (p. 29).

Jiang, Y. G., Bhattacharya, S., Chang, S. F., & Shah, M. (2013). High-level event recognition in unconstrained videos. International Journal of Multimedia Information Retrieval, 2(2), 73–101.

Kloft, M., Brefeld, U., Sonnenburg, S., & Zien, A. (2011). \(\ell_p\)-norm multiple kernel learning. JMLR, 12, 953–997.

Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: a large video database for human motion recognition. In IEEE International Conference on Computer Vision (ICCV) (pp. 2556–2563).

Kulis, B., Saenko, K., & Darrell, T. (2011). What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1785–1792).

Le, Q. V., Zou, W. Y., Yeung, S. Y., & Ng, A. Y. (2011). Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3361–3368).

Leung, T., Song, Y., & Zhang, J. (2011). Handling label noise in video classification via multiple instance learning. In IEEE International Conference on Computer Vision (ICCV) (pp. 2056–2063).

Li, Q., Wu, J., & Tu, Z. (2013). Harvesting mid-level visual concepts from large-scale Internet images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 851–858).

Li, W., Duan, L., Xu, D., & Tsang, I. W. (2011). Text-based image retrieval using progressive multi-instance learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2368–2375).

Li, W., Duan, L., Tsang, I. W., & Xu, D. (2012a). Batch mode adaptive multiple instance learning for computer vision tasks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2368–2375).

Li, W., Duan, L., Tsang, I. W., & Xu, D. (2012b). Co-labeling: A new multi-view learning approach for ambiguous problems. In IEEE International Conference on Data Mining (ICDM) (pp. 419–428).

Li, W., Duan, L., Xu, D., & Tsang, I. W. (2014a). Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. T-PAMI, 36(6), 1134–1148.

Li, W., Niu, L., & Xu, D. (2014b). Exploiting privileged information from web data for image categorization. In European Conference on Computer Vision (ECCV) (pp. 437–452).

Li, Y.-F., Tsang, I. W., Kwok, J. T., & Zhou, Z.-H. (2009). Tighter and convex maximum margin clustering. In International Conference on Artificial Intelligence and Statistics (pp. 344–351).

Liang, L., Cai, F., & Cherkassky, V. (2009). Predictive learning with structured (grouped) data. Neural Networks, 22, 766–773.

Lin, Z., Jiang, Z., & Davis, L. S. (2009). Recognizing actions by shape-motion prototype trees. In IEEE International Conference on Computer Vision (ICCV) (pp. 444–451).

Loui, A., Luo, J., Chang, S. F., Ellis, D., Jiang, W., Kennedy, L., Lee, K., & Yanagawa, A. (2007). Kodak’s consumer video benchmark data set: concept definition and annotation. In International Workshop on Multimedia Information Retrieval (pp. 245–254).

Morariu, V. I., & Davis, L. S. (2011). Multi-agent event recognition in structured scenarios. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3289–3296).

Natarajan, N., Dhillon, I. S., Ravikumar, P. K., & Tewari, A. (2013). Learning with noisy labels. In Advances in Neural Information Processing Systems (NIPS) (pp. 1196–1204).

Pan, S. J., Tsang, I. W., Kwok, J. T., & Yang, Q. (2011). Domain adaptation via transfer component analysis. T-NN, 22(2), 199–210.

Schroff, F., Criminisi, A., & Zisserman, A. (2011). Harvesting image databases from the web. T-PAMI, 33(4), 754–766.

Sharmanska, V., Quadrianto, N., & Lampert, C. H. (2013). Learning to rank using privileged information. In IEEE International Conference on Computer Vision (ICCV) (pp. 825–832).

Shi, Y., Huang, Y., Minnen, D., Bobick, A., & Essa, I. (2004). Propagation networks for recognition of partially ordered sequential action. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (vol. 2, pp. II-862–II-869).

Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1521–1528).

Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: A large data set for nonparametric object and scene recognition. T-PAMI, 30(11), 1958–1970.

Torresani, L., Szummer, M., & Fitzgibbon, A. (2010). Efficient object category recognition using classemes. In European Conference on Computer Vision (ECCV) (pp. 776–789).

Tran, S. D., & Davis, L. S. (2008). Event modeling and recognition using Markov logic networks. In European Conference on Computer Vision (ECCV) (pp. 610–623).

Vapnik, V., & Vashist, A. (2009). A new learning paradigm: Learning using privileged information. Neural Networks, 22, 544–557.

Vijayanarasimhan, S., & Grauman, K. (2008). Keywords to visual categories: Multiple-instance learning for weakly supervised object categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1–8).

Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In IEEE International Conference on Computer Vision (ICCV) (pp. 3551–3558).

Wang, H., Klaser, A., Schmid, C., & Liu, C. L. (2011a). Action recognition by dense trajectories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3169–3176).

Wang, L., Wang, Y., & Gao, W. (2011b). Mining layered grammar rules for action recognition. International Journal of Computer Vision, 93(2), 162–182.

Xu, D., & Chang, S. F. (2008). Video event recognition using kernel methods with multilevel temporal alignment. T-PAMI, 30(11), 1985–1997.

Yu, T. H., Kim, T. K., & Cipolla, R. (2010). Real-time action recognition by spatiotemporal semantic and structural forests. In The British Machine Vision Conference (BMVC) (pp. 52.1–52.12).

Zeng, Z., & Ji, Q. (2010). Knowledge based activity recognition with dynamic Bayesian network. In European Conference on Computer Vision (ECCV) (pp. 532–546).

Zhou, Z., & Zhang, M. (2006). Multi-instance multi-label learning with application to scene classification. In Advances in Neural Information Processing Systems (NIPS) (pp. 1609–1616).

Zhu, G., Yang, M., Yu, K., Xu, W., & Gong, Y. (2009). Detecting video events based on action recognition in complex scenes using spatio-temporal descriptor. In Proceedings of the 17th ACM international conference on Multimedia (pp. 165–174). ACM.

Title: Exploiting Privileged Information from Web Data for Action and Event Recognition
Authors: Li Niu, Wen Li, Dong Xu
Publication date: 1 June 2016
Publisher: Springer US
Published in: International Journal of Computer Vision / Issue 2/2016
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-015-0862-5
