ABSTRACT
Multimedia event detection (MED) is an effective technique for video indexing and retrieval. Current classifier training for MED treats all negative videos equally. However, negative videos may resemble the positive videos to varying degrees. Intuitively, assigning fine-grained labels to the negative videos lets us capture more informative cues from them, which benefits classifier learning. To this end, we apply a statistical method to both the positive and negative examples to identify the decisive attributes of a specific event. Based on these decisive attributes, we assign fine-grained labels to the negative examples so that they can be treated differently and exploited more effectively. Because the resulting fine-grained labels may not characterize the negative videos accurately enough, we propose to jointly optimize them with knowledge from the visual features and the attribute representations, so that the two sources reinforce each other. Our model yields two classifiers, one learned from the attributes and one from the features, both of which incorporate the informative cues carried by the fine-grained labels. The outputs of both classifiers on the testing videos are fused for detection. Extensive experiments on the challenging TRECVID MED 2012 development set validate the efficacy of the proposed approach.
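The pipeline sketched in the abstract, finding decisive attributes statistically, grading negatives with fine-grained labels, and fusing two classifier outputs, can be illustrated roughly as follows. This is a hedged sketch under simplifying assumptions, not the paper's actual formulation: the helpers `decisive_attributes`, `fine_grained_labels`, and `fuse` are hypothetical names, and the mean-gap statistic, the binning into label levels, and the weighted late fusion stand in for the method's actual statistical test and joint optimization.

```python
# Illustrative sketch only. Assumes each video is represented by a vector of
# attribute scores in [0, 1]; all function names and choices here are
# hypothetical stand-ins for the paper's actual method.

def decisive_attributes(pos, neg, k=2):
    """Rank attributes by the gap between their mean score on positive and
    negative examples; the k largest gaps are taken as 'decisive'."""
    dim = len(pos[0])
    def mean(videos, j):
        return sum(v[j] for v in videos) / len(videos)
    gaps = [(mean(pos, j) - mean(neg, j), j) for j in range(dim)]
    return [j for _, j in sorted(gaps, reverse=True)[:k]]

def fine_grained_labels(neg, attrs, levels=3):
    """Bin each negative video into one of `levels` grades (0 = clearly
    negative, levels-1 = near-positive) by its mean decisive-attribute score."""
    labels = []
    for v in neg:
        score = sum(v[j] for j in attrs) / len(attrs)
        labels.append(min(int(score * levels), levels - 1))
    return labels

def fuse(score_attr, score_feat, w=0.5):
    """Weighted late fusion of the attribute-based and feature-based
    classifier outputs on a testing video."""
    return w * score_attr + (1 - w) * score_feat


if __name__ == "__main__":
    # Toy attribute scores: attribute 1 separates the classes most strongly.
    pos = [[0.9, 0.8, 0.1], [0.8, 0.9, 0.2]]
    neg = [[0.7, 0.1, 0.3], [0.1, 0.2, 0.9]]
    attrs = decisive_attributes(pos, neg)          # -> [1, 0]
    print(fine_grained_labels(neg, attrs))         # first negative is closer
    print(fuse(0.8, 0.6))                          # fused detection score
```

The fine-grained labels produced this way are only an initialization; in the paper they are refined jointly with the feature and attribute classifiers rather than fixed up front.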
Index Terms
- We are not equally negative: fine-grained labeling for multimedia event detection