ABSTRACT
Multimedia event detection (MED) is an effective technique for video indexing and retrieval. Current classifier training for MED treats all negative videos equally. However, negative videos may resemble the positive videos to varying degrees. Intuitively, assigning fine-grained labels to the negative videos lets us capture more informative cues from them, which benefits classifier learning. To this end, we apply a statistical method to both the positive and negative examples to identify the decisive attributes of a specific event. Based on these decisive attributes, we assign fine-grained labels to the negative examples so that they can be treated differently and exploited more effectively. Because the resulting fine-grained labels may not characterize the negative videos accurately enough, we propose to jointly optimize them with knowledge from the visual features and the attribute representations, so that the two sources reinforce each other. Our model yields two classifiers, one learned from the attributes and one from the features, both of which incorporate the informative cues carried by the fine-grained labels. The outputs of both classifiers on the testing videos are fused for detection. Extensive experiments on the challenging TRECVID MED 2012 development set validate the efficacy of the proposed approach.
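The pipeline sketched in the abstract, finding decisive attributes statistically, grading negatives with fine-grained labels, and fusing two classifier outputs, can be illustrated roughly as follows. This is a hedged sketch under simplifying assumptions, not the paper's actual formulation: the helpers `decisive_attributes`, `fine_grained_labels`, and `fuse` are hypothetical names, and the mean-gap statistic, the binning into label levels, and the weighted late fusion stand in for the method's actual statistical test and joint optimization.

```python
# Illustrative sketch only. Assumes each video is represented by a vector of
# attribute scores in [0, 1]; all function names and choices here are
# hypothetical stand-ins for the paper's actual method.

def decisive_attributes(pos, neg, k=2):
    """Rank attributes by the gap between their mean score on positive and
    negative examples; the k largest gaps are taken as 'decisive'."""
    dim = len(pos[0])
    def mean(videos, j):
        return sum(v[j] for v in videos) / len(videos)
    gaps = [(mean(pos, j) - mean(neg, j), j) for j in range(dim)]
    return [j for _, j in sorted(gaps, reverse=True)[:k]]

def fine_grained_labels(neg, attrs, levels=3):
    """Bin each negative video into one of `levels` grades (0 = clearly
    negative, levels-1 = near-positive) by its mean decisive-attribute score."""
    labels = []
    for v in neg:
        score = sum(v[j] for j in attrs) / len(attrs)
        labels.append(min(int(score * levels), levels - 1))
    return labels

def fuse(score_attr, score_feat, w=0.5):
    """Weighted late fusion of the attribute-based and feature-based
    classifier outputs on a testing video."""
    return w * score_attr + (1 - w) * score_feat


if __name__ == "__main__":
    # Toy attribute scores: attribute 1 separates the classes most strongly.
    pos = [[0.9, 0.8, 0.1], [0.8, 0.9, 0.2]]
    neg = [[0.7, 0.1, 0.3], [0.1, 0.2, 0.9]]
    attrs = decisive_attributes(pos, neg)          # -> [1, 0]
    print(fine_grained_labels(neg, attrs))         # first negative is closer
    print(fuse(0.8, 0.6))                          # fused detection score
```

The fine-grained labels produced this way are only an initialization; in the paper they are refined jointly with the feature and attribute classifiers rather than fixed up front.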
Index Terms
- We are not equally negative: fine-grained labeling for multimedia event detection