Learning automatic concept detectors from online video
Introduction
Modern video retrieval systems employ textual descriptions indicating the presence of semantic concepts in videos, like objects (“airplane”), persons (“Michael Jackson”), locations (“desert”), and activities (“interview”) [37]. However, in many practical situations, such descriptions are not at hand, and manual labeling is infeasible due to the enormous size of today’s video databases. To overcome this problem, concept detection (or video tagging) systems automatically infer the presence of semantic concepts by applying machine learning techniques to audiovisual features extracted from the video content [7], [5], [48], [49]. Though the accuracy reached by such detectors is far from that of a thorough manual annotation, the approach is considered a key building block of content-based video retrieval systems [40], as it allows users to browse video collections with textual queries.
For the practical application of concept detection at a large scale, one key problem is that the underlying machine learning techniques require training data annotated with labels indicating the presence of target concepts. Acquiring this information for large databases and concept vocabularies is a time-consuming and expensive process and poses a key challenge for the practical use of concept detection systems. To satisfy this need for large-scale annotated training data, new sources of information need to be investigated. This paper proposes online video portals as such a data source: large-scale web-based databases like YouTube, blinkx.com, and many others have become popular in recent years and allow a growing community of users to share all kinds of video, ranging from TV news, documentaries, and movie scenes to home-user content like holiday clips and video blogs. To support text-based search, clips are enriched with textual descriptions and tags provided by users during upload.
For concept detection, online video can be both an application area and an information source: when viewing it as an application area [33], [44], [53], concept detection can offer improved keyword search and browsing, help group videos into semantic categories, or support users in labeling their clips to overcome tag incompleteness. As an information source, online video offers a large-scale pool of video content that is dynamically updated and enriched with label information by the web community.
This paper investigates online video both as an application area and as an information source for concept detection. A concept detection prototype named TubeTagger is presented that learns the appearance of semantic concepts autonomously from the video portal YouTube. This overcomes the need for manual annotation and offers two benefits: (1) scalability, i.e. training detectors for thousands of concepts becomes a mere issue of processing power, and (2) flexibility – web video is a dynamic data source that is updated by the web community, allowing concept detectors to keep pace as new concepts of interest emerge (like “Obama” or “Olympics 2008”). On the downside, web video comes with an enormous variation of content, and its associated label information is weak and unreliable, such that training concept detection systems on it poses a difficult challenge.
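The core idea of replacing manual annotation with portal tags can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's actual pipeline: the metadata records and the rule of treating tagged clips as noisy positives and untagged clips as negatives are assumptions made for the example.

```python
# Sketch: assembling a weakly labeled training set from user tags.
# The records and the positive/negative split rule are illustrative only.

def split_by_tag(videos, concept):
    """Treat clips tagged with the concept as (noisy) positives
    and all remaining clips as negatives."""
    positives = [v for v in videos if concept in v["tags"]]
    negatives = [v for v in videos if concept not in v["tags"]]
    return positives, negatives

videos = [
    {"id": "a1", "tags": {"desert", "travel"}},
    {"id": "b2", "tags": {"interview", "news"}},
    {"id": "c3", "tags": {"desert"}},
]
pos, neg = split_by_tag(videos, "desert")  # pos: a1, c3; neg: b2
```

Because tags are weak labels, the positive set inevitably contains clips in which the concept never actually appears on screen; this label noise is precisely what makes training on web video difficult.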
Compared to our previous publications [43], [44], the key novelty of this paper is a quantitative comparison of YouTube-based detectors with systems trained on standard datasets. In these experiments, we demonstrate that the automatic tagging of web video is feasible, with a mean average precision of 0.522 reached on a test set of 22 representative concepts. Further, our results show that – when detecting concepts in YouTube clips – training on YouTube videos outperforms conventional concept detectors trained on manually annotated news video material. Beyond this, we also demonstrate that YouTube-based taggers generalize to domains unseen during training as well as standard detectors do. Finally, we show that enriching standard datasets with training material from YouTube yields performance improvements when generalizing to novel data sources.
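For reference, the mean average precision (MAP) figure quoted above is the standard retrieval metric: average precision is computed per concept over the ranked result list, then averaged over the concept vocabulary. A minimal sketch of the metric itself (the standard definition, not the paper's evaluation code):

```python
def average_precision(ranked_relevance):
    """AP for one concept: mean of precision@k over all ranks k
    at which a relevant video occurs in the ranked list."""
    hits, precisions = 0, []
    for k, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(rankings):
    """MAP: average of per-concept AP values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Example: relevant videos at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
print(average_precision([1, 0, 1]))  # ~0.833
```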
This paper is organized as follows: the state-of-the-art in concept detection (including methods, datasets, and the use of web video) is discussed in Section 2. After this, web video is introduced as an information source for concept learning, and a concept detection prototype based on this idea is presented (Section 3). Experimental results are given in Section 4, followed by conclusions in Section 5.
Section snippets
Materials and methods
Concept detection is targeted at automatically inferring the presence of semantic concepts (like objects, locations, or activities) from the audiovisual content of a video stream. Given a vocabulary of concepts c_1, ..., c_n and an input video x, the task is to estimate concept scores P(c_1|x), ..., P(c_n|x) (where c_1, ..., c_n are Boolean random variables indicating concept presence). These scores can be used to label videos, or to sort them into a ranked retrieval list for text-based search [40]. Concept detection is
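The ranked retrieval list mentioned above follows directly from the concept scores: for a queried concept, videos are sorted by descending detector score. A minimal sketch (the video ids and score values are placeholders; in the actual system the scores come from trained detectors):

```python
# Hypothetical detector scores for one concept, e.g. "airplane".
scores = {"v1": 0.12, "v2": 0.87, "v3": 0.55}

def ranked_list(scores):
    """Sort video ids by descending concept score."""
    return sorted(scores, key=scores.get, reverse=True)

print(ranked_list(scores))  # ['v2', 'v3', 'v1']
```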
Approach – online video tagging using TubeTagger
As has been outlined, the state-of-the-art approach of training concept detection systems is to manually acquire concept annotations of high quality but limited quantity. While joint community efforts have made it possible to train concept detectors for hundreds of concepts [49], several limitations remain regarding the time and cost associated with such training data acquisition:
- (1) The number of samples required per concept is high due to strong intra-class variation of many concepts. For
Experiments
In this section, two experiments are presented in which the TubeTagger system is trained on real-world online videos downloaded from YouTube. In the first experiment, the system is both trained on and applied to online video content. When used in this setup, TubeTagger is targeted at an automatic content-based indexing and improved search in web video portals. The tagging performance achievable on web video is quantified, and the performance for several kinds of feature pipelines is compared.
Conclusions
In this paper, we have proposed to train concept detection systems on online videos publicly available at a large scale from portals such as YouTube. Thereby, high-quality manual training annotations are replaced with YouTube tags. This setup allows for an autonomous learning, which offers significant advantages in terms of scalability and flexibility by avoiding the tedious task of explicit manual annotation. A concept detection system named TubeTagger has been presented that acquires training
References (54)
- Techniques used and open challenges to the analysis, indexing, and retrieval of digital video, Inf. Syst. (2007)
- R. Akbani, S. Kwek, N. Japkowicz, Applying support vector machines to imbalanced datasets, in: Proc. Europ. Conf....
- A. Amir, M. Berg, S. Chang, W. Hsu, G. Iyengar, C.-Y. Lin, M. Naphade, A. Natsev, C. Neti, H. Nock, J. Smith, B. Tseng,...
- H. Bay, T. Tuytelaars, L. van Gool, SURF: speeded up robust features, in: Proc. Europ. Conf. Computer Vision, May 2006,...
- D. Borth, A. Ulges, C. Schulze, T. Breuel, Keyframe extraction for video tagging and summarization, in: Proc....
- C. Snoek et al., The MediaMill TRECVID 2007 semantic video search engine, in: Proc. TRECVID Workshop (Unreviewed...
- C. Snoek et al., The MediaMill TRECVID 2008 semantic video search engine, in: Proc. TRECVID Workshop (Unreviewed...
- M. Campbell, A. Haubold, M. Liu, A. Natsev, J. Smith, J. Tesic, L. Xie, R. Yan, J. Yang, IBM research TRECVID-2007...
- C.-C. Chang, C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2001. Software available at...
- S.-F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, J. Luo, Large-scale multimodal semantic concept...
- Reliable transition detection in videos: a survey and practitioner’s guide, Int. J. Img. Graph.
- Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis.