Learning automatic concept detectors from online video
Introduction
Modern video retrieval systems employ textual descriptions indicating the presence of semantic concepts in videos, like objects (“airplane”), persons (“Michael Jackson”), locations (“desert”), and activities (“interview”) [37]. However, in many practical situations, such descriptions are not at hand, and manual labeling is infeasible due to the enormous size of today’s video databases. To overcome this problem, concept detection (or video tagging) systems automatically infer the presence of semantic concepts by applying machine learning techniques to audiovisual features extracted from the video content [7], [5], [48], [49]. Though the accuracy reached by such detectors is far from that of a thorough manual annotation, the approach is considered a key building block of content-based video retrieval systems [40], as it allows users to browse video collections with textual queries.
For the practical application of concept detection at a large scale, one key problem is that the underlying machine learning techniques require training data annotated with labels indicating the presence of target concepts. Acquiring this information for large databases and concept vocabularies is a time-consuming and expensive process and poses a key challenge for the practical use of concept detection systems. To satisfy this need for large-scale annotated training data, new sources of information need to be investigated. This paper proposes online video portals as such a data source: large-scale web-based databases like YouTube, blinkx.com, and many others have become popular in recent years and allow a growing community of users to share all kinds of video, ranging from TV news, documentaries, and movie scenes to home-user content like holiday clips and video blogs. To support text-based search, clips are enriched with textual descriptions and tags provided by users during upload.
For concept detection, online video can be both an application area and an information source: when viewing it as an application area [33], [44], [53], concept detection can offer improved keyword search and browsing, help group videos into semantic categories, or support users in labeling their clips to overcome tag incompleteness. As an information source, online video offers a large-scale pool of video content that is dynamically updated and enriched with label information by the web community.
This paper investigates online video both as an application area and as an information source for concept detection. A concept detection prototype named TubeTagger is presented that learns the appearance of semantic concepts autonomously from the video portal YouTube. This overcomes the need for manual annotation and offers two benefits: (1) scalability, i.e. training detectors for thousands of concepts becomes a mere issue of processing power, and (2) flexibility – web video is a dynamic data source that is updated by the web community, allowing concept detectors to keep pace as new concepts of interest emerge (like “Obama” or “Olympics 2008”). On the downside, web video comes with an enormous variation of content, and its associated label information is weak and unreliable, such that training concept detection systems on it poses a difficult challenge.
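The core idea of replacing manual annotation with portal tags can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's actual pipeline: the metadata records and the rule of treating tagged clips as noisy positives and untagged clips as negatives are assumptions made for the example.

```python
# Sketch: assembling a weakly labeled training set from user tags.
# The records and the positive/negative split rule are illustrative only.

def split_by_tag(videos, concept):
    """Treat clips tagged with the concept as (noisy) positives
    and all remaining clips as negatives."""
    positives = [v for v in videos if concept in v["tags"]]
    negatives = [v for v in videos if concept not in v["tags"]]
    return positives, negatives

videos = [
    {"id": "a1", "tags": {"desert", "travel"}},
    {"id": "b2", "tags": {"interview", "news"}},
    {"id": "c3", "tags": {"desert"}},
]
pos, neg = split_by_tag(videos, "desert")  # pos: a1, c3; neg: b2
```

Because tags are weak labels, the positive set inevitably contains clips in which the concept never actually appears on screen; this label noise is precisely what makes training on web video difficult.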
Compared to our previous publications [43], [44], the key novelty of this paper is a quantitative comparison of YouTube-based detectors with systems trained on standard datasets. In these experiments, we demonstrate that the automatic tagging of web video is feasible, with a mean average precision of 0.522 reached on a test set of 22 representative concepts. Further, our results show that – when detecting concepts in YouTube clips – training on YouTube videos outperforms conventional concept detectors trained on manually annotated news video material. Beyond this, we also demonstrate that YouTube-based taggers generalize to domains unseen during training as well as standard detectors do. Finally, we show that enriching standard datasets with training material from YouTube yields performance improvements when generalizing to novel data sources.
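For reference, the mean average precision (MAP) figure quoted above is the standard retrieval metric: average precision is computed per concept over the ranked result list, then averaged over the concept vocabulary. A minimal sketch of the metric itself (the standard definition, not the paper's evaluation code):

```python
def average_precision(ranked_relevance):
    """AP for one concept: mean of precision@k over all ranks k
    at which a relevant video occurs in the ranked list."""
    hits, precisions = 0, []
    for k, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(rankings):
    """MAP: average of per-concept AP values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Example: relevant videos at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
print(average_precision([1, 0, 1]))  # ~0.833
```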
This paper is organized as follows: the state-of-the-art in concept detection (including methods, datasets, and the use of web video) is discussed in Section 2. After this, web video is introduced as an information source for concept learning, and a concept detection prototype based on this idea is presented (Section 3). Experimental results are given in Section 4, followed by conclusions in Section 5.
Section snippets
Materials and methods
Concept detection is targeted at automatically inferring the presence of semantic concepts (like objects, locations, or activities) from the audiovisual content of a video stream. Given a vocabulary of concepts c_1, ..., c_n and an input video x, the task is to estimate concept scores P(c_1|x), ..., P(c_n|x) (where c_1, ..., c_n are Boolean random variables indicating concept presence). These scores can be used to label videos, or to sort them into a ranked retrieval list for text-based search [40]. Concept detection is
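The ranked retrieval list mentioned above follows directly from the concept scores: for a queried concept, videos are sorted by descending detector score. A minimal sketch (the video ids and score values are placeholders; in the actual system the scores come from trained detectors):

```python
# Hypothetical detector scores for one concept, e.g. "airplane".
scores = {"v1": 0.12, "v2": 0.87, "v3": 0.55}

def ranked_list(scores):
    """Sort video ids by descending concept score."""
    return sorted(scores, key=scores.get, reverse=True)

print(ranked_list(scores))  # ['v2', 'v3', 'v1']
```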
Approach – online video tagging using TubeTagger
As has been outlined, the state-of-the-art approach of training concept detection systems is to manually acquire concept annotations of high quality but limited quantity. While joint community efforts have made it possible to train concept detectors for hundreds of concepts [49], several limitations remain regarding the time and cost associated with such training data acquisition:
- (1) The number of samples required per concept is high due to strong intra-class variation of many concepts. For
Experiments
In this section, two experiments are presented in which the TubeTagger system is trained on real-world online videos downloaded from YouTube. In the first experiment, the system is both trained on and applied to online video content. When used in this setup, TubeTagger is targeted at an automatic content-based indexing and improved search in web video portals. The tagging performance achievable on web video is quantified, and the performance for several kinds of feature pipelines is compared.
Conclusions
In this paper, we have proposed to train concept detection systems on online videos publicly available at a large scale from portals such as YouTube. Thereby, high-quality manual training annotations are replaced with YouTube tags. This setup allows for an autonomous learning, which offers significant advantages in terms of scalability and flexibility by avoiding the tedious task of explicit manual annotation. A concept detection system named TubeTagger has been presented that acquires training
References (54)
- Techniques used and open challenges to the analysis, indexing, and retrieval of digital video, Inf. Syst. (2007)
- R. Akbani, S. Kwek, N. Japkowicz, Applying support vector machines to imbalanced datasets, in: Proc. Europ. Conf....
- A. Amir, M. Berg, S. Chang, W. Hsu, G. Iyengar, C.-Y. Lin, M. Naphade, A. Natsev, C. Neti, H. Nock, J. Smith, B. Tseng,...
- H. Bay, T. Tuytelaars, L. van Gool, SURF: speeded up robust features, in: Proc. Europ. Conf. Computer Vision, May 2006,...
- D. Borth, A. Ulges, C. Schulze, T. Breuel, Keyframe extraction for video tagging and summarization, in: Proc....
- C. Snoek et al., The MediaMill TRECVID 2007 semantic video search engine, in: Proc. TRECVID Workshop (Unreviewed...
- C. Snoek et al., The MediaMill TRECVID 2008 semantic video search engine, in: Proc. TRECVID Workshop (Unreviewed...
- M. Campbell, A. Haubold, M. Liu, A. Natsev, J. Smith, J. Tesic, L. Xie, R. Yan, J. Yang, IBM research TRECVID-2007...
- C.-C. Chang, C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2001. Software available at...
- S.-F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, J. Luo, Large-scale multimodal semantic concept...
- Reliable transition detection in videos: a survey and practitioner’s guide, Int. J. Img. Graph.
- Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis.