
About this book

The two-volume set LNCS 7732 and 7733 constitutes the thoroughly refereed proceedings of the 19th International Conference on Multimedia Modeling, MMM 2013, held in Huangshan, China, in January 2013.

The 30 revised regular papers, 46 special session papers, 20 poster session papers, 15 demo session papers, and 6 video browser showdown papers were carefully reviewed and selected from numerous submissions. The two volumes contain papers presented in the topical sections on multimedia annotation I and II, interactive and mobile multimedia, classification, recognition and tracking I and II, ranking in search, multimedia representation, multimedia systems, poster papers, special session papers, demo session papers, and video browser showdown.



Special Session Papers

Mobile-Based Multimedia Analysis

Quality Assessment on User Generated Image for Mobile Search Application

Quality-specified image retrieval helps to improve the user experience in mobile search and social media sharing. However, models for evaluating the quality of user-generated images, which are popular in social media sharing, remain unexplored. In this paper, we propose a scheme for quality assessment of user-generated images. The scheme is formed by four attribute dimensions: intrinsic quality, favorability, relevancy and accessibility of images. Each of the dimensions is defined and modeled, and the dimensions are pooled into a final quality score for a user-generated image. The proposed scheme can reveal the quality of user-generated images in a comprehensive manner. Experimental results show that the scores obtained by our scheme have high correlation coefficients with the benchmark data. Therefore, our scheme is suitable for quality-specified image retrieval in mobile applications.

Qiong Liu, You Yang, Xu Wang, Liujuan Cao

2D/3D Model-Based Facial Video Coding/Decoding at Ultra-Low Bit-Rate

For facial video coding/decoding in mobile communication, a 2D/3D model-based system is proposed: (1) an online appearance model and a cylinder head model are combined to track 3D facial motion in a particle filter; (2) 3D facial animation is produced by combining a parameterized model and a muscular model; (3) 3D hair is synthesized by hair detection and a 3D hair model; (4) 3D coding/decoding of the foreground and 2D coding/decoding of the background are stitched together. At ultra-low bit-rate, the objective experiment confirmed the system's favorable trade-off between coding efficiency and decoding quality, and the between-subjects experiment indicated its suitability for subjective face identification.

Jun Yu, Zengfu Wang, Yang Cao

Hierarchical Text Detection: From Word Level to Character Level

Text detection is a challenging task in computer vision. In this paper, we focus on English text detection in natural scene images. We propose a hierarchical approach for text detection, which unifies word-level text detection and character-level detection as well as the text spatial layout. In our approach, we first use the stroke width transform (SWT) to filter an image at the word level. Secondly, we employ a random forest to select discriminative features of characters and compute the confidence values of characters. Finally, we use a conditional random field to integrate the discriminative information with the text spatial layout, which separates the text from the background. The proposed approach is evaluated on the ICDAR dataset, which is a challenging dataset for text detection, and the experimental results demonstrate that our approach is efficient and effective, and that it is superior to the state-of-the-art methods on comprehensive criteria.

Yanyun Qu, Weimin Liao, Shen Lu, Shaojie Wu

Building a Large Scale Test Collection for Effective Benchmarking of Mobile Landmark Search

Studying and analyzing system performance is one of the fundamental factors for technological advancement in image retrieval. In this paper, we report the construction of a large-scale test collection for facilitating robust performance evaluation of mobile landmark image search. In total, the test collection consists of (1) 355,141 images of 128 landmarks in five cities over 3 continents from Flickr; (2) different kinds of textual features for each image, including surrounding text (e.g. tags), contextual data (e.g. geo-location and upload time), and metadata (e.g. uploader and EXIF); and (3) six types of low-level visual features. For the task of landmark image retrieval evaluation, we also conduct a series of baseline experimental studies on the search performance over different visual queries, which represent different views of a landmark.

Zhiyong Cheng, Jing Ren, Jialie Shen, Haiyan Miao

Geographical Retagging

While the geographical tag has brought novel insight into multimedia content analysis and understanding, how to improve tagging accuracy has rarely been explored. In this paper, we present a novel geographical retagging algorithm to improve inaccurate geographical tags from an automatic, photo content based association and refinement perspective. We do not resort to time-consuming camera pose estimation and scene geometry recovery schemes like structure-from-motion. Instead, our algorithm is based on a very simple neighbor statistical significance test, i.e., geographically nearby images, if near duplicate, should follow a smoother affine transform than those farther away. Such an assumption is robust to noisy photo content caused by multiple factors, such as indoor/outdoor changes, occlusions, or viewing angle changes. It is also very fast compared to alternative approaches like structure-from-motion or simultaneous localization and mapping. We show the accuracy, efficiency, and robustness of the proposed retagging algorithm for refining the geographical tags of Flickr images.

Liujuan Cao, Yue Gao, Qiong Liu, Rongrong Ji

Multimedia Retrieval and Management with Human Factors

Recompilation of Broadcast Videos Based on Real-World Scenarios

In order to make effective use of videos stored in a broadcast video archive, we have been working on their recompilation. To realize this, we take an approach that considers the videos in the archive as video materials and recompiles them by considering various kinds of social media information as “scenarios”. In this paper, we introduce our work in the news, sports, and cooking domains, which makes use of Wikipedia articles, demoscopic polls, Twitter tweets, and cooking recipes in order to recompile video clips from corresponding TV shows.

Ichiro Ide

Evaluating Novice and Expert Users on Handheld Video Retrieval Systems

Content-based video retrieval systems have been widely associated with desktop environments that are largely complex in nature, targeting expert users and often requiring complex queries. Due to this complexity, interaction with these systems can be a challenge for regular “novice” users. In recent years, a shift can be observed from this traditional desktop environment to that of handheld devices, which requires a different approach to interacting with the user. In this paper, we evaluate the performance of a handheld content-based video retrieval system on both expert and novice users. We show that with this type of device, a simple and intuitive interface, which incorporates the principles of content-based systems, though hidden from the user, attains the same accuracy for both novice and desktop users when faced with complex information retrieval tasks. We describe an experiment which utilises the Apple iPad as our handheld medium, in which both a group of experts and a group of novice users run the interactive experiments from the 2010 TRECVid Known-Item Search task. The results indicate that a carefully defined interface can equalise the performance of both novice and expert users.

David Scott, Frank Hopfgartner, Jinlin Guo, Cathal Gurrin

Perfect Snapping

Interactive image matting is a process that extracts a foreground object from an image based on limited user input. In this paper, we propose a novel interactive image matting algorithm named Perfect Snapping, which is inspired by the existing Lazy Snapping technique. First, the mean shift algorithm with a boundary confidence prior is introduced to efficiently pre-segment the original image into homogeneous regions (super-pixels) with precise boundaries. Secondly, a Gaussian Mixture Model (GMM) clustering algorithm is used to describe and model the super-pixels. Finally, a Monte Carlo based Expectation Maximization (EM) algorithm is used to perform parametric learning of the mixture model for prior knowledge. Experimental results indicate that the proposed algorithm can achieve higher matting quality with higher efficiency.

Qingsong Zhu, Ling Shao, Qi Li, Yaoqin Xie

Smart Video Browsing with Augmented Navigation Bars

While accuracy and speed get a lot of attention in video retrieval research, the investigation of interactive retrieval tools gets less attention and is often regarded as trivial. We want to show that even simple ideas have potential to improve the retrieval performance by giving some automated support to the browsing user. We present a video browsing concept where video segments are clustered in several latent classes of similar content. The navigation bars of our video browser are augmented with different colors indicating where elements of these clusters are located. As humans are able to classify the content of clusters fast, they can benefit from this information when browsing a video. We present a study where we investigated how humans can be supported in different video browsing tasks with a color-based and a motion-based clustering of video content.

Manfred Del Fabro, Bernd Münzer, Laszlo Böszörmenyi

Human Action Search Based on Dynamic Shape Volumes

In this paper, an interactive system for human action video search is developed based on dynamic shape volumes. The user is allowed to create a search query by freely and continuously posing any number of actions in front of the Kinect sensor. For the captured query video sequence and each data stream of the human action video database, we extract useful shape properties on the basis of space-time volumes by exploiting the solution to the Poisson equation. Unlike conventional learning-based human action recognition techniques, we apply approximate string matching (ASM) to achieve local alignment for the matching of two video sequences. The experiments demonstrate the effectiveness of our system in support of the user’s search task.

Hong-Ming Chen, Wen-Huang Cheng, Min-Chun Hu, Yan-Ching Lin, Yung-Huan Hsieh
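
The approximate string matching step mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say how shape properties are turned into symbols or which edit costs are used, so this example assumes each frame has already been quantized into a symbol, uses unit costs, and follows the classic dynamic-programming formulation in which a match may start anywhere in the stream. The function name approximate_match is hypothetical.

```python
def approximate_match(query, stream):
    """Find the substring of `stream` closest to `query` in edit distance.

    Returns (best_cost, best_end), where best_end is the 1-based position
    in `stream` at which the best-matching substring ends.
    Unit insert/delete/substitute costs are an assumption of this sketch.
    """
    m = len(query)
    # Column for the empty stream prefix: matching query[:i] costs i deletions.
    col = list(range(m + 1))
    best_cost, best_end = col[m], 0
    for j, symbol in enumerate(stream, start=1):
        # new[0] stays 0 so that a match may begin at any stream position.
        new = [0] * (m + 1)
        for i in range(1, m + 1):
            substitute = col[i - 1] + (query[i - 1] != symbol)
            skip_stream_symbol = col[i] + 1
            skip_query_symbol = new[i - 1] + 1
            new[i] = min(substitute, skip_stream_symbol, skip_query_symbol)
        col = new
        if col[m] < best_cost:
            best_cost, best_end = col[m], j
    return best_cost, best_end


# Symbols stand in for quantized per-frame shape descriptors (hypothetical).
cost, end = approximate_match("abc", "xxabcxx")
print(cost, end)
```

Because the first row of the table is held at zero, the query is aligned locally: leading and trailing stream symbols outside the best-matching segment cost nothing.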

Video Retrieval Based on User-Specified Appearance and Application to Animation Synthesis

In our research group, we investigate techniques for retrieving videos based on user-specified appearances. In this paper, we introduce two of our research activities.

First, we present a user interface for quickly and easily retrieving scenes of a desired appearance from videos. Given an input image, our system allows the user to sketch a transformation of an object inside the image, and then retrieves scenes showing this object in the user-specified transformed pose. Our method employs two steps to retrieve the target scenes. We first apply a standard image-retrieval technique based on feature matching, and find scenes in which the same object appears in a similar pose. Then we find the target scene by automatically forwarding or rewinding the video, starting from the frame selected in the previous step. When the user-specified transformation is matched, we stop forwarding or rewinding, and thus the target scene is retrieved. We demonstrate that our method successfully retrieves scenes of a racing car, a running horse, and a flying airplane with user-specified poses and motions.

Secondly, we present a method for synthesizing fluid animation from a single image, using a fluid video database. The user inputs a target painting or photograph of a fluid scene. Employing the database of fluid video examples, the core algorithm of our technique then automatically retrieves and assigns appropriate fluid videos for each part of the target image. The procedure can thus be used to handle various paintings and photographs of rivers, waterfalls, fire, and smoke, and the resulting animations demonstrate that it is more powerful and efficient than our prior work.

Makoto Okabe, Yuta Kawate, Ken Anjyo, Rikio Onai

Location Based Social Media

Landmark History Visualization

Landmark image mining and detection have been studied for many years. However, most of the existing work focuses on spatial attributes while largely ignoring the temporal information in specific images that were taken during historical moments. Such images are more valuable than ordinary ones, as they not only contain more comprehensive information to illustrate the landmarks at different moments, but are also useful in many real-world applications, such as tour recommendation and history education. In this paper, we present a novel framework named Landmark History Visualization (LHV) to mine relevant and diverse images for each landmark’s historic moments. There are two steps in LHV. The first is to extract the event list of each landmark from Wikipedia. The event keywords are extracted, and some of them are automatically labeled as 3W (What, Who, When). In the second step, images retrieved by landmark name are first collected from Flickr and Google Images. Next, we employ manifold ranking with the detected 3W to retrieve the relevant images, and lastly, an outlier detection and diversification based re-ranking approach is introduced to provide users with a diverse set of returned images. We implemented our approach on 6 landmarks, and the results demonstrate the effectiveness of LHV.

Weiqing Min, Bing-Kun Bao, Changsheng Xu

Discovering Latent Clusters from Geotagged Beach Images

This paper studies the problem of estimating the geographical locations of images. To build reliable geographical estimators, an important question is how to find distinguishable geographical clusters in the world. Those clusters cover general geographical regions and are not limited to landmarks. The geographical clusters provide more training samples and hence lead to better recognition accuracy. Previous approaches build geographical clusters using heuristics or arbitrary map grids, and cannot guarantee the effectiveness of the geographical clusters. This paper develops a new framework for geographical cluster estimation, and employs latent variables to estimate the geographical clusters. To solve this problem, this paper builds on recent progress in object detection, and constructs an efficient solver to find the latent clusters. The results on beach datasets validate the success of our method.

Yang Wang, Liangliang Cao

3D Video Depth and Texture Analysis and Compression

An Error Resilient Depth Map Coding Scheme Using Adaptive Wyner-Ziv Frame

Depth information is one of the most important parameters in three-dimensional video (3DV). When depth maps are transmitted through error-prone networks, distortion due to packet loss leads to geometric errors during Depth Image Based Rendering (DIBR) and affects the rendered video quality. In this paper, we propose an error resilient depth map coding scheme using adaptive Wyner-Ziv (WZ) frames. For each depth frame, whether it is encoded as a Wyner-Ziv frame is decided by a joint source-channel R-D optimization (JSC-RDO) algorithm. JSC-RDO involves the end-to-end distortion model of depth coding and the estimation of the expected rate and distortion of WZ coding. Motion information of the corresponding color video is used to correct the depth error and generate the side information for WZ coding. The Lagrange multiplier used in JSC-RDO is derived taking into account both the packet-loss environment and the rendered view distortion. Experimental results show that the proposed error resilient scheme achieves a better overall R-D performance than existing schemes.

Xiangkai Liu, Qiang Peng, Xiao Wu, Lei Zhang, Xu Xia, Lingyu Duan

A New Closed Loop Method of Super-Resolution for Multi-view Images

In this paper, we propose a closed loop method to solve the multi-view super-resolution problem. Given one high-resolution view along with its neighboring low-resolution views as input, our method can produce super-resolution results and obtain a high quality depth map simultaneously. The closed loop method consists of two parts: part I, stereo matching and depth map fusion, and part II, super-resolution. Under the guidance of the depth information, the super-resolution process can be divided into three steps: disparity based pixel mapping, nonlocal construction and final fusion. Once we have the super-resolution results, we can update the disparity maps and, in addition, use the proposed 3D-median filter to update the depth map. We repeat the loop several times to obtain high quality super-resolution results and a high quality depth map simultaneously. The experimental results show that the proposed method can achieve high quality performance at various scale factors.

Jing Zhang, Yang Cao, Zhigang Zheng, Zengfu Wang

Fast Coding Unit Decision Algorithm for Compressing Depth Maps in HEVC

The stereoscopic video system creates a 3-D scene using the color map and the virtual view which is synthesized by the color and depth maps. Compressing 3-D video sequences is a crucial research issue. Based on Zhao et al.’s depth no-synthesis-error model, this paper presents a fast coding unit decision algorithm for compressing depth maps in HEVC (High Efficiency Video Coding). The proposed coding unit decision algorithm determines, as early as possible, the minimum coding unit size required in the quadtree structure of HEVC while preserving an acceptable quality. Experimental results demonstrate that the proposed coding unit decision algorithm computationally outperforms the one used in HEVC while incurring negligible degradation on both quality and bitrate performance.

Yung-Hsiang Chiu, Kuo-Liang Chung, Wei-Ning Yang, Yong-Huai Huang, Chih-Ming Lin

Fast Mode Decision Based on Optimal Stopping Theory for Multiview Video Coding

Optimal stopping theory is developed to achieve a good trade-off between decision performance and decision efforts such as the consumed decision time. In this paper, the optimal stopping theory is applied to fast mode decision for multiview video coding in order to reduce the tremendous encoding computational complexity, with the benefit of theoretical decision-making expectation and predictable decision performance. The characteristics of encoding modes in multiview video coding are studied to derive an optimal stopping theory-based model to early terminate mode decision and thus a fast mode decision algorithm is designed. Extensive experimental results demonstrate that the proposed algorithm can save a great amount of encoding time for multiview video coding and meanwhile keep the compression performance more or less intact.

Hanli Wang, Yue Heng, Tiesong Zhao, Bo Xiao

Inferring Depth from a Pair of Images Captured Using Different Aperture Settings

Given two pictures of the same scene captured using the same camera and the same lens, the one captured with a large aperture will appear partially blurred while the other captured with a small aperture will appear totally sharp. This paper investigates two possible ways of inferring depth of the scene from such an image pair with the constraint that both pictures are focused on the closest point of the scene. Our first method uses a series of Gaussian kernels to blur the image pair, and in the second method, the image pair will be shrunk to a series of smaller dimensions. In both methods, sharp areas in both images will always stay similar to each other, whereas the areas that appear sharp in one image but blurred in the other will not be similar until they are blurred using a large Gaussian kernel or shrunk to small dimensions. This observation enables us to roughly tell which objects in the scene are closer to us and which ones are farther away. At the end of this paper, we will discuss the limitations of our proposed approaches and some of the directions for future work.

Yujun Li, Oscar C. Au, Lingfeng Xu, Wenxiu Sun, Wei Hu

Large-Scale Rich Media Search and Management in the Social Web

Multiple Instance Learning for Automatic Image Annotation

Most traditional approaches for automatic image annotation cannot provide reliable annotations at the object level because it can be very expensive to obtain large amounts of object-level labeled images associated with individual regions. To reduce the cost of manual annotation at the object level, multiple instance learning, which can leverage loosely-labeled training images for object classifier training, has become a popular research topic in the multimedia research community. One bottleneck for supporting multiple instance learning is the computational cost of searching for and identifying positive instances in the positive bags. In this paper, a novel two-stage multiple instance learning algorithm is developed for automatic image annotation. In the first stage, the affinity propagation (AP) clustering technique is performed on the instances in both the positive bags and the negative bags to identify the candidates for positive instances and to initialize the search for the maximum of the Diverse Density likelihood. In the second stage, the most positive instances are selected in each bag to simplify the computation of the Diverse Density likelihood. Our experiments on two well-known image sets have provided very positive results.

Zhaoqiang Xia, Jinye Peng, Xiaoyi Feng, Jianping Fan

Combining Topic Model and Relevance Filtering to Localize Relevant Frames in Web Videos

Numerous web videos associated with rich metadata are available on the Internet today. While metadata such as video tags bring convenience and opportunities for video search and multimedia content understanding, some challenges also arise because those video tags are usually annotated at the video level while many tags actually describe only parts of the video content. Thus, how to localize the relevant parts or frames of a web video for given tags is key to many applications and research tasks. In this paper we propose to combine a topic model and relevance filtering to localize relevant frames. Our method is designed in three steps. First, we apply relevance filtering to assign relevance scores to video frames, and a raw relevant frame set is obtained by selecting the top ranked frames. Then we separate the frames into topics by mining the underlying semantics using Latent Dirichlet Allocation, and use the raw relevant frame set as a validation set to select relevant topics. Finally, the topical relevances are used to refine the raw relevant frame set and the final results are obtained. Experimental results on real web videos validate the effectiveness of the proposed approach.

Lei Yi, Haojie Li, Shi-Yong Neo

A Lightweight Fingerprint Recognition Mechanism of User Identification in Real-Name Social Networks

Today, the popularity of social networks poses a great threat to users’ information; thus, the work of security and privacy protection is becoming increasingly important and urgent. This paper explores the problem of user identification based on biometric methods in the context of security and privacy issues for social network sites. In this paper, we propose a lightweight fingerprint recognition mechanism for user identification in real-name social networks. We describe the architecture of our proposed lightweight fingerprint recognition system using a block diagram, and explain how the important steps of the proposed mechanism, such as minutiae detection, lightweight operation and minutiae matching, are implemented. We have performed experiments to evaluate the user identification reliability of the proposed mechanism. The results of the experiments show that the performance of our lightweight fingerprint recognition system is realistic.

Haibin Cai, Zishan Qin, Yunyun Su, Junnan Tu, Linhua Jiang

A Novel Binary Feature from Intensity Difference Quantization between Random Sample of Points

With the explosive growth of web multimedia data, how to manage and retrieve web-scale data more efficiently has become an urgent problem, which calls for more efficient low-level features with low computational cost. This pressing need poses a huge challenge to conventional features. It is urgent to make descriptors more compact and faster while remaining robust to many different kinds of image transformation. To this end, this paper proposes a fast descriptor for local patches. It consists of a string of binary bits derived from the intensity difference quantization (IDQ) between pixel pairs, which are chosen according to a fixed random sample pattern, so we call it DIDQ (descriptor based on IDQ). Our experiments show that DIDQ is very fast to compute and also more robust than other existing binary features.

Dongye Zhuang, Dongming Zhang, Jintao Li, Qi Tian
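
As a rough illustration of the idea in the abstract above (binary bits derived from intensity differences between pixel pairs drawn from a fixed random pattern), the following sketch builds a binary patch descriptor and compares two descriptors by Hamming distance. The abstract does not specify the quantization scheme, so this sketch assumes the simplest one: one bit per pair, set when the difference exceeds a threshold. The names make_pattern, describe and hamming are hypothetical and this is not the DIDQ implementation itself.

```python
import random


def make_pattern(patch_size, n_pairs, seed=0):
    """Draw a fixed random pattern of pixel-coordinate pairs inside a patch."""
    rng = random.Random(seed)  # fixed seed: the same pattern for every patch

    def point():
        return (rng.randrange(patch_size), rng.randrange(patch_size))

    return [(point(), point()) for _ in range(n_pairs)]


def describe(patch, pattern, threshold=0):
    """Pack one bit per pixel pair into an integer descriptor.

    Assumed quantization: bit = 1 if intensity difference > threshold.
    """
    bits = 0
    for (r1, c1), (r2, c2) in pattern:
        bit = 1 if patch[r1][c1] - patch[r2][c2] > threshold else 0
        bits = (bits << 1) | bit
    return bits


def hamming(a, b):
    """Number of differing bits between two descriptors."""
    return bin(a ^ b).count("1")


pattern = make_pattern(patch_size=8, n_pairs=32)
patch = [[r * 8 + c for c in range(8)] for r in range(8)]
brighter = [[value + 10 for value in row] for row in patch]
print(hamming(describe(patch, pattern), describe(brighter, pattern)))
```

Because only differences between pixels are used, adding a constant brightness offset to the patch leaves the descriptor unchanged in this sketch, and matching reduces to an XOR followed by a bit count.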

Beyond Kmedoids: Sparse Model Based Medoids Algorithm for Representative Selection

We consider the problem of finding a representative subset of a dataset, which can efficiently serve as a condensed view of the entire dataset. The Kmedoids algorithm is a commonly used unsupervised method, which selects center points as representatives. Those center points are mainly located in high density areas and surrounded by other data points. However, boundary points in low density areas, which are useful for classification problems, are usually overlooked. In this paper we propose a sparse model based medoids algorithm (Smedoids) which aims to learn a special dictionary. Each column of this dictionary is a representative data point from the dataset, and each data point of the dataset can be described well by a linear combination of the columns of this dictionary. In this way, both center and boundary points are selected as representatives. Experiments evaluate the performance of our method for finding representatives of real datasets on the image and video summarization problem and the multi-class classification problem, and our method is shown to outperform the state of the art in accuracy.

Yu Wang, Sheng Tang, FeiDie Liang, YaLin Zhang, JinTao Li

Improving Automatic Image Tagging Using Temporal Tag Co-occurrence

Existing automatic image annotation (AIA) systems that depend solely on low-level image features often produce poor results, particularly when annotating real-life collections. Tag co-occurrence has been shown to improve image annotation by identifying additional keywords associated with user-provided keywords. However, existing approaches have treated tag co-occurrence as a static measure over time, thereby ignoring the temporal trends of many tags. The temporal distribution of tags, caused by events, seasons, memes, etc., provides a strong source of evidence beyond keywords for AIA. In this paper we propose a temporal tag co-occurrence approach to improve AIA accuracy. By segmenting collection tags into multiple co-occurrence matrices, each covering an interval of time, we are able to give precedence to tags which not only co-occur with each other, but also have temporal significance. We evaluate our approach on a real-life timestamped image collection from Flickr by performing experiments over a number of temporal interval sizes. Results show statistically significant improvements in annotation accuracy compared to a non-temporal co-occurrence baseline.

Philip McParlane, Stewart Whiting, Joemon Jose
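
The segmentation of tags into per-interval co-occurrence matrices described in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: photos are given as (timestamp, tags) pairs, intervals are fixed-width buckets, and suggestions are ranked by raw co-occurrence counts within the bucket of the query time. The function names and the toy data are hypothetical, and the paper's actual weighting is not given in the abstract.

```python
from collections import defaultdict
from itertools import combinations


def build_temporal_cooccurrence(photos, interval_seconds):
    """Return {time bucket: {(tag_a, tag_b): count}} with tag_a < tag_b."""
    matrices = defaultdict(lambda: defaultdict(int))
    for timestamp, tags in photos:
        bucket = int(timestamp // interval_seconds)
        # Each unordered pair of distinct tags on a photo co-occurs once.
        for a, b in combinations(sorted(set(tags)), 2):
            matrices[bucket][(a, b)] += 1
    return matrices


def suggest_tags(matrices, seed_tag, timestamp, interval_seconds, top_k=3):
    """Rank tags co-occurring with seed_tag in the bucket containing timestamp."""
    bucket = int(timestamp // interval_seconds)
    scores = defaultdict(int)
    for (a, b), count in matrices.get(bucket, {}).items():
        if a == seed_tag:
            scores[b] += count
        elif b == seed_tag:
            scores[a] += count
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:top_k]]


DAY = 86400
photos = [
    (0 * DAY, ["beach", "sun", "summer"]),
    (1 * DAY, ["beach", "sun"]),
    (200 * DAY, ["beach", "snow", "winter"]),
    (201 * DAY, ["beach", "snow"]),
]
matrices = build_temporal_cooccurrence(photos, interval_seconds=30 * DAY)
print(suggest_tags(matrices, "beach", 0, 30 * DAY))
print(suggest_tags(matrices, "beach", 200 * DAY, 30 * DAY))
```

In this toy collection the same seed tag yields different suggestions depending on the time bucket, which is the effect a single static co-occurrence matrix cannot capture.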

Robust Detection and Localization of Human Action in Video

We propose a robust and efficient method for accurately detecting and localizing complex human actions in video, in both space and time, using spatio-temporal templates. A simple but effective motion descriptor based on the motion-compensated frame difference is designed for template representation; it is resistant to deformation of posture and to cluttered and moving backgrounds. A multi-step filtering scheme is adopted to speed up the localization of target candidates and their matching to the templates. For template-sequence-to-video registration, we present an extended continuous dynamic programming technique which can compute the matching scores for multiple trajectories simultaneously. Extensive experimental results on different videos have demonstrated the effectiveness of the proposed method.

Haojie Li, Fuming Sun, Yue Guan

Multimedia Content Analysis Using Social Media Data

A Sparse Coding Based Transfer Learning Framework for Pedestrian Detection

Pedestrian detection is a fundamental problem in video surveillance and has seen great progress in recent years. However, training a generic detector that performs well in a great variety of scenes has proved to be very difficult. On the other hand, exhaustive manual labeling for each specific scene to achieve high detection accuracy is not acceptable, especially for video surveillance applications. In order to reduce the manual labeling effort without sacrificing detection accuracy, we propose a transfer learning framework to automatically train a scene-specific pedestrian detector starting from a pre-trained generic detector. In our framework, sparse coding is used to calculate similarities between source samples and a small set of selected target samples by using the former as the dictionary. The similarities are later used to calculate the weights of source samples. The weights of initially detected target samples are calculated in a similar way, but using the selected target dataset as the dictionary. By using these weighted samples during the re-training process, our framework can efficiently obtain a scene-specific pedestrian detector. Our experiments on the VIRAT dataset show that our trained scene-specific pedestrian detector performs well and is comparable with a detector trained on a large number of training samples manually labeled from the target scene.

Feidie Liang, Sheng Tang, Yu Wang, Qi Han, Jintao Li

Sampling of Web Images with Dictionary Coherence for Cross-Domain Concept Detection

Due to the existence of cross-domain incoherence resulting from the mismatch of data distributions, how to select sufficient positive training samples from scattered and diffused web resources is a challenging problem in the training of effective concept detectors. In this paper, we propose a novel sampling approach to select coherent positive samples from web images for further concept learning based on the degree of image coherence with a given concept. We propose to measure the coherence in terms of how dictionary atoms are shared since shared atoms represent common features with regard to a given concept and are robust to occlusion and corruption. Thus, two kinds of dictionaries are learned through online dictionary learning methods: one is the concept dictionary learned from key-point features of all the positive training samples while the other is the image dictionary learned from those of web images. Intuitively, the coherence degree is then calculated by the Frobenius norm of the product matrix of the two dictionaries. Experimental results show that the proposed approach can achieve constant overall improvement despite cross-domain incoherence.

Yongqing Sun, Kyoko Sudo, Yukinobu Taniguchi, Masashi Morimoto
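
The coherence measure mentioned in the abstract above (the Frobenius norm of the product matrix of two dictionaries) can be sketched in a few lines. The abstract does not state how the product is formed or normalized, so this example assumes each dictionary is a list of atoms of equal dimension and that the product is the matrix of inner products between concept atoms and image atoms. The function name coherence is hypothetical.

```python
import math


def coherence(concept_atoms, image_atoms):
    """Frobenius norm of the matrix of inner products between two atom sets.

    Assumed form: entry (i, j) is the dot product of concept atom i and
    image atom j; the norm is the square root of the sum of squared entries.
    """
    total = 0.0
    for a in concept_atoms:
        for b in image_atoms:
            dot = sum(x * y for x, y in zip(a, b))
            total += dot * dot
    return math.sqrt(total)


concept = [[1.0, 0.0], [0.0, 1.0]]   # toy unit-norm atoms
unrelated = [[0.0, 0.0]]             # shares nothing with the concept atoms
print(coherence(concept, concept), coherence(concept, unrelated))
```

With unit-norm atoms, the value grows as more atoms of one dictionary align with atoms of the other, and falls to zero when every pair is orthogonal.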

Weakly Principal Component Hashing with Multiple Tables

Image hashing based Approximate Nearest Neighbor (ANN) searching has drawn much attention in large-scale image dataset applications, where balancing precision against a high recall rate is a difficult task. In this paper, we propose a weakly principal component hashing method with multiple tables to encode binary codes. Analyzing the distribution of projected data on different principal component directions, we find that neighbors which are far apart in some principal component directions may be near in other directions. Therefore, we construct multiple-table hashing to find the positive samples missed by previous tables, which can improve the recall. For each table, we project data onto different principal component directions to learn hashing functions. In order to improve the precision rate, neighboring points in Euclidean space should also be neighbors in Hamming space. So we optimize the projected data using an orthogonal matrix to preserve the structure of the data in the Hamming space. Experimental results on public datasets, including comparisons with six hashing methods, show that the proposed method is more effective and outperforms the state of the art.

Haiyan Fu, Xiangwei Kong, Yanqing Guo, Jiayin Lu
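
The multiple-table idea in the abstract above can be sketched as follows, under simplifying assumptions: each table takes its own group of principal component directions, a binary code is the sign pattern of the projections, and a query collects candidates from every table. All names here (`encode`, `build_tables`, `query`) are hypothetical, and the orthogonal-matrix optimization mentioned in the abstract is omitted.

```python
import numpy as np

def encode(point, mean, dirs):
    """Binary code of one point: sign pattern of its projections onto dirs."""
    return ((point - mean) @ dirs.T > 0).tobytes()

def build_tables(data, bits_per_table, num_tables):
    """Build one hash table per group of principal component directions."""
    mean = data.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    tables = []
    for t in range(num_tables):
        dirs = vt[t * bits_per_table:(t + 1) * bits_per_table]
        buckets = {}
        for idx, point in enumerate(data):
            buckets.setdefault(encode(point, mean, dirs), []).append(idx)
        tables.append((dirs, buckets))
    return mean, tables

def query(point, mean, tables):
    """Union of bucket contents over all tables, which raises recall."""
    candidates = set()
    for dirs, buckets in tables:
        candidates.update(buckets.get(encode(point, mean, dirs), []))
    return candidates

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 8))  # hypothetical feature vectors
mean, tables = build_tables(data, bits_per_table=2, num_tables=2)
print(sorted(query(data[0], mean, tables)))
```

A neighbor that falls into a different bucket in the first table can still be retrieved through the second one, which is the recall argument made in the abstract.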

DUT-WEBV: A Benchmark Dataset for Performance Evaluation of Tag Localization for Web Video

Nowadays, numerous social videos are available on the Web. Social web videos are characterized by rich accompanying contextual information, which describes the content of the videos and thus greatly facilitates video search and browsing. Generally, context data such as tags are generated for the whole video, without any temporal indication of when they actually appear in the video. However, many tags only describe parts of the video content. Therefore, tag localization, the process of assigning tags to the underlying relevant video segments or frames, is gaining increasing research interest, and a benchmark dataset for the fair evaluation of tag localization algorithms is highly desirable. In this paper, we describe and release a dataset called DUT-WEBV, which contains 1550 videos collected from a web video portal by issuing 31 concepts as queries. These concepts cover a wide range of semantic aspects including scenes like “mountain”, events like “flood”, objects like “cows”, sites like “gas station”, and activities like “handshaking”, offering great challenges to the tag (i.e., concept) localization task. For each video of a tag, we carefully annotate the time durations when the tag appears in the video. Besides the video itself, the contextual information, such as thumbnail images, titles, and categories, is also provided. Together with this benchmark dataset, we present a baseline for tag localization using a multiple instance learning approach. Finally, we discuss some open research issues for tag localization in web videos.

Haojie Li, Lei Yi, Yue Guan, Hao Zhang

Clothing Extraction by Coarse Region Localization and Fine Foreground/Background Estimation

Online shopping is becoming more and more popular among billions of web users because of its convenience and efficiency. Customers can use content-based product image search engines to find their desired products. However, a frustrating fact is that the search results are significantly affected by the presence of natural backgrounds and fashion models. To minimize the influence of this noise, we propose an automatic clothing extraction algorithm that consists of two phases: coarse clothing region localization based on human proportions, and fine foreground/background modeling. Experiments on two datasets crawled from e-commerce websites demonstrate that the proposed approach achieves good performance and is competitive with the interactive solution.

Xiao Wu, Bo Zhao, Ling-Ling Liang, Qiang Peng

Object Categorization Using Local Feature Context

Recently, the use of context has been proven very effective for object categorization. However, most of the researchers only used context information at the visual word level without considering the context information of local features. To tackle this problem, in this paper, we propose a novel object categorization method by considering the local feature context. Given a position in an image, to represent this position’s visual information, we use the local feature on this position as well as other local features based on their distances and angles to this position. The use of local feature context is more discriminative and is also invariant to rotation and scale change. The local feature context can then be combined with the state-of-the-art methods for object categorization. Experimental results on the UIUC-Sports dataset and the Caltech-101 dataset demonstrate the effectiveness of the proposed method.

Tao Sun, Chunjie Zhang, Jing Liu, Hanqing Lu

Statistical Multiplexing of MDFEC-Coded Heterogeneous Video Streaming

In this paper, we propose an approach that combines statistical multiplexing and Multiple Description with Forward Error Correction (MDFEC) coding to make the optimal use of server bandwidth, taking into account the dynamically evolving content (complexity) of the different videos being streamed by the server, and path bandwidths and loss rates experienced by the users. We formally pose and analyze the complexity of the MDFEC statistical multiplexing problem, and present a dynamic programming based polynomial-time algorithm to compute the optimum solution. We also evaluate the performance of the proposed approach against those that do not use either MDFEC or statistical multiplexing, based on the experimental results obtained from real video sequences. Besides optimizing the overall distortion across all users, our approach is quite effective in providing differentiation between user groups with significantly different path bandwidths, particularly when a weighted version of our approach is used.

Hang Zhang, Adarsh K. Ramasubramonian, Koushik Kar, John W. Woods

Related HOG Features for Human Detection Using Cascaded Adaboost and SVM Classifiers

Robust and fast human detection in static images is very important for real applications. Although different feature descriptors have been proposed for human detection, two questions about the HOG descriptor still lack sufficient research and analysis: how to select and combine more distinctive block-based HOGs, and how to make use of both the correlation and the local information of the selected HOGs. In this paper, we present a set of Related HOG (RHOG) features, including distinctive block-based HOGs (Ele-HOGs), which are selected by Adaboost, and a global HOG descriptor formed by concatenating the Ele-HOGs (CSele-HOG). Ele-HOG can discriminatively describe the local distribution of a human object, while CSele-HOG contains global information. In addition, we propose a novel human detection framework of Cascaded Adaboost and SVM classifiers (CAS) based on RHOG features, which combines the advantages of Adaboost and SVM classifiers. Experimental results on the INRIA dataset demonstrate the effectiveness of the proposed method.

Hong Liu, Tao Xu, Xiangdong Wang, Yueliang Qian

Face Recognition Using Multi-scale ICA Texture Pattern and Farthest Prototype Representation Classification

In this paper, we present a novel approach to improve the performance of face recognition. To represent face images, we propose an effective texture descriptor, i.e., multi-scale ICA texture pattern (MITP). MITP generates multiple encoded images according to the order of response images by learned independent component analysis (ICA) filters of various scales, and then concatenates the MITP histograms from non-overlapping subregions of the encoded images into a single histogram. Based on a fundamental concept that a specific class can be modeled by a single query-dependent prototype, we introduce a simple classifier without parameter tuning, in which the decision is made using the farthest prototype rule. Moreover, a simple feature remapping strategy can further boost the performance. Experiments on two widely-used face databases demonstrate the effectiveness of our approach over other methods.

Meng Wu, Jun Zhou, Jun Sun

Detection of Biased Broadcast Sports Video Highlights by Attribute-Based Tweets Analysis

We propose a method for detecting biased highlights in a broadcast sports video according to viewers’ attributes obtained from a large number of tweets. Recently, Twitter has become widely used for real-time play-by-play comments on TV programs, especially on sports games. This trend enables us to effectively acquire viewers’ interests on a large scale. To make use of such tweets for highlight detection in broadcast sports video, the proposed method first performs an attribute analysis on the set of tweets issued by each user to classify which team he/she supports. It then detects biased highlights by referring to the number of tweets made by viewers with a specific attribute.

Takashi Kobayashi, Tomokazu Takahashi, Daisuke Deguchi, Ichiro Ide, Hiroshi Murase
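
The second step of the method above, counting tweets from viewers with a given attribute, can be sketched as below. The sketch assumes the attribute analysis has already labeled each tweet with the team its author supports. The fixed time windows, the threshold, and the name `biased_highlights` are illustrative assumptions rather than the authors' settings.

```python
from collections import Counter

def biased_highlights(tweets, team, window=60, threshold=3):
    """Return start times (seconds) of windows with many tweets from one team's fans.

    tweets: iterable of (timestamp_seconds, supported_team) pairs.
    """
    # Count tweets per time window, restricted to viewers supporting `team`.
    counts = Counter(ts // window for ts, supported in tweets if supported == team)
    return sorted(w * window for w, c in counts.items() if c >= threshold)

# Hypothetical labeled tweets from a match between teams "A" and "B".
tweets = [(5, "A"), (10, "A"), (50, "A"), (70, "B"),
          (130, "A"), (140, "B"), (150, "B"), (170, "B")]

print(biased_highlights(tweets, "A"))  # windows where supporters of A were active
print(biased_highlights(tweets, "B"))  # windows where supporters of B were active
```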

Cross-Media Computing for Content Understanding and Summarization

Temporal Video Segmentation to Scene Based on Conditional Random Fields

In this paper, we propose a novel approach to segmenting video into scenes based on conditional random fields (CRFs). The approach transforms scene segmentation into a label identification problem by defining three types of shots. To implement our algorithm, three middle-level features, namely shot difference signal, scene transition graph and audio type, are extracted to depict the label properties of each shot, and a CRF model is then employed to identify the label sequence. The advantage of the CRF model lies in its ability to integrate context information from neighboring shots, which produces accurate results in scene segmentation. The proposed approach is verified on seven types of data covering most major genres of TV programs. Experiments on the test data set yield an average F-measure of 0.88, which illustrates that the proposed method can accurately detect most scenes in different genres of programs.

Su Xu, Bailan Feng, Bo Xu

Improving Preservation and Access Processes of Audiovisual Media by Content-Based Quality Assessment

Quality assessment of audiovisual files is an important tool in many steps of the preservation workflow, as well as for use and access of archive material. Today mainly technical properties of the files can be checked, e.g. file integrity or standards compliance of file wrappers and encoded streams. Checking the audiovisual quality manually results in extremely high labor costs. In this work we present a semi-automatic quality assessment approach that combines the efficiency of fully automatic detection with the interpretation capability of humans to provide verified high quality assessment results. We also address the issue of interoperable metadata for quality assurance, discussing the state of the art and the gaps, and propose a framework for describing visual quality analysis results, which fills one of these gaps.

Peter Schallauer, Hannes Fassold, Albert Hofmann, Werner Bailer, Stefanie Wechtitsch

Distribution-Aware Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) has been popularly used in content-based search systems. There exist two main categories of LSH methods: one indexes the original data in an effective way to accelerate the search process; the other embeds the high-dimensional data into Hamming space and performs bit-wise operations to search for similar objects. In this paper, we propose a new LSH scheme, called Distribution-Aware LSH (DALSH), to address the lack of adaptation to real data, which is the intrinsic limitation of most LSH methods belonging to the former category. In DALSH, a given dataset is embedded into a low-dimensional space with projection vectors learned from data, followed by deriving hash functions from the distribution of the dimension-reduced data. We also present a multi-probe strategy to improve the query performance. Experimental comparisons with the state-of-the-art LSH methods on two high-dimensional datasets demonstrate the efficacy of DALSH.

Lei Zhang, Yongdong Zhang, Dongming Zhang, Qi Tian
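
One possible reading of "projection vectors learned from data" followed by "hash functions from the distribution of the dimension-reduced data" is sketched below. PCA stands in for the learned projection and per-dimension quantile boundaries stand in for the distribution-derived hash functions. Both choices, and the names `fit_hash` and `hash_key`, are assumptions of this example and not the DALSH construction itself. The multi-probe strategy is not shown.

```python
import numpy as np

def fit_hash(data, dims, buckets):
    """Learn a PCA projection and per-dimension quantile boundaries."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    proj = vt[:dims]                          # top principal directions
    reduced = (data - mean) @ proj.T
    # Interior quantiles split each reduced dimension into equally populated buckets.
    qs = np.linspace(0.0, 1.0, buckets + 1)[1:-1]
    edges = np.quantile(reduced, qs, axis=0)  # shape: (buckets - 1, dims)
    return mean, proj, edges

def hash_key(point, mean, proj, edges):
    """Bucket index along every reduced dimension, combined into one key."""
    reduced = (point - mean) @ proj.T
    return tuple(int(np.searchsorted(edges[:, d], reduced[d]))
                 for d in range(len(reduced)))

rng = np.random.default_rng(1)
data = rng.normal(size=(400, 6))  # hypothetical high-dimensional data
mean, proj, edges = fit_hash(data, dims=2, buckets=4)
keys = [hash_key(p, mean, proj, edges) for p in data]
print(keys[:3])
```

Because the boundaries follow the empirical distribution, buckets stay roughly balanced even when the data are skewed, which is the kind of adaptation to real data that the abstract argues for.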

Cross Concept Local Fisher Discriminant Analysis for Image Classification

Distance metric learning is widely used in many visual computing methods, especially image classification. Among various metric learning approaches, Fisher Discriminant Analysis (FDA) is a classical metric learning approach utilizing the pair-wise semantic similarity and dissimilarity in image classification. Moreover, Local Fisher Discriminant Analysis (LFDA) takes advantage of local data structure in FDA and achieves better performance. Both FDA and LFDA can only deal with images with simple concept relations, where images either belong to the same concept category or come from different categories. However, in real application scenarios, images usually contain multiple concepts, and relations of concepts and images are complex. In this paper, to improve the flexibility of LFDA on the complex image-concept relations, we propose a new pairwise constraints method called Cross Concept Local Fisher Discriminant Analysis (C²LFDA) for image classification. By considering the cross concept images as a special case of within-class samples, C²LFDA models the semantic relations of images for distance metric learning under the framework of LFDA. We calculate within-class and between-class scatter matrices based on the proposed re-weighting scheme and local manifold structure. By solving the objective function of discriminant analysis using the proposed scheme, a set of projected representations is obtained to better reflect the complex semantic relations among images. Experimental evaluations and comparisons show the effectiveness of the proposed method.

Xinhang Song, Shuqiang Jiang, Shuhui Wang, Jinhui Tang, Qingming Huang

A Weighted One Class Collaborative Filtering with Content Topic Features

A task that naturally emerges in recommender systems is to improve user experience through personalized recommendations based on users’ implicit feedback, as in news recommendation and scientific paper recommendation. Recommendation from implicit feedback is mostly treated as One Class Collaborative Filtering (OCCF), in which only positive examples can be observed and the majority of the data are missing. Introducing weights for treating missing data as negatives has been shown to help in OCCF. However, existing weighting approaches mainly use the statistical properties of the feedback to determine the weights, which is not very reasonable and is not personalized for each user-item pair. In this paper, we propose to improve recommendation by using the rich user and item content information to assist in weighting the unknown data in OCCF. To incorporate the useful content information, we obtain a content topic feature for each user and item by using a probabilistic topic modeling method, and determine the personalized weight of every unknown user-item pair from these content topic features. Extensive experiments show that our algorithm can achieve better performance than the state-of-the-art methods.

Ting Yuan, Jian Cheng, Xi Zhang, Qinshan Liu, Hanqing Lu

Contextualizing Tag Ranking and Saliency Detection for Social Images

Tag ranking and saliency detection are two key tasks for image understanding, and have attracted much attention in the past decades. In this paper, we investigate how to iteratively and mutually boost tag ranking and saliency detection by taking the outputs from one task as the context of the other one. Our method first computes an initial saliency value by fusing multiple feature maps, and then iteratively refines the saliency map based on the contextual information from image tag ranking. As a result, we obtain an integrated framework for tag saliency ranking, which combines both a visual attention model and multi-instance learning to investigate the saliency ranking order information. We show that this mutual reinforcement of saliency detection and tag ranking improves performance. Experiments conducted on Corel and Flickr image datasets demonstrate the effectiveness of the proposed framework.

Wen Wang, Congyan Lang, Songhe Feng

Illumination Variation Dictionary Designing for Single-Sample Face Recognition via Sparse Representation

This paper focuses on enhancing the Sparse Representation based Classifier (SRC) in single-sample face recognition tasks under varying illumination conditions. The major contribution is two-fold. Firstly, we present an interesting observation based on the Lambertian reflectance model: the identity information is canceled out in the pair-wise difference images from the same subject in the logarithmic domain, and only the subject-independent illumination variation remains. Secondly, inspired by this observation, we propose to “borrow” illumination variations from any generic subject by constructing an illumination variation dictionary composed of pair-wise difference images of generic subjects in the logarithmic domain to cover the possible illumination variations between test and gallery samples. Experimental results on the Extended Yale B and FERET face databases demonstrate the superiority of our method.

Biao Wang, Weifeng Li, Qingmin Liao
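
The cancellation described in the abstract above can be written out in a few lines. This is a sketch under the standard Lambertian model, with symbols chosen here rather than taken from the paper: ρ(x) is the albedo at pixel x, n(x) the surface normal, and s₁, s₂ two light source vectors illuminating the same subject.

```latex
\begin{align}
I_k(x) &= \rho(x)\, n(x)^{\top} s_k, \qquad k = 1, 2, \\
\log I_1(x) - \log I_2(x)
  &= \bigl[\log \rho(x) + \log\bigl(n(x)^{\top} s_1\bigr)\bigr]
   - \bigl[\log \rho(x) + \log\bigl(n(x)^{\top} s_2\bigr)\bigr] \\
  &= \log\bigl(n(x)^{\top} s_1\bigr) - \log\bigl(n(x)^{\top} s_2\bigr).
\end{align}
```

The albedo term, which carries much of the identity information, drops out of the difference. Treating the remaining term as subject-independent additionally assumes that face shape, and hence n(x), is similar across subjects. That assumption is stated here for the sketch and is not quoted from the paper.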

Efficient Extraction of Feature Signatures Using Multi-GPU Architecture

Recent popular applications such as online video analysis or image exploration techniques utilizing content-based retrieval create a serious demand for fast and scalable feature extraction implementations. One of the promising content-based retrieval models is based on feature signatures and the signature quadratic form distance. Although the model has proved competitive in terms of effectiveness, the slow feature extraction, which involves costly k-means clustering, restricts the model to preprocessing steps. In this paper, we present a highly efficient multi-GPU implementation of the feature extraction process, reaching more than two orders of magnitude speedup with respect to a classical CPU platform and a peak throughput that exceeds 8 thousand signatures per second. Such an implementation allows requested batches of frames or images to be extracted online without annoying delays. Moreover, besides online extraction tasks, our GPU implementation can also be used in a traditional preprocessing and training phase. For example, fast extraction allows indexing of huge databases or inspecting a significantly larger parameter space when searching for a similarity model configuration that is optimal in terms of both efficiency and effectiveness.

Martin Kruliš, Jakub Lokoč, Tomáš Skopal

Collaborative Tracking: Dynamically Fusing Short-Term Trackers and Long-Term Detector

This paper addresses the problem of long-term tracking of an unknown object in a video stream given only its location in the first frame. The problem is very challenging because of factors such as frame cuts, sudden appearance changes and long-lasting occlusions. We propose a novel collaborative tracking framework that fuses short-term trackers and a long-term object detector. The short-term trackers consist of a frame-to-frame tracker and a weakly supervised tracker, which are updated using the weakly supervised information and re-initialized by the long-term detector when the trackers fail. Additionally, the short-term trackers provide multiple instance samples on the object trajectory for training the long-term detector with bag samples under P-N constraints. Comprehensive experiments and comparisons demonstrate that our approach achieves better performance than the state-of-the-art methods.

Guibo Zhu, Jinqiao Wang, Changsheng Li, Hanqing Lu

A Real-Time Fluid Rendering Method with Adaptive Surface Smoothing and Realistic Splash

We present an adaptive approach in particle-based fluid simulation to smooth the surface rendered using splatting in screen space. A real-time effect of surface smoothing and edge preserving is achieved both when the camera is close to the fluid and when it is far away. The method is based on bilateral filtering and uses an adaptive range coefficient that depends on the viewing distance, so that the filter offers more blurring when the camera approaches the surface and more edge protection when the viewpoint is far from the fluid. We also introduce a physics-based splash model in turbulent flow for real-time simulation, with a corresponding rendering method. The local density of particles in the SPH simulation and the Weber number are used to determine the formation and breakup of splash particles. Based on the splash breakup regime in physics, a pattern is proposed to organize the shape formed by the newly generated breakup particles.

Pengcheng Wang, Yong Zhang, Dehui Kong, Baocai Yin
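
Two quantities named in the abstract above lend themselves to a small sketch: a bilateral filter weight whose range coefficient shrinks with viewing distance, and the Weber number We = ρv²L/σ. The linear interpolation between a near and a far coefficient and every numeric default below are hypothetical choices for illustration. The abstract does not state the authors' actual mapping or thresholds.

```python
import math

def range_sigma(view_distance, near=1.0, far=20.0, sigma_near=0.2, sigma_far=0.02):
    """Range coefficient: large (more blur) when close, small (sharper edges) when far."""
    t = min(max((view_distance - near) / (far - near), 0.0), 1.0)
    return sigma_near + t * (sigma_far - sigma_near)

def bilateral_weight(pixel_dist, depth_diff, sigma_space, sigma_range):
    """Standard bilateral weight: spatial Gaussian times range (depth) Gaussian."""
    return math.exp(-pixel_dist ** 2 / (2 * sigma_space ** 2)
                    - depth_diff ** 2 / (2 * sigma_range ** 2))

def weber_number(density, speed, length, surface_tension):
    """We = rho * v^2 * L / sigma, the ratio of inertial to surface-tension forces."""
    return density * speed ** 2 * length / surface_tension

# The same depth step is smoothed more when the camera is close than when it is far.
close = bilateral_weight(1.0, 0.05, sigma_space=2.0, sigma_range=range_sigma(1.0))
distant = bilateral_weight(1.0, 0.05, sigma_space=2.0, sigma_range=range_sigma(20.0))
print(close, distant)

# Water-like values: 1000 kg/m^3, 2 m/s, 1 cm length scale, 0.072 N/m surface tension.
print(weber_number(1000.0, 2.0, 0.01, 0.072))
```

A large Weber number means inertia dominates surface tension, which is the usual physical criterion for a liquid sheet or droplet to break up.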

Multi-document Summarization Exploiting Semantic Analysis Based on Tag Cluster

Multi-document summarization techniques aim to reduce documents to a small set of words or paragraphs that convey the main meaning of the original documents. Many approaches to multi-document summarization have used probability-based methods and machine learning techniques to summarize multiple documents sharing a common topic at the same time. However, these techniques fail to semantically analyze proper nouns and newly coined words because most of them depend on an outdated dictionary or thesaurus. To overcome these drawbacks, we propose a novel multi-document summarization technique which employs the tag clusters on Flickr, a kind of folksonomy system, for detecting key sentences from multiple documents. We first create a word frequency table for analyzing the semantics and contribution of words by using the HITS algorithm. Then, by exploiting tag clusters, we analyze the semantic relationship between words in the word frequency table. The experimental results on the TAC 2008 and 2009 data sets demonstrate the improvement of our proposed framework over existing summarization systems.

Jee-Uk Heu, Jin-Woo Jeong, Iqbal Qasim, Young-Do Joo, Joon-Myun Cho, Dong-Ho Lee

Demo Session Papers

ShareDay: A Novel Lifelog Management System for Group Sharing

Lifelogging is the automatic capture of daily activities using environmental and wearable sensors such as MobilePhone/SenseCam. The potential to capture such a large data collection presents many challenges, including data analysis, visualisation and motivating users of different ages and technology experience to lifelog. In this paper, we present a new generation of lifelog system to support reminiscence through incorporating event segmentation and group sharing.

Lijuan Marissa Zhou, Niamh Caprani, Cathal Gurrin, Noel E. O’Connor

Helping the Helpers: How Video Retrieval Can Assist Special Interest Groups

Given the increasing broadcasting data and the ever decreasing spare time that we can spend on consuming this data, systems are required that assist us in identifying important content. Following a use case of a fictional social worker, we introduce a video retrieval system that is designed to assist special interest groups in their information gathering task.

Frank Hopfgartner, Jinlin Guo, David Scott, Hongyi Wang, Yang Yang, Zhenxing Zhang, Lijuan Marissa Zhou, Cathal Gurrin

Browsing Linked Video Archives of WWW Video

In this paper, we describe an interactive video browsing system based on a graph of linked video objects. The system automatically organizes unstructured video archives by exploiting visual content similarity between objects in the videos. By generating a video link graph, the system can conceptually group videos that contain the same objects for searching and browsing. Both the chosen measures of video object similarity and the video data mining technologies are discussed here and included in the related software demonstrator. In addition, the software offers a query-by-image-example video search capability to jump into the video graph at a certain point to begin browsing the archive.

Zhenxing Zhang, Cathal Gurrin, Jinlin Guo

Multi-camera Egocentric Activity Detection for Personal Assistant

We demonstrate an egocentric human activity assistant system that has been developed to aid people in doing explicitly encoded motion behavior, such as operating a home infusion pump in sequence. This system is based on a robust multi-camera egocentric human behavior detection approach. This approach detects individual actions in interesting hot regions by spatio-temporal mid-level features, which are built by spatial bag-of-words method in time sliding window. Using a specific infusion pump as a test case, our goal is to detect individual human actions in the operations of a home medical device to see whether the patient is correctly performing the required actions.

Longfei Zhang, Yue Gao, Wei Tong, Gangyi Ding, Alexander Hauptmann

Music Search Engine with Virtual Musical Instruments Playing Interface

In this paper, we present a novel music search engine that supports queries made by playing virtual musical instruments. Different from previous query-by-keywords or query-by-humming methods, the proposed search engine provides a new input interface, which allows the user to play simulated musical instruments to produce the audio clip used for the search. Since the sounds produced by playing a given musical instrument follow a common standard, query-by-playing can effectively reduce the gap between users’ intentions and input signals. On the other hand, searching in this way offers more possibilities for different kinds of people, especially professionals, to accurately retrieve more types of music.

Mei Wang, Wei Mao, Hai-Kiat Goh

Navilog: A Museum Guide and Location Logging System Based on Image Recognition

We developed a computer vision-based mobile museum guide system named “Navilog”. It is a multimedia application for tablet devices. Using Navilog, visitors can take a picture of an exhibit; the system identifies the exhibit and shows additional descriptions and content related to it. It also enables visitors to log their locations within the museum. We conducted an experiment in the Railway Museum in Saitama, Japan.

Soichiro Kawamura, Tomoko Ohtani, Kiyoharu Aizawa

Early Skip Mode Detection by Exploring Extra Skip Patterns for H.264 Coarse Grain Quality Scalable Video Coding

We propose a fast early skip mode detection approach that exploits extra skip patterns missed by previous research. With a larger early skip candidate set, a larger number of macroblocks at the enhancement layer can be detected as skip mode. Experimental results demonstrate that the proposed method achieves a higher encoding time reduction with negligible quality loss compared with previous research results.

Hao Zhang, Xiao Yu Zhu, Xuan He

A Video Communication System Based on Spatial Rewriting and ROI Rewriting

Scalable Video Coding (SVC), as an extension of H.264/AVC, has been designed to provide an H.264/AVC compatible base layer and spatial, temporal and quality enhancement layers. Bit-stream rewriting in the SVC standard makes it possible to convert a quality enhancement layer to an H.264/AVC bit-stream, so that H.264/AVC decoder users can also experience high quality video content when network conditions and hardware permit. In this paper, we present a scalable video rewriting system that can rewrite spatial enhancement layers and the region-of-interest (ROI) of enhancement layers. Compared to traditional rewriting, the proposed system is suitable for more application scenarios and is more flexible.

Hongtao Wang, Fangdong Chen, Bin Li, Dong Zhang, Houqiang Li

NExT-Live: A Live Observatory on Social Media

This demonstration presents a live observatory system named ‘NExT-Live’. It aims to analyze live online social media data to mine social phenomena, senses, influences and geographic trends dynamically. It builds an efficient and robust set of crawlers to continually crawl online social interactions on various social networking sites, covering contents from different facets and in different medium types. It then performs analysis to fuse these social media data to generate analytics at different levels. In particular, it researches into high-level analytics to mine senses of different target entities, including People Sense, Location Sense, Topic Sense and Organization Sense. NExT-Live provides a live observatory platform that enables people to know the happenings of a place in order to lead a better life.

Huanbo Luan, Dejun Hou, Tat-Seng Chua

Online Boosting Tracking with Fragmented Model

We propose a novel method combining online boosting and fragments to overcome the drifting problem in online boosting tracking. We find that in previous online boosting methods, the voting weights of the first few selectors are so large that the remaining selectors cannot affect the final strong classifier. This problem occurs because the voting weights of the selectors are passed globally to adapt to object variation, whereas usually only parts of the object change significantly in a short time, and a changing part only affects its neighborhood, not the whole target area. We therefore divide the selectors into fragments to capture spatial information. The best weak classifiers of the selectors are combined linearly to form the final strong classifier, which is then used to find the location of the object in the next frame. Experiments show the robustness and generality of the proposed method.

Dingcheng Shen, Hua Zhang, Yanbing Xue, Guangping Xu, Zan Gao

Nonrigid Object Modelling and Visualization for Hepatic Surgery Planning in e-Health

This paper introduces an automatic approach of nonrigid object modelling and visualization for hepatic surgery planning, in particular, for live donor liver transplantation and accurate liver resection for cancer in e-health application. The proposed approach can build a system that supports radiologists in data preparation and gives surgeons precise information for making optimal decisions. It provides 3D representation of liver parenchyma and vasculature, and 3D simulation of patient specific data. The system is realized in four major stages, including registration of multimodal images; segmentation of liver parenchyma; extraction of liver vessels; and modelling and visualization of liver parenchyma and vessels. The approach is unique in that it integrates advanced techniques such as machine learning algorithm with a knowledge base of the organ. The details of these stages are described along with experimental results and discussions of the advantages of the approach over other approaches.

Suhuai Luo, Jiaming Li

Encoder/Decoder for Privacy Protection Video with Privacy Region Detection and Scrambling

Privacy region scrambling is an effective method to protect privacy information in videos. In this paper, we present an encoder/decoder system for privacy protection video. On the encoder side, the privacy region in the video is automatically extracted and scrambled while encoding. On the decoder side, users with a legitimate key can exactly restore the original video; otherwise, only the non-privacy part can be decoded correctly, while the privacy regions remain encrypted.

Feng Dai, Dongming Zhang, Jintao Li

TVEar: A TV-tagging System Based on Audio Fingerprint

This demo presents a TV-tagging system named TVEar based on audio fingerprint. It is a content-based audio information retrieval system, and has the ability to listen to a couple seconds of a TV show and determine what show is being watched. TVEar is robust to noisy environments, such as office/street/car environments. This system is designed to make a remarkable entry into social media.

Tao Jiang, Jiahong Li, Rihui Wu, Kang Xiang

VTrans: A Distributed Video Transcoding Platform

This demo presents a distributed video transcoding platform named VTrans, which performs fast transcoding of videos. The fast video transcoding method used in this platform combines GOP-level and slice-level parallelism, which accelerates video transcoding in time and space respectively. By using the system, users’ waiting time for transcoding a video is reduced, and the utilization of system resources is improved.

Zhe Ouyang, Feng Dai, Junbo Guo, Yongdong Zhang

Fast ASA Modeling and Texturing Using Subgraph Isomorphism Detection Algorithm of Relational Model

In this paper, a new method based on the Subgraph Isomorphism Detection (SID) algorithm is proposed to automatically construct Anhui-Styled Architecture (ASA) models. First, by analyzing the intrinsic features of ASA, we set up an architecture module database. Then we use the SID algorithm to get a topology graph and traverse each node of the topology graph. Finally, we render these graph nodes to get a 3D model of the ASA.

Feng Xue, Xiaotao Wang, Feng Liang, Pingping Yang

Video Browser Showdown

An Approach for Browsing Video Collections in Media Production

This paper describes a video browsing tool for media (post-)production that enables users to efficiently find relevant media items in redundant and sparsely annotated content collections. Users can iteratively cluster the content set by different features and restrict it by selecting a subset of clusters. In addition, similarity search by different features is supported. Desktop and Web-based variants of the user interface, including temporal preview functionality, are available.

Werner Bailer, Wolfgang Weiss, Christian Schober, Georg Thallinger

DCU at MMM 2013 Video Browser Showdown

This paper describes a handheld video browser that incorporates shot boundary detection, key frame extraction, semantic content analysis, key frame browsing, and similarity search.

David Scott, Jinlin Guo, Cathal Gurrin, Frank Hopfgartner, Kevin McGuinness, Noel E. O’Connor, Alan F. Smeaton, Yang Yang, Zhenxing Zhang

AAU Video Browser with Augmented Navigation Bars

We present an improved version of last year’s winner of the Video Browser Showdown. In a preprocessing step, video segments are detected and clustered into several latent classes of similar content based on color and motion information. The navigation bars of our video browser are then augmented with different colors indicating where elements of the detected clusters are located. Because humans can quickly classify the content of clusters, they can benefit from this information when browsing through a video.

Manfred Del Fabro, Bernd Münzer, Laszlo Böszörmenyi

NII-UIT-VBS: A Video Browsing Tool for Known Item Search

This paper introduces a video browsing tool for the known item search task. The key idea is to reduce the number of segments that need further investigation in several ways, such as applying visual filters and skimming representative keyframes. The user interface is designed to minimize unnecessary navigation. Furthermore, a coarse-to-fine approach is employed to quickly find the target clip.

Duy-Dinh Le, Vu Lam, Thanh Duc Ngo, Vinh Quang Tran, Vu Hoang Nguyen, Duc Anh Duong, Shin’ichi Satoh

VideoCycle: User-Friendly Navigation by Similarity in Video Databases

VideoCycle is a candidate application for this second Video Browser Showdown challenge. VideoCycle allows interactive intra-video and inter-shot navigation with dedicated gestural controllers. MediaCycle, the framework it is built upon, provides media organization by similarity, with a modular architecture enabling most of its workflow to be performed by plugins: feature extraction, clustering, segmentation, summarization, intra-media and inter-segment visualization. MediaCycle focuses on user experience with user interfaces that can be tailored to specific use cases.

Christian Frisson, Stéphane Dupont, Alexis Moinet, Cécile Picard-Limpens, Thierry Ravet, Xavier Siebert, Thierry Dutoit

Interactive Video Retrieval Using Combination of Semantic Index and Instance Search

We present an efficient implementation of an interactive video search tool for Known Item Search (KIS) that combines Semantic Indexing (SIN) and Instance Search (INS). The interaction lets users index a video clip using their knowledge of its visual content. Our system offers users a set of concepts, and the SIN module returns candidate keyframes based on the users’ selection of concepts. Users choose the keyframes that contain the items of interest, and the INS module recommends frames with content similar to the target clip. Finally, the precise time stamps of the clip are given by Temporal Refinement (TR).

Hongliang Bai, Lezi Wang, Yuan Dong, Kun Tao

