About this book

The two-volume set LNCS 7732 and 7733 constitutes the thoroughly refereed proceedings of the 19th International Conference on Multimedia Modeling, MMM 2013, held in Huangshan, China, in January 2013.

The 30 revised regular papers, 46 special session papers, 20 poster session papers, 15 demo session papers, and 6 video browser showdown papers were carefully reviewed and selected from numerous submissions. The two volumes contain papers presented in the topical sections on multimedia annotation I and II, interactive and mobile multimedia, classification, recognition and tracking I and II, ranking in search, multimedia representation, multimedia systems, poster papers, special session papers, demo session papers, and video browser showdown.



Erratum to: Efficient HEVC to H.264/AVC Transcoding with Fast Intra Mode Decision

The original version of the paper starting on p. 295 was revised. The link in Reference 14 has been replaced. The original chapter was corrected.

Jun Zhang, Feng Dai, Yongdong Zhang, Chenggang Yan

Regular Papers

Multimedia Annotation I

Semi-supervised Concept Detection by Learning the Structure of Similarity Graphs

We present an approach for detecting concepts in images by a graph-based semi-supervised learning scheme. The proposed approach builds a similarity graph between both the labeled and unlabeled images of the collection and uses the Laplacian Eigenmaps of the graph as features for training concept detectors. Therefore, it offers multiple options for fusing different image features. In addition, we present an incremental learning scheme that, given a set of new unlabeled images, efficiently performs the computation of the Laplacian Eigenmaps. We evaluate the performance of our approach both on synthetic datasets and on MIR Flickr, comparing it with high-performance state-of-the-art learning schemes with competitive and in some cases superior results.

Symeon Papadopoulos, Christos Sagonas, Ioannis Kompatsiaris, Athena Vakali

Refining Image Annotation by Integrating PLSA with Random Walk Model

In this paper, we present a new method for refining image annotation by integrating probabilistic latent semantic analysis (PLSA) with a random walk (RW) model. First, we construct a PLSA model with asymmetric modalities to estimate the posterior probability of each annotation keyword for an image, and then construct a label similarity graph by a weighted linear combination of label similarity and visual similarity. A random walk process over the label graph is then employed to further mine the correlation of the keywords so as to obtain the refined annotation, which plays a crucial role in semantic-based image retrieval. The novelty of our method mainly lies in two aspects: exploiting PLSA to accomplish the initial semantic annotation task, and running a random walk process over the constructed label similarity graph to refine the candidate annotations generated by the PLSA. Compared with several state-of-the-art approaches on the Corel5k and Mirflickr25k datasets, the experimental results show that our approach performs more efficiently and accurately.

Dongping Tian, Xiaofei Zhao, Zhongzhi Shi
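
The refinement step described in the abstract above, a random walk over a label similarity graph seeded with initial keyword probabilities, can be illustrated with a generic random walk with restart. This is a minimal sketch and not the authors' formulation: the function name, the restart weight alpha, the column normalization, and the toy numbers are all assumptions made for illustration.

```python
def random_walk_refine(similarity, initial, alpha=0.5, iters=100):
    """Generic random walk with restart over a label similarity graph.

    similarity: square matrix of non-negative label-to-label similarities
                (hypothetical input; the abstract combines label and visual
                similarity to build it).
    initial:    initial keyword scores, e.g. posterior probabilities from a
                first-stage annotator.
    alpha:      weight of the propagated scores; (1 - alpha) is the restart
                weight on the initial scores (an assumed parameter).
    """
    n = len(initial)
    # Column-normalize so each label distributes its score over its neighbors.
    col_sums = [sum(similarity[i][j] for i in range(n)) for j in range(n)]
    transition = [
        [similarity[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = list(initial)
    for _ in range(iters):
        scores = [
            alpha * sum(transition[i][j] * scores[j] for j in range(n))
            + (1 - alpha) * initial[i]
            for i in range(n)
        ]
    return scores


# Toy example: two mutually similar labels, only the first one initially scored.
refined = random_walk_refine([[0.0, 1.0], [1.0, 0.0]], [1.0, 0.0])
print(refined)
```

In this toy case the second label, which started with a score of zero, receives part of the first label's score because the two are similar. That is the intuition behind using keyword correlation to refine candidate annotations.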

Social Media Annotation and Tagging Based on Folksonomy Link Prediction in a Tripartite Graph

Social tagging has become a popular way for users to annotate, search, navigate and discover online social media, resulting in a sheer amount of metadata collectively generated by people. This paper focuses on two tagging problems: (1) recommending the most suitable tags during a user’s tagging process and (2) labeling latent tags relevant to a social media item, so that social media can be more browsable, searchable, and shareable by users. The proposed approach employs the Katz measure, a path-ensemble based proximity measure, to predict links in a weighted tripartite graph which represents the folksonomy. From a graph-based proximity perspective, our method recommends appropriate tags for a given user-item pair and uncovers hidden tags potentially relevant to a given item. We evaluate our method on a real-world folksonomy. Our experiments show that not only does our algorithm outperform existing algorithms, but it can also obtain significant gains in cold start situations where relatively little information is known about a user or an item.

Majdi Rawashdeh, Heung-Nam Kim, Abdulmotaleb El Saddik
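
The Katz measure named in the abstract above scores a pair of nodes by summing over all paths between them, damping a path of length l by beta to the power l. A minimal sketch follows, assuming a small weighted adjacency matrix and a truncated sum. The function name, the toy graph, the value of beta, and the truncation length are illustrative assumptions, not details taken from the paper.

```python
def katz_scores(adj, beta=0.05, max_len=50):
    """Truncated Katz proximity: sum over l = 1..max_len of beta**l * A**l.

    adj is a square weighted adjacency matrix (list of lists). The infinite
    series only converges when beta is below 1 / (largest eigenvalue of adj).
    """
    n = len(adj)
    scores = [[0.0] * n for _ in range(n)]
    # 'power' holds beta**l * A**l for the current path length l (starts at l = 0).
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_len):
        power = [
            [beta * sum(power[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        for i in range(n):
            for j in range(n):
                scores[i][j] += power[i][j]
    return scores


# Hypothetical toy folksonomy: node 0 = user, node 1 = item, nodes 2 and 3 = tags.
USER, ITEM, TAG_A, TAG_B = 0, 1, 2, 3
adj = [
    [0, 1, 2, 0],  # user: linked to the item, used tag A twice
    [1, 0, 1, 3],  # item: linked to the user, tagged with A once and B three times
    [2, 1, 0, 0],
    [0, 3, 0, 0],
]
scores = katz_scores(adj)
# Rank tags by their proximity to the user; tag B is only reachable via the item.
ranked_tags = sorted([TAG_A, TAG_B], key=lambda t: scores[USER][t], reverse=True)
print(ranked_tags)
```

Tag B has no direct link to the user, yet it still receives a non-zero score through the two-step path via the item. This is how a path-based proximity measure can surface tags a user has not used before.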

Can You See It? Two Novel Eye-Tracking-Based Measures for Assigning Tags to Image Regions

Eye tracking information can be used to assign given tags to image regions in order to describe the depicted scene in more detail. We introduce and compare two novel eye-tracking-based measures for conducting such assignments: The segmentation measure uses automatically computed image segments and selects the one segment the user fixates for the longest time. The heat map measure is based on traditional gaze heat maps and sums up the users’ fixation durations per pixel. Both measures are applied to gaze data obtained for a set of social media images, which have manually labeled objects as ground truth. We have determined a maximum average precision of 65% at which the segmentation measure points to the correct region in the image. The best coverage of the segments is obtained for the segmentation measure with an F-measure of 35%. Overall, both newly introduced gaze-based measures deliver better results than baseline measures that select a segment based on the golden ratio of photography or the center position in the image. The eye-tracking-based segmentation measure significantly outperforms the baselines for precision and F-measure.

Tina Walber, Ansgar Scherp, Steffen Staab
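
The two measures described in the abstract above can be sketched in a few lines. This is only an illustration under assumptions: fixations are taken to be (x, y, duration) triples, the segmentation is represented by a hypothetical lookup function from pixel to segment id, and "fixated for the longest time" is read as the largest total fixation duration per segment.

```python
from collections import defaultdict


def segment_with_longest_fixation(fixations, segment_of_pixel):
    """Segmentation measure (sketch): pick the segment with the largest
    total fixation duration.

    fixations:        iterable of (x, y, duration_ms) triples (assumed format).
    segment_of_pixel: callable mapping (x, y) to a segment id (hypothetical
                      stand-in for an automatic image segmentation).
    """
    total = defaultdict(float)
    for x, y, duration in fixations:
        total[segment_of_pixel(x, y)] += duration
    # Raises ValueError if there are no fixations at all.
    return max(total, key=total.get)


def fixation_heat_map(fixations):
    """Heat map measure (sketch): sum fixation durations per pixel."""
    heat = defaultdict(float)
    for x, y, duration in fixations:
        heat[(x, y)] += duration
    return dict(heat)


# Toy example: a 10-pixel-wide image split into a left and a right segment.
def toy_segments(x, y):
    return "left" if x < 5 else "right"


gaze = [(1, 1, 200.0), (8, 2, 150.0), (9, 3, 120.0)]
print(segment_with_longest_fixation(gaze, toy_segments))
```

Here the single longest fixation falls on the left segment, but the right segment accumulates more total viewing time (270 ms against 200 ms) and is therefore selected under this reading of the measure.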

Multimedia Annotation II

Visual Analysis of Tag Co-occurrence on Nouns and Adjectives

In recent years, due to the wide spread of photo sharing Web sites such as Flickr and Picasa, we can put our own photos on the Web and show them to the public easily. To make photos easy to search for, it is common to add several keywords, called “tags”, when we upload photos. However, most of the tags are added one by one independently without much consideration of association between the tags. In this paper, as a preparation for realizing simultaneous recognition of nouns and adjectives, we examine the visual relationship between tags, particularly noun tags and adjective tags, by analyzing image features of a large number of tagged photos in social media sites on the Web with mutual information. As a result, it turned out that mutual information between some nouns such as “car” and “sea” and adjectives related to color such as “red” and “blue” was relatively high, which showed that their relations were stronger.

Yuya Kohara, Keiji Yanai

Verb-Object Concepts Image Classification via Hierarchical Nonnegative Graph Embedding

Most existing image classification methods focus on handling images with only “object” concepts. At the same time, in real-world cases, there exists a great variety of images which contain “verb-object” concepts, rather than only “object” ones. The hierarchical structure embedded in these “verb-object” concepts can help to enhance classification. However, traditional feature representation methods cannot utilize it. To address this shortcoming, we present in this paper a novel approach, called Hierarchical Nonnegative Graph Embedding (HNGE). By assuming that those “verb-object” concept images which share the same “object” part but different “verb” parts have a specific hierarchical structure, we make use of this hierarchical structure and employ an effective technique, named nonnegative graph embedding, to perform feature extraction as well as image classification. Extensive experiments compared with the state-of-the-art algorithms on nonnegative data factorization demonstrate the feasibility, convergence and classification power of the proposed approach on “verb-object” concept image classification.

Chao Sun, Bing-Kun Bao, Changsheng Xu

Robust Semantic Video Indexing by Harvesting Web Images

Semantic video indexing, also known as video annotation or video concept detection in the literature, has attracted significant attention recently. Due to the scarcity of training videos, most existing approaches can scarcely achieve satisfactory performance. This paper proposes a robust semantic video indexing framework, which exploits user-tagged web images to assist learning robust semantic video indexing classifiers. The following two challenges are well studied: (a) domain difference between images and videos; and (b) noisy web images with incorrect tags. Specifically, we first estimate the probabilities of images being correctly tagged as confidence scores and filter out the images with low confidence scores. We then develop a robust image-to-video indexing approach to learn reliable classifiers from a limited number of training videos together with abundant user-tagged images. A robust loss function weighted by the confidence scores of images is used to further alleviate the influence of noisy samples. An optimal kernel space, in which the domain difference between images and videos is minimal, is automatically discovered by the approach to tackle the domain difference problem. Experiments on the NUS-WIDE web image dataset and the Kodak consumer video corpus demonstrate the effectiveness of the proposed robust semantic video indexing framework.

Yang Yang, Zheng-Jun Zha, Heng Tao Shen, Tat-Seng Chua

Interactive and Mobile Multimedia

Interactive Evaluation of Video Browsing Tools

The Video Browser Showdown (VBS) is a live competition for evaluating video browsing tools regarding their efficiency at known-item search (KIS) tasks. The first VBS was held at MMM 2012 with eight teams working on 14 tasks, of which eight were completed by expert users and six by novices. We describe the details of the competition, analyze results regarding the performance of tools, the differences between the tasks and the nature of the false submissions.

Werner Bailer, Klaus Schoeffmann, David Ahlström, Wolfgang Weiss, Manfred Del Fabro

Paint the City Colorfully: Location Visualization from Multiple Themes

The prevalence of digital photo capturing devices has generated large-scale photo collections with geographical information, leading to interesting tasks like geographically organizing photos and location visualization. In this work, we propose to organize photos both geographically and thematically, and investigate the problem of location visualization from multiple themes. The novel visualization scheme provides a rich display landscape for location exploration from all-round views. A two-level solution is presented, where we first identify the highly photographed places (POI) and discover their distributed themes, and then aggregate the lower-level themes to generate the higher-level themes for location visualization. We have conducted experiments on a Flickr dataset and exhibited the visualization for the city of Singapore. The experimental results have validated the proposed method and demonstrated the potential of location visualization from multiple themes.

Quan Fang, Jitao Sang, Changsheng Xu, Ke Lu

Interactive Video Advertising: A Multimodal Affective Approach

Online video advertising (video-in-video) strategies are typically agnostic to the video content (e.g., advertising on YouTube) and the human viewer’s preferences. How to assess the emotional state and engagement of the viewer to place an advertisement? Where to insert an advertisement based on the content in an advertisement and a specific target video stream? Surely these are relevant questions that should be addressed by a good model for video advertisement placement. In this paper, we propose a novel framework to address two important aspects: (a) multi-modal affective analysis of video content and viewer behavior, and (b) a method for interactive personalized advertisement insertion for a single user. Our analysis and framework are backed by a systematic study of literature in marketing, consumer psychology and affective analysis of videos. Results from the user-study experiments demonstrate that the proposed method performs better than the state-of-the-art in video-in-video advertising.

Karthik Yadati, Harish Katti, Mohan Kankanhalli

GPS Estimation from Users’ Photos

Nowadays, social media are a very popular way for people to share their photos with their friends. Many of the photos are geo-tagged (with GPS information), either automatically or manually. Social media management websites such as Flickr allow users to manually label their uploaded photos with GPS information by dragging them onto a map. However, manually dragging photos onto the map introduces more errors and is tedious for users. Thus, in this paper, a GPS location estimation approach is proposed. For an uploaded image, its GPS information is estimated by both hierarchical global feature classification and local feature refinement to ensure accuracy while limiting computational cost. To guarantee the estimation performance, k-nearest neighbors are selected in the global feature classification stage. Experiments show the effectiveness of our proposed approach.

Jing Li, Xueming Qian, Yuan Yan Tang, Linjun Yang, Chaoteng Liu

Knowing Who You Are and Who You Know: Harnessing Social Networks to Identify People via Mobile Devices

With more and more images being uploaded to social networks each day, the resources for identifying a large portion of the world are available. However, the tools to harness and utilize this information are not sufficient. This paper presents a system, called PhacePhinder, which can build a face database from a social network and make it accessible from mobile devices. This is made possible by combining existing technologies. It also makes use of a fusion probabilistic latent semantic analysis to determine strong connections between users as well as social photos. We demonstrate a working prototype that can identify a face from a picture taken with a mobile phone using a database derived from images gathered directly from a social network and return a meaningful social connection to the recognized face.

Mark Bloess, Heung-Nam Kim, Majdi Rawashdeh, Abdulmotaleb El Saddik

Classification, Recognition and Tracking I

Hyperspectral Image Classification by Using Pixel Spatial Correlation

This paper introduces a hyperspectral image classification approach by using pixel spatial relationship. In hyperspectral images, the spatial relationship among pixels has been shown to be important in the exploration of pixel labels. To better employ the spatial information, we propose to estimate the correlation among pixels in a hypergraph structure. In the constructed hypergraph, each pixel is denoted by a vertex, and the hyperedge is constructed by using the spatial neighbors of each pixel. Semi-supervised learning on the constructed hypergraph is conducted for hyperspectral image classification. Experiments on two datasets are used to evaluate the performance of the proposed method. Comparisons with the state-of-the-art methods demonstrate that the proposed method can effectively investigate the spatial relationship among pixels and achieve better hyperspectral image classification results.

Yue Gao, Tat-Seng Chua

Research on Face Recognition under Images Patches and Variable Lighting

Many classic and contemporary face recognition algorithms work well on public data sets, but degrade sharply under variations in lighting and expression and when only image patches are available. New correlation filter designs have been shown to be distortion invariant, and the advantages of using images are due to the invariance to visible illumination variations. We propose a conceptually simple face recognition system, based on a simple non-linear correlation filter, that achieves a high degree of robustness and stability to illumination variation and image patches. The proposed technique is based on the premise that the face is an object composed of facial characteristics. The system can efficiently and effectively recognize faces under a variety of realistic conditions, using only frontal images under the proposed illuminations as training. It achieves a rate of 96.3% in face identification and 94.6% in the verification task.

Wengang Feng

A New Network-Based Algorithm for Human Group Activity Recognition in Videos

In this paper, a new network-based (NB) algorithm is proposed for human group activity recognition in videos. The proposed NB algorithm introduces three different networks for modeling the correlation among people as well as the correlation between people and the surrounding scene. With the proposed network models, human group activities can be modeled as the package transmission process in the network. Thus, by analyzing the energy consumption situation in these specific “package transmission” processes, various group activities can be effectively detected. Experimental results demonstrate the effectiveness of our proposed algorithm.

Gaojian Li, Weiyao Lin, Sheng Zhang, Jianxin Wu, Yuanzhe Chen, Hui Wei

Exploit Spatial Relationships among Pixels for Saliency Region Detection Using Topic Model

In this paper, we describe an approach to saliency detection as a two-category (salient or not) soft clustering using a topic model. In order to simulate the parallel visual neural perception of humans, many sub-regions are sampled from an image, where each one is considered as a set of colors from a codebook, which is a color palette for the image. We assume that salient pixels are more likely to be spatially adjacent, and therefore to fall in the same sub-region, and that the same holds for less salient pixels. Consequently, all the sub-regions are clustered into two assumed topics with probabilities, “salient” and “non-salient”, and the “salient” topic is used to give the saliency value of each pixel according to its posterior conditional probability. Our method gives a global saliency map at full resolution, and experiments illustrate that it is competitive with state-of-the-art methods.

Guang Jiang, Xi Liu, JinPeng Yue, Zhongzhi Shi

Classification, Recognition and Tracking II

Mining People’s Appearances to Improve Recognition in Photo Collections

We show how to recognize people in Consumer Photo Collections by employing a graphical model together with a distance-based face description method. To further improve recognition performance, we incorporate context in the form of social semantics. We devise an approach that has a data mining technique at its core to discover and incorporate patterns of groups of people frequently appearing together in photos. We demonstrate the effect of our probabilistic approach through experiments on a dataset that spans nearly ten years.

Markus Brenner, Ebroul Izquierdo

Person Re-identification by Local Feature Based on Super Pixel

In many multi-camera surveillance systems, there is a need to identify whether a captured person has appeared before over the network of cameras. This is the person re-identification problem. In this paper, we propose a novel re-identification method based on super pixel features. Firstly, local C-SIFT features based on super pixels are extracted as visual words, and appearance details are used to describe detected objects. Secondly, a TF-IDF vocabulary index tree is built to speed up person search. Finally, an image-retrieval approach is adopted to implement person re-identification. Experimental results on the ETHZ dataset show that our method is better than the approach proposed by Schwartz and two machine learning methods based on SVM and PCA.

Cheng Liu, Zhicheng Zhao

An Effective Tracking System for Multiple Object Tracking in Occlusion Scenes

In this paper, we propose an effective multi-object tracking system which can handle partial occlusion in the tracking process. First, this method employs the part-based model to localize the person and body parts in every frame. Then it leverages the motion characteristics of both parts and the entire body to generate the trajectories of individuals. To overcome the difficulty of partial occlusion, we propose to formulate the task of multi-object tracking as multi-object matching with body part cues. A large-scale comparison experiment on popular tracking datasets demonstrates the superiority of the proposed method.

Weizhi Nie, Anan Liu, Yuting Su, Zan Gao

Ranking in Search

Image Search Reranking with Semi-supervised LPP and Ranking SVM

Learning to rank is one of the most popular ranking methods used in image retrieval and search reranking. However, the high dimensionality of the visual features usually causes the “curse of dimensionality” problem. Dimensionality reduction is one of the key steps to overcome this problem. However, existing dimensionality reduction methods are typically designed for classification, not for ranking tasks. Since they do not utilize ranking information such as relevance degree labels, direct utilization of conventional dimensionality reduction methods in ranking applications generally cannot achieve the best performance. In this paper, we study the task of image search reranking, and propose a novel system scheme based on Locality Preserving Projections (LPP) and RankingSVM. Furthermore, in the proposed scheme, we improve LPP by incorporating the relevance degree information into it. Since this kind of method can use the information of labeled and unlabeled data, we name it semi-supervised LPP (Semi-LPP). Experiments on the popular MSRA-MM dataset demonstrate the superiority of the proposed scheme and Semi-LPP method in image search reranking applications.

Zhong Ji, Yanru Yu, Yuting Su, Yanwei Pang

Co-ranking Images and Tags via Random Walks on a Heterogeneous Graph

Ranking of image search results has attracted considerable attention. Although many graph-based ranking algorithms have demonstrated remarkable success, most of their applications are limited to single networks, such as the network of tags associated with images. In this paper, we investigate the problem of co-ranking images and their attached tags in a heterogeneous network, which consists of three graphs: the image graph connecting images, the tag graph connecting tags attached to the images, and the image-tag graph connecting the above two graphs together. Observing that existing ranking approaches do not consider images and tags simultaneously, a novel co-ranking method via random walks on all three graphs is proposed to significantly improve the ranking effectiveness on both images and tags. Experimental results conducted on three benchmark data sets show that our approach outperforms the state-of-the-art local ranking approaches for image ranking and tag ranking and scales well on large-scale data sets.

Lin Wu, Yang Wang, John Shepherd

Social Visual Image Ranking for Web Image Search

Much research has focused on how to match the textual query with visual images and their surrounding texts or tags for Web image search. The returned results are often unsatisfactory due to their deviation from user intentions. In this paper, we propose a novel image ranking approach to Web image search, in which we use social data from social media platforms jointly with visual data to improve the relevance between returned images and user intentions (i.e., social relevance). Specifically, we propose a community-specific Social-Visual Ranking (SVR) algorithm to rerank the Web images by taking social relevance into account. Through extensive experiments, we demonstrate the importance of both visual factors and social factors, and the effectiveness and superiority of the social-visual ranking algorithm for Web image search.

Shaowei Liu, Peng Cui, Huanbo Luan, Wenwu Zhu, Shiqiang Yang, Qi Tian

Multimedia Representation

Fusion of Audio-Visual Features and Statistical Property for Commercial Segmentation

Commercial segmentation is a primary step of commercial management, which is an emerging technology. Relative to general video scene segmentation, commercial segmentation is particular because of dramatic changes in acoustic effect and chromatic composition. Conventional algorithms emphasize utilizing new audio and visual features to adapt to changes over time. In this paper, we propose a novel scheme to fuse audio-visual characteristics and the statistical property of commercial length to find individual commercial boundaries. First, mid-level descriptors such as Static Shot with Product Information (SSPI) are used to predict the likelihood of a commercial boundary for every shot boundary. Then, a Dynamic Programming (DP) refiner with a Distribution of Individual Commercial Length (DICL) constraint is applied to find the optimal path of a Markov Chain of these shot boundaries. Experiments on simulated and real datasets show promising results.

Bo Zhang, Bailan Feng, Bo Xu

Learning Affine Robust Binary Codes Based on Locality Preserving Hash

In large-scale vision applications, high-dimensional descriptors are extracted from image patches in large quantities. Thus, hashing methods that compact descriptors into binary codes have been proposed to achieve practical resource consumption. Among these methods, unsupervised hashing aims to preserve Euclidean distances, which do not correlate well with the similarity of image patches. Supervised hashing methods exploit labeled data to learn binary codes based on visual or semantic similarity, but are usually slow to train and consider the global structure of the data. When data lie on a sub-manifold, the global structure cannot reflect the inherent structure well and may lead to non-compact codes. We propose locality preserving hash (LPH) to learn affine robust binary codes. LPH preserves local structure by embedding data into a sub-manifold and performing binarization that minimizes the false classification ratio while keeping the partition balanced. Experiments on two datasets show that LPH is easy to train and performs better than state-of-the-art methods with more compact binary codes.

Wei Zhang, Ke Gao, Dongming Zhang, Jintao Li

A Novel Segmentation-Based Video Denoising Method with Noise Level Estimation

Most state-of-the-art video denoising algorithms assume an additive noise model, which is often violated in practice. In this paper, two main issues are addressed, namely, segmentation-based block matching and the estimation of noise level. Unlike previous block matching methods, we present an efficient algorithm to perform the block matching in spatially-consistent segmentations of each image frame. To estimate the noise level function (NLF), which describes the noise level as a function of image brightness, we propose a fast bilateral medial filter based method. Under the assumption of short-term coherence, this estimation method is then extended from a single frame to multiple frames. Coupling these two techniques together, we propose a segmentation-based customized BM3D method to remove colored multiplicative noise from videos. Experimental results on benchmark data sets and real videos show that our method significantly outperforms the state of the art in removing colored multiplicative noise.

Shijie Zhang, Jing Zhang, Zhe Yuan, Shuai Fang, Yang Cao

Multimedia Systems

Blocking Artifact Reduction in DIBR Using an Overcomplete 3D Dictionary

It has been reported that the image quality and depth perception rates are undesirably decreased by compression in DIBR. This is because high-frequency components are filtered by compression, and thus several compression artifacts occur. Blocking artifacts are the most representative ones which seriously degrade the picture quality, and are annoying to viewers. The fundamental cause of the coding artifacts is that the 3D contents are delivered to users over the existing 2D broadcast infrastructures which employ BDCT coding as image compression techniques. In this paper, we propose a new deblocking method in 3D images, i.e., video-plus-depth, using an overcomplete 3D dictionary. We generate the 3D dictionary from natural and depth images using the k-singular value decomposition (K-SVD) algorithm, and estimate an error threshold to utilize the 3D dictionary using a compression factor of the compressed images. Experimental results demonstrate that the proposed method is very effective in reducing undesirable blocking artifacts in 3D images.

Cheolkon Jung, Licheng Jiao, Hongtao Qi

Efficient HEVC to H.264/AVC Transcoding with Fast Intra Mode Decision

The High Efficiency Video Coding (HEVC) standard will soon reach its final draft. To provide the widely deployed H.264/AVC devices with HEVC video content, transcoding pre-encoded HEVC video into H.264/AVC format is highly necessary. The computational complexity of H.264 hinders real-time transcoding. In this paper, we propose an efficient HEVC to H.264 intra frame transcoder to accelerate the time-consuming H.264 intra mode decision while ensuring rate-distortion (RD) performance. The proposed transcoder incorporates a support vector machine (SVM) based macroblock (MB) partition mode decision and a fast prediction mode decision. Compared with the reference transcoder, which employs exhaustive search mode decision, our proposed transcoder can save 68.83% of transcoding time with a negligible 2.32% bit-rate increase on average.

Jun Zhang, Feng Dai, Yongdong Zhang, Chenggang Yan

SSIM-Based End-to-End Distortion Model for Error Resilient Video Coding over Packet-Switched Networks

Conventional end-to-end distortion models measure the overall distortion based on independent estimation of the source distortion and channel distortion. However, they do not correlate well with perceptual characteristics, in which a strong dependency exists among the source distortion, channel distortion and video content. As most compressed videos are presented to human viewers, a perception-based end-to-end distortion model should be developed for error resilient video coding. In this paper, we propose an SSIM-based end-to-end distortion model to optimally estimate the overall perceptual distortion due to quantization, error concealment and error propagation. Experiments show that the proposed end-to-end distortion model can bring significant visual quality improvement for H.264/AVC video coding over packet-switched networks.

Lei Zhang, Qiang Peng, Xiao Wu

A Novel and Robust System for Time Recognition of the Digital Video Clock Using the Domain Knowledge

This paper presents a novel and robust system for recognizing the time of a digital video clock by using domain knowledge. The system comprises a set of functions for time recognition, so that the user can conveniently use them to recognize the time of digital video clocks or to execute individual steps of time recognition. These functions are novel and robust because they use novel methods derived from domain knowledge of the digital video clock. These methods are region second periodicity, global maximum model, digit location model, digit-sequence recognition, and on-the-fly SVM. Experimental results show that both the functions and the system can achieve very high recognition accuracy.

Xinguo Yu, Tie Rong, Lin Li, Hon Wai Leong

On Modeling 3-D Video Traffic

Today’s 3-D video has become feasible on consumer electronics platforms through advances in display technology, signal processing, transmission technology, circuit design and computer power. In order to develop research studies on network transport of 3-D video, adequate traffic models are necessary. Particularly, the efficient and on-line generation of synthetic paths is fundamental for simulation studies. In this work we check the suitability of the M/G/∞ process for modeling the correlation structure of 3-D video.

M. E. Sousa-Vieira

Poster Papers

A Low-Complexity Quantization-Domain H.264/SVC to H.264/AVC Transcoder with Medium-Grain Quality Scalability

Scalable Video Coding (SVC) aims to provide the ability to adapt to heterogeneous requirements. It offers great flexibility for bitstream adaptation in multi-point applications. However, transcoding between SVC and AVC is necessary due to the existence of legacy AVC-based systems. This paper proposes a fast SVC-to-AVC MGS (Medium-Grain quality Scalability) transcoder. A quantization-domain transcoding architecture is proposed for transcoding non-KEY pictures in MGS. KEY pictures are transcoded by a drift-free architecture so that error propagation is constrained. Simulation results show that the proposed transcoder achieves an average 37-fold speed-up compared with the re-encoding method, with an acceptable loss in coding efficiency.

Lei Sun, Zhenyu Liu, Takeshi Ikenaga

Evaluation of Product Quantization for Image Search

Product quantization is an effective quantization scheme in which a high-dimensional space is decomposed into a Cartesian product of low-dimensional subspaces and quantization is conducted separately in each subspace. We briefly discuss the factors involved in designing a product quantizer, and then design experiments to comprehensively investigate how these factors influence the performance of image search. Through this evaluation we reveal design principles that have not been well investigated before.

Wei-Ta Chu, Chun-Chang Huang, Jen-Yu Yu
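
The decomposition described in the abstract above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the configuration evaluated in the paper: the helper names (`split`, `kmeans`, `train_pq`, `encode`, `decode`), the plain k-means codebook training, and the random 8-dimensional data are all assumptions. Each vector is cut into `m` sub-vectors, each sub-vector is quantized against its own codebook of `k` centroids, and the vector is stored as `m` small indices.

```python
import random


def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    d = len(vec) // m
    return [vec[i * d:(i + 1) * d] for i in range(m)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, iters=10, seed=0):
    """Plain k-means; an empty cluster keeps its previous centroid."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, g in enumerate(groups):
            if g:
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def train_pq(data, m, k):
    """Train one codebook of k centroids per subspace."""
    subs = [split(v, m) for v in data]
    return [kmeans([s[i] for s in subs], k) for i in range(m)]


def encode(vec, codebooks):
    """Quantize each sub-vector independently; the code is m indices."""
    parts = split(vec, len(codebooks))
    return [nearest(p, cb) for p, cb in zip(parts, codebooks)]


def decode(code, codebooks):
    """Concatenate the selected centroids to approximate the vector."""
    out = []
    for idx, cb in zip(code, codebooks):
        out.extend(cb[idx])
    return out


# Hypothetical data: 200 random 8-dimensional vectors, 4 subspaces, 8 centroids each.
rng = random.Random(1)
data = [[rng.random() for _ in range(8)] for _ in range(200)]
codebooks = train_pq(data, m=4, k=8)
code = encode(data[0], codebooks)
approx = decode(code, codebooks)
```

With `m = 4` and `k = 8`, the quantizer can represent 8^4 = 4096 distinct reconstruction points while storing only 32 centroids, which illustrates why the number of subspaces and the codebook size are key design factors.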

Rate-Quantization and Distortion-Quantization Models of Dead-Zone Plus Uniform Threshold Scalar Quantizers for Generalized Gaussian Random Variables

This paper presents the rate-distortion modeling of dead-zone plus uniform threshold scalar quantizers with nearly-uniform reconstruction quantizers (DZ+UTSQ/NURQ) for the generalized Gaussian distribution (GGD). First, we rigorously deduce the analytical rate-quantization (R-Q) and distortion-quantization (D-Q) functions of DZ+UTSQ/NURQ for the Laplacian distribution (an important special case of the GGD). Then we heuristically extend these results and obtain the R-Q and D-Q models for the GGD under DZ+UTSQ/NURQ. The effectiveness of the proposed GGD R-Q and D-Q models is confirmed from low to high bit rates via extensive simulation experiments, which suggests that they are efficient and accurate enough to guide various video applications in practice.

Yizhou Duan, Jun Sun, Zongming Guo

Flexible Presentation of Videos Based on Affective Content Analysis

The explosion of multimedia content has resulted in a great demand for video presentation. While most previous works focused on presenting certain types of videos or summarizing videos by event detection, we propose a novel method to present general videos of different genres based on affective content analysis. We first extract rich audio-visual affective features and select discriminative ones. Then we map effective features into corresponding affective states in an improved categorical emotion space using hidden conditional random fields (HCRFs). Finally we draw affective curves which show the types and intensities of emotions. With the curves and related affective visualization techniques, we select the most affective shots and concatenate them to construct an affective video presentation with a flexible and changeable type and length. Experiments on a representative video database from the web demonstrate the effectiveness of the proposed method.

Sicheng Zhao, Hongxun Yao, Xiaoshuai Sun, Xiaolei Jiang, Pengfei Xu

Dynamic Multi-video Summarization of Sensor-Rich Videos in Geo-Space

User-generated videos are much easier to produce today due to the progress in camera technology on mobile devices. The ubiquitous built-in sensors in digital devices greatly enrich these videos with sensor descriptions, especially geo-spatial properties. A repository of such sensor-rich videos can be a great source of information for prospective tourists when they plan to visit a city and would like to get a preview of its main areas.

In this study we propose an interactive geo-video search system. When a user specifies a start point and a destination (e.g., on a map), the system dynamically retrieves a video summarization along the path between the two points. Moreover, the query can be interactively updated during the video playback, by changing either the tour path or the target destination. The main features of our technique are, first, that it is fully automatic and leverages sensor meta-data information which is acquired in conjunction with videos. Second, the system dynamically adapts to query updates in real-time, and no prior knowledge is required by users. Third, a concise but comprehensive summarization from multiple user-generated videos is proposed for any queried route. Finally, the system incrementally adapts to the latest contributions to the video repository.

Ying Zhang, He Ma, Roger Zimmermann

Towards Automatic Music Performance Comparison with the Multiple Sequence Alignment Technique

In this paper, we propose an approach towards automatic music performance comparison based on the multiple sequence alignment technique. In this approach, the onset detection technique is first applied to the multi-version recordings of the same musical work. The signal between two adjacent onsets is represented with its corresponding chroma feature vector and symbolized as a chroma symbol. Thus a piece of music signal can be transformed into its associated chroma string. The progressive multiple sequence alignment technique is applied to these chroma strings to find a global alignment for multiple performances. After these chroma strings are aligned, dynamics and tempo comparisons among the multi-version performances can be carried out at various scales, such as a note, a phrase, or the whole song. Nine versions of CD recordings of the Sonatas and Partitas for Violin Solo, composed by Johann Sebastian Bach, are selected as the data set for the experiments. A phylogenetic tree for the nine performances can be automatically generated based on the distance matrix of their aligned chroma strings.

Chih-Chin Liu

Multi-frame Super Resolution Using Refined Exploration of Extensive Self-examples

The multi-frame super resolution (SR) problem is to generate high resolution (HR) images by referring to a sequence of low resolution (LR) images. However, traditional multi-frame SR methods fail to take full advantage of the redundancy in LR images. In this paper, we present a novel algorithm using a refined example-based SR framework to cope with this problem. The refined framework includes two innovative points. First, based upon a thorough study of multi-frame and single frame statistics, we extend the single frame example-based scheme to multi-frame. Instead of training an external dictionary, we search for examples in the image pyramids of the LR inputs, i.e., a set of multi-resolution images derived from the input LRs. Second, we propose a new metric to find similar image patches, which not only considers the intensity and structure features of a patch but also adaptively balances between these two parts. With the refined framework, we are able to make the most of the redundancy in LR images to facilitate the SR process. As can be seen from the experiments, it is efficient in preserving structural features. Experimental results also show that our algorithm outperforms state-of-the-art methods on test sequences, achieving an average PSNR gain of up to 1.2 dB.

Wei Bai, Jiaying Liu, Mading Li, Zongming Guo

Iterative Super-Resolution for Facial Image by Local and Global Regression

In this paper, we propose an iterative framework to super-resolve a facial image from a single low-resolution (LR) input. To retrieve local and global information, we first model two linear regressions for the local patch and global face, respectively. In both regression models, we restrict the responses of the regressors under the considerations of facial property and discriminability. Since the responses estimated from the LR training samples can be directly applied to the high-resolution (HR) training ones, the restricted linear regressions essentially describe the desired output. More specifically, the local regression reveals the facial details, and the global regression characterizes the features of the overall face. The final results are obtained by alternately applying the two regressions. Experimental results show the superiority of the proposed method over some state-of-the-art methods.

Fei Zhou, Biao Wang, Wenming Yang, Qingmin Liao

Stripe Model: An Efficient Method to Detect Multi-form Stripe Structures

We present a general mathematical model for multiple forms of stripes. Based on the model, we propose a scale-space method to detect stripes. This method generates difference of Gaussian (DoG) maps by subtracting neighboring Gaussian layers, and retains extremal responses in each DoG map by comparing it to its neighbors. Candidate stripe regions are then formed from connected extremal responses. After that, approximate centerlines of stripes are extracted from candidate stripe regions using non-maximum suppression, which simultaneously eliminates undesired edge responses. Stripe masks can then be restored from those centerlines with the estimated stripe width. Because it extracts candidate regions first, our method avoids costly directional calculations on all pixels, so it is very efficient. Experiments show the robustness and efficiency of the proposed method, and demonstrate that it can be applied to different kinds of applications in the image processing stage.

Yi Liu, Dongming Zhang, Junbo Guo, Shouxun Lin

Saliency-Based Content-Aware Image Mosaics

In this paper, we propose a novel content-aware image mosaic approach based on image saliency. The image saliency is used in the whole process of creating an image mosaic with variable-size tiles, while a novel energy map is proposed by combining Neighborhood Inhomogeneity Factor and Graph-Based Visual Saliency. The target image is divided into small tiles with variable sizes based on the energy map. Image retrieval is introduced to choose the tile images from a certain database. Considering the special requirements of tile image retrieval, we propose a new feature representation called brightness distribution vector, which indicates the global brightness distribution of an image. Extensive experiments are conducted to show that the proposed approach creates visually better mosaics than the conventional methods.

Dongyan Guo, Jinhui Tang, Jundi Ding, Chunxia Zhao

Combining Visual and Textual Systems within the Context of User Feedback

It has been shown experimentally that a combination of textual and visual representations can improve retrieval performance ([20], [23]). This is because the textual and visual feature spaces often represent complementary yet correlated aspects of the same image, thus forming a composite system.

In this paper, we present a model for the combination of visual and textual sub-systems within the user feedback context. The model was inspired by the measurement utilized in quantum mechanics (QM) and the tensor product of co-occurrence (density) matrices, which represents a density matrix of the composite system in QM. It provides a sound and natural framework to seamlessly integrate multiple feature spaces by considering them as a composite system, as well as a new way of measuring the relevance of an image with respect to a context. The proposed approach takes into account both intra (via co-occurrence matrices) and inter (via tensor operator) relationships between features’ dimensions. It is also computationally cheap and scalable to large data collections. We test our approach on the ImageCLEF2007photo data collection and present interesting findings.

Leszek Kaliciak, Dawei Song, Nirmalie Wiratunga, Jeff Pan

A Psychophysiological Approach to the Usability Evaluation of a Multi-view Video Browsing Tool

The aim of this study is to investigate the usability of a video browsing tool. The tool aims at facilitating content navigation and selection in media post-production, supporting also multi-view content. Psychophysiological measures such as skin conductance level are used to measure cognitive effort. Objective measures based on content retrieval tasks as well as self-report measures of usability are also reported. Results indicate the differential effect of introducing specific support for multi-view content in the browsing tool, and encourage further research on the use of psychophysiological techniques in usability evaluations.

Carmen Martinez-Peñaranda, Werner Bailer, Miguel Barreda-Ángeles, Wolfgang Weiss, Alexandre Pereda-Baños

Film Comic Generation with Eye Tracking

Automatic generation of film comics requires solving several challenging problems, such as selecting important frames that convey the whole story well, trimming the frames to fit the shape of panels without corrupting the composition of the original image, and arranging visually pleasing speech balloons without hiding important objects in the panel. We propose a novel approach to the automatic generation of film comics. The key idea is to aggregate eye-tracking data and image features into a computational map, called iMap, for quantitatively measuring the importance of frames in terms of story content and user attention. The transition of iMap in time sequences provides the solution to frame selection. Word balloon arrangement and image trimming are realized as the results of optimizing the energy functions derived from the iMap.

Tomoya Sawada, Masahiro Toyoura, Xiaoyang Mao

Quality Assessment of User-Generated Video Using Camera Motion

With user-generated video (UGV) becoming so popular on the Web, a reliable quality assessment (QA) measure of UGV is necessary for improving the users’ quality of experience in video-based applications. In this paper, we explore a low-cost QA of UGV based on how much irregular camera motion it contains. A block-match based optical flow approach is employed to extract camera motion features in UGV, from which irregular camera motion is calculated and automatic QA scores are given. Using a set of UGV clips from benchmarking datasets as a showcase, we observe that the QA scores from the proposed automatic method fit well with those from the subjective method. Further, the automatic method performs much better than a random run. These results confirm that the automatic QA scores adequately indicate the quality of the UGV when only visual camera motion is considered. They also show that UGV quality can be assessed automatically to improve the end users’ quality of experience in video-based applications.

Jinlin Guo, Cathal Gurrin, Frank Hopfgartner, Zhenxing Zhang, Songyang Lao

Multiscaled Cross-Correlation Dynamics on SenseCam Lifelogged Images

In this paper, we introduce and evaluate a novel approach, namely the use of the cross-correlation matrix and the Maximum Overlap Discrete Wavelet Transform (MODWT), to analyse SenseCam lifelog data streams. SenseCam is a device that can automatically record images and other data from the wearer’s whole day. It is a significant challenge to deconstruct a sizeable collection of images into meaningful events for users. The cross-correlation matrix was used to characterise dynamical changes in non-stationary multivariate SenseCam images. MODWT was then applied to equal-time correlation matrices over different time scales and used to explore the granularity of the largest eigenvalue and changes in the ratio of the sub-dominant eigenvalue spectrum dynamics over sliding time windows. By examination of the eigenspectrum, we show that these approaches can identify “Distinct Significant Events” for the wearers. The dynamics of the eigenvalue spectrum across multiple scales provide useful insight into the details of major events in SenseCam logged images.

N. Li, M. Crane, H. J. Ruskin, Cathal Gurrin

Choreographing Amateur Performers Based on Motion Transfer between Videos

We propose a technique for quickly and easily choreographing a video of an amateur performer by comparing it with a video of a corresponding professional performance. Our method allows the user to interactively edit the amateur performance in order to synchronize it with the professional performance in terms of timings and poses. In our system, the user first extracts the amateur and professional poses from every frame via semi-automatic video tracking. The system synchronizes the timings by computing dynamic time warping (DTW) between the two sets of video-tracking data, and then synchronizes the poses by applying image deformation to every frame. To eliminate unnatural vibrations, which often result from inaccurate video tracking, we apply an automatic motion-smoothing algorithm to the synthesized animation. We demonstrate that our method allows the user to successfully edit an amateur’s performance into a more polished one, utilizing the Japanese sumo wrestling squat, the karate kick, and the moonwalk as examples.

Kenta Mizui, Makoto Okabe, Rikio Onai
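
The timing synchronization step in the abstract above relies on dynamic time warping (DTW). The following is a minimal sketch of textbook DTW on two 1-D sequences, not the authors' system: the function name `dtw`, the absolute-difference cost, and the toy sequences standing in for per-frame tracking data are assumptions for illustration.

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Return (total cost, warping path) of the optimal DTW alignment of a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover which frames are matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Hypothetical pose values: the second sequence lingers on two poses.
professional = [0, 1, 2, 3]
amateur = [0, 0, 1, 2, 2, 3]
total, path = dtw(professional, amateur)
```

Each pair `(i, j)` in `path` says that frame `i` of the first sequence corresponds to frame `j` of the second, so frames of one video can be repeated or skipped to match the timing of the other.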

Large Scale Image Retrieval with Practical Spatial Weighting for Bag-of-Visual-Words

Most large scale image retrieval systems are based on Bag-of-Visual-Words (BoV). Typically, no spatial information about the visual words is used despite the ambiguity of visual words. To address this problem, we introduce a spatial weighting framework for BoV to encode spatial information, inspired by Geometry-preserving Visual Phrases (GVP). We first interpret the GVP method using this framework. We reveal that GVP gives too large a spatial weighting when calculating the L2-norm for images, due to its implicit assumption of the independence of co-occurring GVPs. This makes GVP sensitive to images with a small number of visual words. Then we propose an improved practical spatial weighting for BoV (PSW-BoV) to alleviate this effect while maintaining efficiency. Experiments on Oxford 5K and MIR Flickr 1M show that PSW-BoV is robust to images with a small number of visual words, and also improves the general retrieval accuracy.

Fangyuan Wang, Hai Wang, Heping Li, Shuwu Zhang

Music Retrieval in Joint Emotion Space Using Audio Features and Emotional Tags

Emotion-based music retrieval provides a natural and humanized way to help people experience music. In this paper, we utilize the three-dimensional Resonance-Arousal-Valence emotion model to represent the emotions invoked by music, and establish the relationship between acoustic features and their emotional impact based on this model. In addition, we consider the emotional tag features for music, and represent acoustic features and emotional tag features jointly in a low-dimensional embedding space for music emotion. The joint emotion space is optimized by minimizing the joint loss of acoustic features and emotional tag features through dimension reduction. Finally, we construct a unified framework for music retrieval in the joint emotion space by means of query-by-music, query-by-tag, or both, and then utilize our proposed ranking algorithm to return an optimized ranked list that has the highest emotional similarity. The experimental results show that the joint emotion space and unified framework can produce satisfying results for emotion-based music retrieval.

James J. Deng, C. H. C. Leung

Analyzing Favorite Behavior in Flickr

Liking or marking an object, event, or resource as a favorite is one of the most pervasive actions in social media. This particular action plays an important role in platforms in which a lot of content is shared. In this paper we take a large sample of users in Flickr and analyze logs of their favorite actions considering factors such as time period, type of connection with the owner of the photo, and other aspects. The objective of our work is, on the one hand, to gain insights into the “liking” behavior in social media, and on the other hand, to inform strategies for recommending items users may like. We place particular focus on analyzing the relationship between recent photos uploaded by a user’s connections and the favorite action, noting that a direct application of our work would lead to algorithms for recommending to users a subset of these “recently uploaded” photos that they might favorite. We compare several features derived from our analysis in terms of how effective they might be in retrieving favorite photographs.

Marek Lipczak, Michele Trevisiol, Alejandro Jaimes

Unequally Weighted Video Hashing for Copy Detection

In this paper, an unequally weighted video hashing algorithm is presented, in which visual saliency is used to generate the video hash and to weight different hash bits. The proposed video hash is fused from two hashes: the spatio-temporal hash (ST-Hash), generated according to the spatio-temporal video information, and the visual hash (V-Hash), generated according to the visual saliency distribution. In order to emphasize the contribution of visually salient regions to video content, the Weighted Error Rate (WER) is defined as an unequally weighted hash matching method to take the place of the bit error rate (BER). The WER, unlike the BER, gives hash bits unequal weights according to their corresponding visual saliency in hash matching. Experiments verify the robustness and discrimination of the proposed video hashing algorithm and show that WER-based hash matching helps to achieve better precision and recall rates.

Jiande Sun, Jing Wang, Hui Yuan, Xiaocui Liu, Ju Liu
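
The difference between BER and a saliency-weighted error rate, as described in the abstract above, can be illustrated with a small sketch. The abstract does not give the exact WER formula, so the normalization used here (sum of the weights of mismatched bits divided by the sum of all weights) is an assumption, as are the function names and the example hashes and weights.

```python
def bit_error_rate(h1, h2):
    """Fraction of hash bits that differ; every bit counts equally."""
    return sum(a != b for a, b in zip(h1, h2)) / len(h1)


def weighted_error_rate(h1, h2, weights):
    """Assumed WER form: weight of mismatched bits over total weight."""
    mismatched = sum(w for a, b, w in zip(h1, h2, weights) if a != b)
    return mismatched / sum(weights)


# Hypothetical 4-bit hashes; bits 0 and 2 come from salient regions.
hash_a = [1, 0, 1, 1]
hash_b = [1, 1, 1, 0]
saliency = [4, 1, 4, 1]
ber = bit_error_rate(hash_a, hash_b)
wer = weighted_error_rate(hash_a, hash_b, saliency)
```

Here both mismatches fall on low-saliency bits, so the weighted rate (0.2) is well below the unweighted one (0.5); with uniform weights the two measures coincide.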

