
2016 | Book

MultiMedia Modeling

22nd International Conference, MMM 2016, Miami, FL, USA, January 4-6, 2016, Proceedings, Part I

Edited by: Qi Tian, Nicu Sebe, Guo-Jun Qi, Benoit Huet, Richang Hong, Xueliang Liu

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

The two-volume set LNCS 9516 and LNCS 9517 constitutes the refereed proceedings of the 22nd International Conference on Multimedia Modeling, MMM 2016, held in Miami, FL, USA, in January 2016.

The 32 revised full papers and 52 poster papers presented were carefully reviewed and selected from 117 submissions. In addition, 20 papers were accepted for five special sessions from 38 submissions, along with 7 demonstrations (from 11 submissions) and 9 video showcase papers.

The papers are organized in topical sections on video content analysis, social media analysis, object recognition and system, multimedia retrieval and ranking, multimedia representation, machine learning in multimedia, and interaction and mobile. The special sessions are: good practices in multimedia modeling; semantics discovery from multimedia big data; perception, aesthetics, and emotion in multimedia quality modeling; multimodal learning and computing for human activity understanding; and perspectives on multimedia analytics.

Table of Contents

Frontmatter

Regular Papers

Frontmatter
Video Event Detection Using Kernel Support Vector Machine with Isotropic Gaussian Sample Uncertainty (KSVM-iGSU)

In this paper, we propose an algorithm that learns from uncertain data and exploits related videos for the problem of event detection; related videos are those that are closely associated, though not fully depicting the event of interest. In particular, two extensions of the linear SVM with Gaussian Sample Uncertainty are presented, which (a) lead to non-linear decision boundaries and (b) incorporate related class samples in the optimization problem. The resulting learning methods are especially useful in problems where only a limited number of positive and related training observations are provided, e.g., for the 10Ex subtask of TRECVID MED, where only ten positive and five related samples are provided for the training of a complex event detector. Experimental results on the TRECVID MED 2014 dataset verify the effectiveness of the proposed methods.

Christos Tzelepis, Vasileios Mezaris, Ioannis Patras
Video Content Representation Using Recurring Regions Detection

In this work we present an approach for video content representation based on the detection of recurring visual elements or regions. We hypothesize that such elements play a potentially central role in the underlying video sequence. The approach makes use of fundamental intrinsic properties of a video and, thus, it does not make any assumptions about the video content itself. Furthermore, our approach does not require any training or prior knowledge about the general settings and video domain. Preliminary experiments with a small and heterogeneous dataset of web videos demonstrate the potential of the approach to be employed as a compact summary of the video content with focus on its central visual elements. Additionally, the resulting representations enable the retrieval of video sequences sharing common visual elements.

Lukas Diem, Maia Zaharieva
Group Feature Selection for Audio-Based Video Genre Classification

The performance of video genre classification approaches strongly depends on the selected feature set. Feature selection requires expert knowledge and is commonly driven by the underlying data, the investigated video genres, and previous experience in related application scenarios. Any alteration of the genres of interest requires an expert to reconsider the employed features. In this work, we introduce an unsupervised method for the selection of features that efficiently represent the underlying data. Experiments performed in the context of audio-based video genre classification demonstrate the outstanding performance of the proposed approach and its robustness across different video datasets and genres.

Gerhard Sageder, Maia Zaharieva, Christian Breiteneder
Computational Cartoonist: A Comic-Style Video Summarization System for Anime Films

This paper presents Computational Cartoonist, a comic-style anime summarization system that detects key frames and generates comic layouts automatically. In contrast to previous studies, we define evaluation criteria based on the correspondence between anime films and original comics to determine whether the result of comic-style summarization is relevant. To detect key frames in anime films, the proposed system segments the input video into a series of basic temporal units and computes frame importance using image characteristics such as motion. Subsequently, comic-style layouts are decided on the basis of pre-defined templates stored in a database. Several results demonstrate the efficiency of our key frame detection over previous methods by evaluating the matching accuracy between key frames and original comic panels.

Tsukasa Fukusato, Tatsunori Hirai, Shunya Kawamura, Shigeo Morishima
Exploring the Long Tail of Social Media Tags

There are millions of users who tag multimedia content, generating a large vocabulary of tags. Some tags are frequent, while other tags are rarely used, following a long-tail distribution. For frequent tags, most of the multimedia methods that aim to automatically understand audio-visual content give excellent results. It is not clear, however, how these methods will perform on rare tags. In this paper we investigate which social tags constitute the long tail and how they perform in two multimedia retrieval scenarios, tag relevance and detector learning. We show common valuable tags within the long tail, and by augmenting them with semantic knowledge, the performance of tag relevance and detector learning improves substantially.

Svetlana Kordumova, Jan van Gemert, Cees G. M. Snoek
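The head/tail split of a tag vocabulary described in the abstract above can be sketched by ranking tags by usage frequency. The cut-off fraction below is a hypothetical choice for illustration; the paper's actual criterion may differ.

```python
from collections import Counter

def split_head_tail(tag_lists, head_fraction=0.2):
    """Split a tag vocabulary into frequent 'head' tags and rare
    'long tail' tags by usage frequency.

    tag_lists: iterable of per-item tag lists.
    head_fraction: fraction of the ranked vocabulary treated as the
    head (an illustrative cut-off, not the paper's).
    """
    counts = Counter(tag for tags in tag_lists for tag in tags)
    ranked = [t for t, _ in counts.most_common()]
    cut = max(1, int(len(ranked) * head_fraction))
    return ranked[:cut], ranked[cut:]

# toy example: 'cat' is frequent, the rest form the long tail
items = [["cat"], ["cat", "ocelot"], ["cat", "serval"], ["cat", "caracal"]]
head, tail = split_head_tail(items, head_fraction=0.25)
```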
Visual Analyses of Music Download History: User Studies

Users’ download history is a primary data source for analyzing user interests. Recent work has shown that user interests are indeed time-varying, and accurate profiling of user interest drifts requires temporal dynamic analysis. We have proposed a visualization approach to analyzing user interest drifts from the download history, taking music as an example, and studied how to depict the underlying relevance among the downloaded music items to identify the drifts. We designed three new kinds of plots to display the music download history of one user, namely the Bean plot, the Transitional Pie plot, and the Instrument plot. In this paper, we report on user studies in which ordinary users visually analyze the download history of other users in a given real-world data set. The user studies follow a learning-practice-test workflow. The results demonstrate the feasibility of our visualization design.

Dong Liu, Jingxian Zhang
Personalized Annotation for Mobile Photos Based on User’s Social Circle

For mobile photo annotation, users are more interested in the context information behind the photos. The user’s social circle can provide valuable information for this task. However, the accompanying textual information of social networks is sparse and ambiguous in nature. In this paper, we propose a personalized annotation framework for mobile photos leveraging the user’s social circle. To address the unreliability problem of social networks, we present an algorithm to generate reliable tags for social photos before assigning tags to the user’s unlabeled photos. In the tag generation stage, a multi-modality hierarchical clustering algorithm is performed to detect social events. Besides, we use the “album” instead of the individual photo as the basic unit for clustering. Finally, we employ a weighted nearest neighbor model for label propagation. We evaluate our framework on a large-scale, real-world dataset from Renren, the largest Facebook-like social network in China. The evaluation shows promising results for our proposed framework.

Yanhui Hong, Tiandi Chen, Kang Zhang, Lifeng Sun
Utilizing Sensor-Social Cues to Localize Objects-of-Interest in Outdoor UGVs

A huge number of outdoor user-generated videos (UGVs) are recorded daily due to the popularity of mobile intelligent devices. Managing these videos is a tough challenge in the multimedia field. In this paper, we tackle this problem by performing object-of-interest (OOI) recognition in UGVs to identify semantically important regions. By leveraging geo-sensor and social data, we propose a novel framework for OOI recognition in outdoor UGVs. Firstly, OOI acquisition is conducted to obtain an OOI frame set from UGVs. Simultaneously, classified object set recommendation is performed to obtain a candidate category name set from social networks. Afterward, a spatial pyramid representation is deployed to describe social objects from images and OOIs from UGVs, respectively. Finally, OOIs with their annotated names are labeled in UGVs. Extensive experiments on outdoor UGVs from both Nanjing and Singapore demonstrate the competitiveness of our approach.

Yingjie Xia, Luming Zhang, Liqiang Nie, Wenjing Geng
NEWSMAN: Uploading Videos over Adaptive Middleboxes to News Servers in Weak Network Infrastructures

An interesting recent trend, enabled by the ubiquitous availability of mobile devices, is that regular citizens report events which news providers then disseminate, e.g., CNN iReport. Often such news is captured in places with very weak network infrastructures, and it is imperative that a citizen journalist can quickly and reliably upload videos in the face of slow, unstable, and intermittent Internet access. We envision that some middleboxes are deployed to collect these videos over energy-efficient short-range wireless networks. Multiple videos may need to be prioritized, and then optimally transcoded and scheduled. In this study we introduce an adaptive middlebox design, called NEWSMAN, to support citizen journalists. NEWSMAN jointly considers two aspects under varying network conditions: (i) choosing the optimal transcoding parameters, and (ii) determining the uploading schedule for news videos. We design, implement, and evaluate an efficient scheduling algorithm to maximize a user-specified objective function. We conduct a series of experiments using trace-driven simulations, which confirm that our approach is practical and performs well. For instance, NEWSMAN outperforms the existing algorithms (i) by 12 times in terms of system utility (i.e., sum of utilities of all uploaded videos), and (ii) by 4 times in terms of the number of videos uploaded before their deadline.

Rajiv Ratn Shah, Mohamed Hefeeda, Roger Zimmermann, Khaled Harras, Cheng-Hsin Hsu, Yi Yu
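The deadline-aware scheduling idea above can be sketched with a simple greedy heuristic: upload videos in decreasing utility-per-second order and keep those that still finish before their deadline. This is only a simplified stand-in for NEWSMAN's actual algorithm (which also selects transcoding parameters); the names and numbers are hypothetical.

```python
def greedy_schedule(videos, bandwidth):
    """Greedy upload schedule sketch over a single link.

    videos: list of (name, size_mb, utility, deadline_s) tuples.
    bandwidth: link speed in MB/s.
    Returns the ordered list of scheduled names and the total utility.
    """
    # rank by utility earned per second of upload time
    order = sorted(videos, key=lambda v: v[2] / (v[1] / bandwidth),
                   reverse=True)
    t, scheduled, total_utility = 0.0, [], 0.0
    for name, size, utility, deadline in order:
        duration = size / bandwidth
        if t + duration <= deadline:   # fits before its deadline
            t += duration
            scheduled.append(name)
            total_utility += utility
    return scheduled, total_utility
```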
Computational Face Reader

Chinese anthroposcopy, with its long history, has demonstrated often satisfying capabilities for telling the characteristics (mostly exaggerated as fortune) of a person by reading his/her face, i.e., understanding fine-grained facial attributes (e.g., single/double-fold eyelid, position of moles). In this paper, we study the face-reading problem from the computer vision perspective and present a computational face reader to automatically infer the characteristics of a person based on his/her face. For example, it can estimate the attractive and easy-going characteristics of a Chinese person from his/her big eyes according to the Chinese anthroposcopy literature. Specifically, to estimate these fine-grained facial attributes well, we propose a novel deep convolutional network, called FRP-net, in which a facial region pooling layer (FRP layer) is embedded. The FRP layer uses searched facial region windows (which locate these facial attributes) instead of the commonly used sliding windows. The experiments on facial attribute estimation demonstrate the potential of the automatic face reader framework, and qualitative and quantitative evaluations from the attractive and smart perspectives of face reading validate the excellence of the presented face reader framework.

Xiangbo Shu, Liyan Zhang, Jinhui Tang, Guo-Sen Xie, Shuicheng Yan
Posed and Spontaneous Expression Recognition Through Restricted Boltzmann Machine

This paper presents a new method to recognize posed and spontaneous expressions by modeling their global spatial patterns with a Restricted Boltzmann Machine (RBM). First, the displacements of facial feature points between apex and onset facial images are extracted as features, which capture spatial variations of facial points. Second, point-displacement-related facial events are extracted from these displacements. Third, two RBM models are trained to capture the spatial patterns embedded in posed and spontaneous expressions, respectively. The recognition results on both the USTC-NVIE and SPOS databases demonstrate the effectiveness of the proposed RBM approach in modeling the complex spatial patterns embodied in posed and spontaneous expressions, as well as its good performance in distinguishing posed from spontaneous expressions.

Chongliang Wu, Shangfei Wang
DFRS: A Large-Scale Distributed Fingerprint Recognition System Based on Redis

With the fast growth of users, matching a given fingerprint with the ones in a massive database precisely and efficiently becomes more and more difficult. To fight this challenging issue in the “big data” era, we have designed a novel large-scale distributed Redis-based fingerprint recognition system called DFRS, which introduces an innovative framework for fingerprint processing while incorporating many key technologies for data compression and computing acceleration. By using the Base64 compressive encoding method together with a key-value pair storage structure, a space reduction of up to 40 % can be achieved in our experiments, which is particularly important as Redis is an in-memory read-write NoSQL data storage system. To compensate for the cost introduced by compressive encoding, parallel decoding is adopted with the help of OpenMP, cutting the time by more than one third. Furthermore, the granularity-based division (RM$$+$$AM architecture) and the Quick-Return strategy bring significant improvements in matching time, making the whole system, DFRS, feasible and efficient at large scale for massive data volumes.

Bing Li, Zhen Huang, Jinbang Chen, Yifan Yuan, Yuxing Peng
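The key-value storage pattern described above can be sketched as follows. A plain dict stands in for a Redis connection, and the zlib-then-Base64 pipeline is an assumption for illustration; the paper's actual "Base64 compressive encoding" may differ.

```python
import base64
import zlib

class FingerprintStore:
    """Minimal key-value fingerprint template store sketch."""

    def __init__(self):
        self._db = {}   # stand-in for a Redis connection

    def put(self, finger_id, template_bytes):
        # compress, then Base64-encode so the stored value is text-safe
        packed = base64.b64encode(zlib.compress(template_bytes))
        self._db[f"fp:{finger_id}"] = packed

    def get(self, finger_id):
        packed = self._db[f"fp:{finger_id}"]
        return zlib.decompress(base64.b64decode(packed))

store = FingerprintStore()
template = bytes(range(64)) * 8      # toy 512-byte template
store.put("user42", template)
restored = store.get("user42")
```

With a real Redis client, `put`/`get` would map directly onto `SET`/`GET` with the same `fp:<id>` key convention.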
Logo Recognition via Improved Topological Constraint

Real-world logo recognition is challenging mainly due to varying viewpoints and different lighting conditions. Currently, the most popular approaches are usually based on the bag-of-words model due to their good performance. However, their shortcomings lie in two main aspects: (1) wrong recognition results caused by mismatched keypoints, and (2) high computational complexity and extra noise caused by the large number of keypoints that are irrelevant to the target logo. To address these two problems, we propose a new approach which combines feature selection and topological constraints for logo recognition. Firstly, feature selection is applied to filter out most of the irrelevant keypoints. Secondly, an improved topological constraint, which considers the relative position between a keypoint and its neighboring points, is proposed to reduce the number of mismatched keypoints. It is proven in this paper that the proposed constraint can remove, from the k nearest neighbors of a keypoint, those keypoints that are not on the same planar surface as the others. This property is very important for logo recognition because logos are planar objects in the real world. The proposed approach is evaluated on two challenging logo recognition benchmarks, FlickrLogos-32 and FlickrLogos-27, and the experimental results show its effectiveness compared to other popular methods.

Panpan Tang, Yuxin Peng
Compound Figure Separation Combining Edge and Band Separator Detection

We propose an image processing algorithm to automatically separate compound figures appearing in scientific articles. We classify compound images into two classes and apply different algorithms for detecting vertical and horizontal separators to each class: the edge-based algorithm aims at detecting visible edges between subfigures, whereas the band-based algorithm tries to detect whitespace separating subfigures (separator bands). The proposed algorithm has been evaluated on two datasets for compound figure separation (CFS) in the biomedical domain and compares well to semi-automatic or more comprehensive state-of-the-art approaches. Additional experiments investigate CFS effectiveness and classification accuracy of various classifier implementations.

Mario Taschwer, Oge Marques
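The band-based separator detection described above can be sketched as a scan for runs of near-white columns. The whiteness threshold and minimum band width are illustrative assumptions, not the paper's tuned parameters.

```python
import numpy as np

def find_separator_bands(gray, white=240, min_width=3):
    """Detect vertical whitespace bands (candidate subfigure
    separators) in a grayscale image: a band is a run of columns
    whose pixels are all near-white. Returns (start, end) column
    ranges. Horizontal bands follow by transposing the image.
    """
    is_white_col = (gray >= white).all(axis=0)   # per-column whiteness
    bands, start = [], None
    for x, w in enumerate(is_white_col):
        if w and start is None:
            start = x                            # band begins
        elif not w and start is not None:
            if x - start >= min_width:
                bands.append((start, x))         # band ends
            start = None
    if start is not None and gray.shape[1] - start >= min_width:
        bands.append((start, gray.shape[1]))     # band reaches the edge
    return bands
```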
Camera Network Based Person Re-identification by Leveraging Spatial-Temporal Constraint and Multiple Cameras Relations

With the rapid development of multimedia technology and the vast demand for video investigation, long-term cross-camera object tracking is increasingly important in practical surveillance scenes. Because conventional Paired Cameras based Person Re-identification (PCPR) cannot fully satisfy this requirement, a new framework named Camera Network based Person Re-identification (CNPR) is introduced. Two phenomena are investigated and explored in this paper. First, the same person cannot simultaneously appear in two non-overlapping cameras. Second, the closer two cameras are, the more relevant they are, in the sense that persons can transit between them with high probability. Based on these two phenomena, a probabilistic method is proposed with reference to both visual difference and spatial-temporal constraints to address the novel CNPR problem. (i) The spatial-temporal constraint is utilized as a filter to narrow the search space for the specific query object, and the Weibull distribution is exploited to formulate the spatial-temporal probability indicating the possibility of pedestrians walking to a certain camera at a certain time. (ii) The spatial-temporal probability and the visual feature probability are combined to generate the ranking list. (iii) The multiple camera relations related to the transitions are explored to further optimize the obtained ranking list. Quantitative experiments conducted on the TMin and CamNeT datasets show that the proposed method achieves better performance on the novel CNPR problem.

Wenxin Huang, Ruimin Hu, Chao Liang, Yi Yu, Zheng Wang, Xian Zhong, Chunjie Zhang
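The Weibull-based spatial-temporal probability from step (i) can be sketched as below, with a simple product used to combine it with visual similarity. The product fusion rule and the shape/scale parameters are illustrative assumptions, not the paper's fitted values.

```python
import math

def weibull_pdf(t, k, lam):
    """Weibull density f(t; k, lam), modeling the probability of a
    camera-to-camera transition taking time t."""
    if t <= 0:
        return 0.0
    return (k / lam) * (t / lam) ** (k - 1) * math.exp(-(t / lam) ** k)

def rerank(candidates, k=1.5, lam=60.0):
    """Combine visual similarity with spatial-temporal probability.

    candidates: list of (person_id, visual_similarity, transit_time_s).
    Returns candidates sorted by the fused score, best first.
    """
    scored = [(pid, vis * weibull_pdf(dt, k, lam))
              for pid, vis, dt in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

A visually strong match with an implausible transit time is pushed down the list, which is exactly the filtering effect the constraint is meant to provide.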
Global Contrast Based Salient Region Boundary Sampling for Action Recognition

Although the excellent representation ability of improved Dense Trajectory (iDT) based features for action videos has been proven on several action datasets, the performance of action recognition still suffers from the large camera motion in videos. In this paper, we improve the iDT method by introducing a novel salient region boundary based dense sampling strategy, which reduces the number of trajectories while preserving the discriminative power. We first implement iDT sampling based on the motion boundary image, then introduce a global contrast based salient object segmentation method in the interest point sampling step of action recognition. To overcome the flaws of global color contrast based salient region sampling, we apply the morphological gradient to generate a more robust mask for sampling dense points, as motion boundaries are much clearer. To evaluate the proposed method, we conduct extensive experiments on two benchmarks, HMDB51 and UCF50. The results show that our sampling strategy can improve the performance of action recognition with only the minor computational cost of mask production. In particular, on the HMDB51 dataset, the improvement over the original iDT result is 3 %. Meanwhile, any other dense features for action recognition can achieve more competitive performance simply by utilizing our sampling strategy and Fisher vector encoding.

Zengmin Xu, Ruimin Hu, Jun Chen, Huafeng Chen, Hongyang Li
Elastic Edge Boxes for Object Proposal on RGB-D Images

Object proposal is utilized as a fundamental preprocessing step in various multimedia applications by detecting the candidate regions of objects in images. In this paper, we propose a novel object proposal method, named elastic edge boxes, integrating window scoring and grouping strategies and utilizing both color and depth cues in RGB-D images. We first efficiently generate the initial bounding boxes by edge boxes, and then adjust them by grouping the super-pixels within an elastic range. In bounding box adjustment, the effectiveness of the depth cue as well as the color cue is explored to handle complex scenes and provide accurate box boundaries. To validate the performance, we construct a new RGB-D image dataset for object proposal, the largest of its kind, with a balanced object number distribution. The experimental results show that our method can effectively and efficiently generate bounding boxes with accurate locations, and it outperforms the state-of-the-art methods considering both accuracy and efficiency.

Jing Liu, Tongwei Ren, Jia Bei
Pairing Contour Fragments for Object Recognition

Contour fragments are adept at interpreting the characteristics of object boundaries, but ill-suited to encoding the information of an object's interior region. In this paper, inspired by the Gestalt principle that people can perceive an object's interior region by grouping similar and proximate fragments, we propose to pair contour fragments to encode more information about the object interior. To this end, we propose a pairing algorithm to generate Contour Fragment Pairs (CFPs). According to the proposed algorithm, the fragments of a valid CFP are required to be co-occurrent over the training images, similar in shape, and proximate to each other. With a valid CFP, we can represent object shape using its fragments and the object interior using the region between its fragments. Finally, we design a boosting algorithm to select and assemble many CFPs into a classifier. The proposed classifier is competent at localizing objects with bounding boxes, delineating boundaries, and segmenting foreground. Moreover, the method has the further merit of requiring only annotated bounding boxes as training data. Experiments on public datasets show that the proposed approach achieves very promising performance.

Wei Zheng, Qian Zhang, Zhixuan Li, Junjun Xiong
Instance Search with Weak Geometric Correlation Consistency

Finding object instances in large image collections is a challenging problem with many practical applications. Recent methods inspired by text retrieval have achieved good results; however, a re-ranking stage based on spatial verification is still required to boost performance. To improve the effectiveness of such instance retrieval systems while avoiding the computational complexity of a re-ranking stage, we explore the geometric correlations among local features and incorporate these correlations with each individual match to form a transformation consistency in rotation and scale space. This weak geometric correlation consistency can be used to effectively eliminate inconsistent feature matches and can be applied to all candidate images at a low computational cost. Experimental results on three standard evaluation benchmarks show that the proposed approach results in a substantial performance improvement compared with recently proposed methods.

Zhenxing Zhang, Rami Albatal, Cathal Gurrin, Alan F. Smeaton
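The rotation/scale consistency idea above can be sketched as a voting scheme: each match contributes a rotation difference and a log scale ratio, and only matches agreeing with the dominant transformation bin are kept. The bin sizes and keypoint layout are illustrative assumptions, a simplified stand-in for the paper's weak geometric correlation consistency.

```python
import math
from collections import Counter

def filter_matches(matches, angle_bin=30.0, scale_bin=0.5):
    """Keep only feature matches whose rotation and scale change agree
    with the dominant transformation.

    matches: list of (query_kp, db_kp) pairs, where each keypoint is a
    (x, y, scale, angle_deg) tuple.
    """
    def vote(m):
        (qx, qy, qs, qa), (dx, dy, ds, da) = m
        d_angle = (da - qa) % 360.0          # rotation change
        d_scale = math.log2(ds / qs)         # scale change (log domain)
        return (int(d_angle // angle_bin), int(round(d_scale / scale_bin)))

    votes = Counter(vote(m) for m in matches)
    dominant, _ = votes.most_common(1)[0]    # most supported transform
    return [m for m in matches if vote(m) == dominant]
```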
Videopedia: Lecture Video Recommendation for Educational Blogs Using Topic Modeling

Two main sources of educational material for online learning are e-learning blogs such as Wikipedia, Edublogs, etc., and online videos hosted on various sites such as YouTube, Videolectures.net, etc. Students would benefit if both the text and the videos were presented to them in an integrated platform. As the two types of systems are separately designed, the major challenge in leveraging both sources is how to obtain video material that is relevant to an e-learning blog. We aim to build a system that seamlessly integrates text-based blogs and online videos and recommends relevant videos for explaining the concepts given in a blog. Our algorithm uses content extracted from video transcripts generated by closed captions. We use topic modeling to map videos and blogs into a common semantic space of topics. After matching videos and blogs in the space of topics, videos with high similarity values are recommended for the blogs. The initial results are plausible and confirm the effectiveness of the proposed scheme.

Subhasree Basu, Yi Yu, Vivek K. Singh, Roger Zimmermann
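Once a topic model such as LDA has mapped a blog and each video transcript into the shared topic space, the matching step above reduces to ranking videos by topic-vector similarity. The sketch below assumes cosine similarity and uses made-up topic vectors as stand-ins for real model outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(blog_topics, video_topics, top_k=2):
    """Rank candidate videos by similarity to the blog in topic space.

    blog_topics: topic distribution of the blog.
    video_topics: dict mapping video name -> topic distribution.
    """
    ranked = sorted(video_topics.items(),
                    key=lambda kv: cosine(blog_topics, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]
```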
Towards Training-Free Refinement for Semantic Indexing of Visual Media

Indexing of visual media based on content analysis has now moved beyond using individual concept detectors, and there is now a focus on combining concepts or post-processing the outputs of individual concept detection. Due to the limitations of available training corpora, which are usually sparsely and imprecisely labeled, training-based refinement methods for semantic indexing of visual media struggle to correctly capture relationships between concepts, including co-occurrence and ontological relationships. In contrast to the training-dependent methods which dominate this field, this paper presents a training-free refinement (TFR) algorithm for enhancing semantic indexing of visual media based purely on concept detection results, making semantic-enhancement-based refinement of initial concept detections practical and flexible. This is achieved using global and temporal neighbourhood information inferred from the original concept detections by means of weighted non-negative matrix factorization and neighbourhood-based graph propagation, respectively. Any available ontological concept relationships can also be integrated into this model as an additional source of external a priori knowledge. Experiments on two datasets demonstrate the efficacy of the proposed TFR solution.

Peng Wang, Lifeng Sun, Shiqang Yang, Alan F. Smeaton
Deep Learning Generic Features for Cross-Media Retrieval

Cross-media retrieval is an imperative approach to handling the explosive growth of multimodal data on the web. However, effectively uncovering the correlations between multimodal data has been a barrier to successful retrieval of cross-media data. Traditional approaches learn the connection between multiple modalities by directly utilizing hand-crafted low-level heterogeneous features, and the learned correlations are merely constructed in terms of high-level feature representations. To exploit the intrinsic structures of multimodal data, it is essential to build up an interpretable correlation between the modalities. In this paper, we propose a deep model to learn the high-level feature representation shared by multiple modalities for cross-media retrieval. We learn the discriminative high-level feature representation in a data-driven manner before faithfully encoding the multimodal correlations. We use large-scale multimodal data crawled from the Internet to train our deep model and evaluate its effectiveness on cross-media retrieval using the NUS-WIDE dataset. The experimental results show that the proposed model outperforms other state-of-the-art approaches.

Xindi Shang, Hanwang Zhang, Tat-Seng Chua
Cross-Media Retrieval via Semantic Entity Projection

Cross-media retrieval is becoming increasingly important nowadays. To address this challenging problem, most existing approaches project heterogeneous features into a unified feature space to facilitate their similarity computation. However, this unified feature space usually has no explicit semantic meaning, which may ignore the hints contained in the original media content, and thus it is not able to fully measure the similarities among different media types. Considering these issues, we propose a new approach to cross-media retrieval via semantic entity projection (SEP) in this paper. Our approach consists of three main steps. Firstly, an entity level with fine-grained semantics between low-level features and high-level concepts is constructed, so as to help bridge the semantic gap to a certain extent. Then, an entity projection from low-level features to the entity level is learned by minimizing both the cross-media correlation error and the single-media reconstruction error, with which a unified feature space with explicit semantic meaning can be obtained from low-level features. Finally, the semantic abstraction of high-level concepts is generated by using logistic regression to conduct cross-media retrieval. Experimental results on the Wikipedia dataset show the effectiveness of the proposed approach.

Lei Huang, Yuxin Peng
Visual Re-ranking Through Greedy Selection and Rank Fusion

Image search re-ranking has proven its effectiveness in text-based image search systems. However, traditional re-ranking algorithms rely heavily on the relevance of the top-ranked images. Due to the huge semantic gap between the query and the images, the text-based retrieval result is often unsatisfactory. Besides, a single re-ranking model has large variance and is easy to over-fit, whereas multiple re-ranking models can better balance the bias and the variance. In this paper, we first conduct label de-noising to filter out false-positive images. Then a simple greedy graph-based re-ranking algorithm is proposed to derive the resulting list. Afterwards, different images are chosen as seed images to perform re-ranking multiple times. Using rank fusion, the results from the different graphs are combined to form a better result. Extensive experiments are conducted on the INRIA web353 dataset and demonstrate that our method achieves significant improvement over state-of-the-art methods.

Bin Lin, Ai Wei, Xinmei Tian
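The fusion step above, which combines the ranked lists produced from different seed images, can be sketched with reciprocal rank fusion. This is one standard fusion rule, used here as an assumption; the paper's exact fusion method may differ.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists into one by reciprocal rank scoring:
    an item ranked r in one list contributes 1 / (k + r) to its score.

    rankings: list of ranked lists (best item first in each).
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Items that are consistently near the top across the re-ranking runs dominate the fused list, which smooths out the variance of any single seed image.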
No-reference Image Quality Assessment Based on Structural and Luminance Information

Research on no-reference image quality assessment (IQA) aims to develop a computational model simulating the human perception of image quality accurately and automatically, without any prior information about the reference clean image signals. In this paper, we introduce a novel no-reference IQA metric based on the analysis of structural degradation and luminance changes. Since the human visual system (HVS) is highly sensitive to structural distortion, we encode the image structural information as the local binary pattern (LBP) distribution. Besides, image quality is also affected by luminance changes, which cannot be captured properly by the LBP thresholding mechanism. Hence, the distribution of normalized luminance magnitudes is also included in the proposed IQA metric. Extensive experiments conducted on two large public image databases demonstrate the effectiveness and robustness of the proposed metric in comparison with relevant state-of-the-art metrics.

Qiaohong Li, Weisi Lin, Jingtao Xu, Yuming Fang, Daniel Thalmann
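The LBP distribution used as the structural descriptor above can be sketched as follows: each pixel's 8 neighbours are thresholded against the centre value to form an 8-bit code, and the normalized histogram of codes serves as the feature. This is the basic 3x3 LBP; the paper may use a different radius or a rotation-invariant variant.

```python
import numpy as np

def lbp_histogram(gray):
    """Normalized histogram of 3x3 (8-neighbour) local binary
    patterns over a grayscale image."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                       # centre pixels
    # neighbour offsets in clockwise order, each contributing one bit
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes += (nb >= c).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=256)
    return hist / hist.sum()                # LBP distribution
```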
Learning Multiple Views with Orthogonal Denoising Autoencoders

Multi-view learning techniques are necessary when data are described by multiple distinct feature sets, because single-view learning algorithms tend to overfit on such high-dimensional data. Prior successful approaches followed either the consensus or the complementary principle. Recent work has focused on learning both the shared and private latent spaces of views in order to take advantage of both principles. However, these methods cannot ensure that the latent spaces are strictly independent merely by encouraging orthogonality in their objective functions. Also, little work has explored representation learning techniques for multi-view learning. In this paper, we use the denoising autoencoder to learn shared and private latent spaces, with orthogonal constraints that disconnect every private latent space from the remaining views. Instead of computationally expensive optimization, we adapt the backpropagation algorithm to train our model.

TengQi Ye, Tianchun Wang, Kevin McGuinness, Yu Guo, Cathal Gurrin
Fast Nearest Neighbor Search in the Hamming Space

Recent years have witnessed growing interest in computing compact binary codes and binary visual descriptors to alleviate the heavy computational costs of large-scale visual research. However, it is still computationally expensive to linearly scan large-scale databases for nearest neighbor (NN) search. In [15], a new approximate NN search algorithm is presented: with the concept of bridge vectors, which correspond to the cluster centers of Product Quantization [10], and an augmented neighborhood graph, it is possible to adopt an extract-on-demand strategy in the online querying stage to search with priority. This paper generalizes the algorithm to the Hamming space with an alternative version of k-means clustering. Despite its simplicity, our approach achieves competitive performance compared to state-of-the-art methods, namely MIH and FLANN, in terms of search precision, accessed data volume, and average querying time.

Zhansheng Jiang, Lingxi Xie, Xiaotie Deng, Weiwei Xu, Jingdong Wang
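For context, the baseline that such methods accelerate is an exhaustive Hamming-distance scan over packed binary codes. A minimal sketch of that baseline (not the paper's graph-based search), using a lookup-table popcount:

```python
import numpy as np

# 8-bit popcount lookup table, built once
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_distances(query, codes):
    """Hamming distance between one packed binary code and a database.

    `query` is a uint8 vector of packed bits; `codes` is an (N, D) uint8 array."""
    xor = np.bitwise_xor(codes, query)      # differing bits, per byte
    return _POPCOUNT[xor].sum(axis=1)       # count them

def linear_scan_nn(query, codes, k=5):
    """Exhaustive k-NN scan -- the O(N) baseline approximate methods beat."""
    d = hamming_distances(query, codes)
    idx = np.argsort(d, kind="stable")[:k]
    return idx, d[idx]
```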
SOMH: A Self-Organizing Map Based Topology Preserving Hashing Method

Hashing-based approximate nearest neighbor search techniques have attracted considerable attention in the media search community. An essential problem of hashing is to preserve the neighborhood relationship during the hashing map. In this paper, we propose a self-organizing map based hashing method, SOMH, which can not only keep similarity relationships but also preserve the topology of the data. Specifically, in SOMH, a self-organizing map is introduced to map data points into Hamming space. To make the framework work well on both short and long binary codes, we propose a relaxed version of SOMH and a product-space SOMH, respectively. For the optimization problem of relaxed SOMH, we also present an iterative solution. To test the performance of SOMH, we conduct experiments on two benchmark datasets, SIFT1M and GIST1M. Experimental results show that SOMH outperforms or is comparable to several state-of-the-art methods.

Xiao-Long Liang, Xin-Shun Xu, Lizhen Cui, Shanqing Guo, Xiao-Lin Wang
Describing Images with Ontology-Aware Dictionary Learning

In this paper, we focus on generating contextual descriptions for images by learning an ontology-aware dictionary. Ontology deals with questions concerning what entities exist and how such entities can be related within a hierarchy. Thus, if we incorporate the semantic hierarchies of visual concepts into a learned visual dictionary consisting of visual atoms, we can generate contextual descriptions of test images through reconstruction. This paper proposes to learn the ontology-aware dictionary by integrating hierarchical dictionary learning and multi-task regression into a joint framework. By utilizing a hierarchical regularization term defined on the multiple semantic categories, hierarchical structures are introduced into the multi-task regression, and the joint optimization of sparse coding and multi-task regression embeds the semantic hierarchies into the learned dictionary. Experiments on two benchmark datasets show the superior performance of the proposed algorithm, and examples of the ontology-aware dictionary and generated image descriptions demonstrate the effectiveness of the proposed framework.

Chengyue Zhang, Yahong Han
Quality Analysis on Mobile Devices for Real-Time Feedback

Media capture of live events such as concerts can be improved by including user-generated content, adding more perspectives and possibly covering scenes outside the scope of professional coverage. In this paper we propose methods for visual quality analysis on mobile devices, in order to give the contributing user direct feedback about the quality of the captured content; wasting bandwidth and battery on uploading/streaming low-quality content can thus be avoided. We focus on real-time quality analysis that complements information obtainable from other sensors (e.g., stability). The proposed methods include real-time capable algorithms for sharpness, noise and over-/underexposure, which are integrated into a capture app for Android. Objective evaluation results show that our algorithms are competitive with state-of-the-art quality algorithms while enabling real-time quality feedback on mobile devices.

Stefanie Wechtitsch, Hannes Fassold, Marcus Thaler, Krzysztof Kozłowski, Werner Bailer
Interactive Search in Video: Navigation With Flick Gestures vs. Seeker-Bars

On touch-based devices such as smartphones and tablets, users are accustomed to browsing through lists and collections with flick gestures. For video navigation, however, mobile touch devices still use the seeker-bar interaction concept. In this paper, we evaluate the performance of a flick gesture-based video player in direct comparison to a default video player with seeker-bar navigation for the purpose of interactive search in video. We developed a special video player on a tablet device and performed a user study with 16 users and two types of interactive search: target/known-item search and scene counting. Our results show that the flick-based video player is slower than the default video player in terms of search time, but more efficient at finding target scenes, and it was the interface preferred by the vast majority of tested users.

Klaus Schoeffmann, Marco A. Hudelist, Bonifaz Kaufmann, Kevin Chromik
Second-Layer Navigation in Mobile Hypervideo for Medical Training

Hypervideos pose particular navigation challenges due to their underlying graph structure. Especially when used on tablets or by older people, a lack of clarity may lead to confusion and rejection of this type of medium. To avoid confusion, the hypervideo can be extended with a well-known table of contents, which, because of the underlying graph structure, needs to be created separately by the authors. In this work, we present an extended table-of-contents presentation for hypervideos on mobile devices. The design was tested in a real-world medical training scenario with people older than 45, the main target group of these applications. This user group is a particular challenge, since its members sometimes have limited experience with mobile devices and physical deficiencies that grow with age. Our user interface was designed in three steps. The findings of an expert group and a survey were used to create two different prototypical versions of the display, which were then tested against each other in a user test. This test revealed that a divided view is desired: the table of contents in an easy-to-touch version should be on the left side of the view, and previews of scenes on the right. These findings were implemented in the existing SIVA HTML5 open source player (https://code.google.com/p/siva-producer/ (accessed February 06, 2015)) and tested with a second group of users. This test led to only minor changes in the GUI.

Britta Meixner, Matthias Gold

Poster Papers

Frontmatter
Reverse Testing Image Set Model Based Multi-view Human Action Recognition

Recognizing human activities from videos has become a hot research topic in computer vision, but many studies show that action recognition based on a single view cannot obtain satisfying performance. Many researchers have therefore turned their attention to multi-view action recognition, yet how to mine the relationships among different views remains a challenging problem. Video face recognition based on image sets has shown that image-set algorithms can effectively mine the complementary properties of images from different views and achieve satisfying performance. Inspired by this, we utilize image sets to mine the relationships among views in multi-view action recognition. However, studies show that the number of samples in the gallery and query sets affects the performance of image-set based video face recognition, where several tens to several hundreds of samples are supplied; in multi-view action recognition, we only have 3-5 views (samples) in each query set, which limits the effectiveness of image sets. To solve these issues, we propose a reverse testing image set model (RTISM) for multi-view human action recognition. We first extract dense trajectory features for each camera and construct a shared codebook for all cameras by k-means; the Bag-of-Words (BoW) weighting scheme is then employed to encode these features for each camera. Second, for each query set, we compute the compound distance to each image subset in the gallery set and add the nearest image subset (RTIS) to the query set. Finally, RTISM is optimized by reconstructing the query set and the RTIS as a whole over the gallery set, so that the relationships among different actions in the gallery set and the complementary properties of different samples in the query set are exploited simultaneously.
Large-scale experimental results on two public multi-view 3D action datasets, Northwestern UCLA and CVS-MV-RGBD-Single, show that the reconstruction of the query set over the gallery set is very effective and that adding the RTIS to the query set is very helpful for classification; moreover, the performance of RTISM is comparable to that of state-of-the-art methods.

Z. Gao, Y. Zhang, H. Zhang, G. P. Xu, Y. B. Xue
Face Image Super-Resolution Through Improved Neighbor Embedding

In investigating a case, the face image is the most interesting clue. However, due to limitations of the imaging conditions and low-cost cameras, captured face images are often Low-Resolution (LR) and cannot be used for criminal investigation. Face image super-resolution is the technology of inferring a High-Resolution (HR) face image from an observed LR face image, and it has recently been a topic of wide concern. In this paper, we propose a novel face image super-resolution method based on Tikhonov Regularized Neighbor Representation (TRNR for short). It overcomes the technological bottlenecks (e.g., unstable solutions) of the patch representation problem in traditional neighbor embedding based image super-resolution methods. Specifically, we introduce a Tikhonov regularization term to regularize the representation of the observed LR patch, which gives rise to a unique and stable solution of the least squares problem and produces detailed and discriminative HR faces. Extensive experiments on face image super-resolution are carried out to validate the generality, effectiveness, and robustness of the proposed algorithm. Experimental results on the public FEI face database show that the proposed method achieves better subjective and objective performance, recovering more fine structures and details from an input low-resolution image than previously reported methods.

Kebin Huang, Ruimin Hu, Junjun Jiang, Zhen Han, Feng Wang
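Read this way, the Tikhonov-regularized neighbor representation is a ridge-regularized least-squares fit of the observed LR patch over its K nearest LR training patches, with the same weights transferred to the corresponding HR patches. A minimal sketch under that reading (the regularization weight λ is an assumed setting):

```python
import numpy as np

def tikhonov_neighbor_weights(lr_patch, lr_neighbors, lam=1e-2):
    """Represent an observed LR patch over K neighbour LR training patches.

    lr_neighbors: (d, K) matrix whose columns are the neighbour patches.
    Solves  min_w ||x - N w||^2 + lam ||w||^2 ,
    whose normal equations (N^T N + lam I) w = N^T x always have a
    unique, stable solution -- the point of the Tikhonov term."""
    N = lr_neighbors
    A = N.T @ N + lam * np.eye(N.shape[1])
    b = N.T @ lr_patch
    return np.linalg.solve(A, b)

def reconstruct_hr_patch(weights, hr_neighbors):
    """Transfer the LR weights to the corresponding HR training patches."""
    return hr_neighbors @ weights
```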
Adaptive Multichannel Reduction Using Convex Polyhedral Loudspeaker Array

Multichannel audio systems usually need large numbers of loudspeakers with special placement, or many subjective evaluations to find an optimal arrangement. This makes their application difficult in a typical home environment, where a small number of loudspeakers, few subjective evaluations and placement flexibility are highly desirable. In this paper, a design of convex polyhedral loudspeaker arrays surrounding the target region is proposed for reducing the multichannel system adaptively, and an error metric is derived to narrow down the subjective evaluation list. In this way, a well-performing loudspeaker arrangement can be chosen with few subjective evaluations and placed in a restricted environment. The reproduction accuracy of the method is verified through numerical simulations, and subjective evaluations indicate its effectiveness.

Lingkun Zhang, Ruimin Hu, Dengshi Li, Xiaochen Wang, Weiping Tu
Dominant Set Based Data Clustering and Image Segmentation

Clustering is an important approach in image segmentation. While various clustering algorithms have been proposed, the majority of them require one or more parameters as input, making them somewhat inflexible in practical applications. To solve this parameter-dependence problem, in this paper we present a parameter-free clustering algorithm based on dominant sets. We first study the influence of regularization parameters on dominant-sets clustering results and, based on this, select an appropriate regularization parameter to generate over-segmented clustering results. In the next step we merge clusters based on the relationship between intra-cluster and inter-cluster similarities. While being simple, our algorithm is shown to improve clustering quality significantly compared with the dominant sets algorithm in data clustering and image segmentation experiments. It also performs comparably to or better than several other clustering algorithms that take manually selected parameters as input.

Jian Hou, Chunshi Sha, Hongxia Cui, Lei Chi
An R-CNN Based Method to Localize Speech Balloons in Comics

Comic books enjoy great popularity around the world, and more and more people choose to read them on digital devices, especially mobile ones. However, the screens of most mobile devices are not big enough to display an entire comic page directly. As a consequence, without any reflow or adaptation of the original books, users often find the text on comic pages hard to recognize when reading comics on mobile devices. Because the text on a comic page usually comes with a surrounding speech balloon, given the positions of the speech balloons it becomes quite easy to further process the text to make it easier to read. It is therefore important to devise an effective method for localizing speech balloons in comics, yet only a few studies have been done in this direction. In this paper, we propose a Regions with Convolutional Neural Network (R-CNN) based method to localize speech balloons in comics. Experimental results demonstrate that the proposed method localizes speech balloons effectively and accurately.

Yongtao Wang, Xicheng Liu, Zhi Tang
Facial Age Estimation with Images in the Wild

In this paper, we investigate facial age estimation with images in the wild, aiming to utilize images from the Internet to alleviate the problem of imbalanced age distribution. First, we crawl 14,283 images with their context from Wikipedia and infer an age label for each image from its context. After face detection, facial landmark detection and alignment, we build an image set for facial age estimation containing 9,456 faces with significant variations. Then, we exploit cost-sensitive learning algorithms, including biased-penalty SVMs and random forests, for age estimation, using the images in the wild as the training set. We propose using a Gaussian function to determine varied misclassification costs. Within-database experiments conducted on two public aging datasets illustrate the performance improvement brought by the introduction of images in the wild. Furthermore, our cross-database experiments validate the generalization capability of the proposed cost-sensitive age estimator.

Ming Zou, Jianwei Niu, Jinpeng Chen, Yu Liu, Xiaoke Zhao
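One plausible form of Gaussian-determined misclassification costs is a cost that grows, then saturates, as the predicted age moves away from the true age; the exact functional form and the bandwidth σ below are illustrative assumptions, not the paper's specification:

```python
import numpy as np

def gaussian_cost_matrix(ages, sigma=5.0):
    """Cost of predicting age a_j when the true age is a_i:
    1 - exp(-(a_i - a_j)^2 / (2 sigma^2)).
    Nearby ages are cheap mistakes; distant ages cost close to 1."""
    a = np.asarray(ages, dtype=float)
    d = a[:, None] - a[None, :]              # pairwise age differences
    return 1.0 - np.exp(-d ** 2 / (2.0 * sigma ** 2))
```

Such a matrix could then weight the penalty terms of a cost-sensitive SVM or the sample weights of a random forest.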
Fast Visual Vocabulary Construction for Image Retrieval Using Skewed-Split k-d Trees

Most image retrieval approaches nowadays are based on the Bag-of-Words (BoW) model, which allows an image to be represented efficiently and quickly. The efficiency of the BoW model is related to the efficiency of the visual vocabulary. In general, visual vocabularies are created by clustering all available visual features, formulating specific patterns. Clustering techniques are k-means oriented, and for very large datasets they are replaced by approximate k-means methods. In this work, we propose a faster construction of visual vocabularies than existing methods in the case of SIFT descriptors, based on our observation that the values of the 128-dimensional SIFT descriptors follow an exponential distribution. Applying our method to image retrieval on specific image datasets shows that the mean Average Precision is not reduced by our approximation, even though the visual vocabulary is constructed significantly faster than with state-of-the-art methods.

Ilias Gialampoukidis, Stefanos Vrochidis, Ioannis Kompatsiaris
OGB: A Distinctive and Efficient Feature for Mobile Augmented Reality

The distinctiveness and efficiency of the feature descriptor used for object recognition and tracking are fundamental to the user experience of a mobile augmented reality (MAR) system. However, existing descriptors are either too computationally expensive to achieve real-time performance on a mobile device, or not sufficiently distinctive to identify correct matches in a large database. As a result, current MAR systems are still limited in both functionality and capability, which greatly restricts their deployment in practice. In this paper, we propose a highly distinctive and efficient binary descriptor, called Oriented Gradients Binary (OGB). OGB captures the major edge/gradient structure, an important characteristic of local shape and appearance. Specifically, OGB computes the distribution of major edge/gradient directions within an image patch. To achieve high efficiency, aggressive down-sampling is applied to the patch to significantly reduce the computational complexity while maintaining the major edge/gradient directions within the patch. Compared to state-of-the-art binary descriptors including ORB, BRISK and FREAK, which are primarily designed for speed, OGB has similar construction efficiency while achieving superior performance on both object recognition and tracking tasks running on a mobile handheld device.

Xin Yang, Xinggang Wang, Kwang-Ting (Tim) Cheng
Learning Relative Aesthetic Quality with a Pairwise Approach

Image aesthetic quality assessment is very useful in many multimedia applications. However, most existing work restricts quality assessment to a binary classification problem, i.e., classifying the aesthetic quality of images into a “high” or “low” category. The strategy applied is to learn the mapping from aesthetic features to the absolute binary labels of images. This binary label description is restrictive and fails to capture the general relative relationships between images. We propose a pairwise ranking framework that takes image pairs as input to address this challenge. The main idea is to generate and select image pairs so as to utilize the relative ordering information between images rather than absolute binary label information. We test our approach on two large-scale public datasets. The experimental results show clear advantages over the traditional binary classification-based approach.

Hao Lv, Xinmei Tian
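A pairwise ranking objective of this kind is commonly written as a RankSVM-style hinge loss over (higher, lower) aesthetic pairs. The toy subgradient solver below is illustrative of the idea, not the paper's training procedure:

```python
import numpy as np

def pairwise_hinge_loss(w, X_high, X_low, margin=1.0):
    """Mean hinge loss over pairs: the higher-quality image of each pair
    should score at least `margin` above the lower-quality one."""
    return np.maximum(0.0, margin - (X_high - X_low) @ w).mean()

def rank_train(X_high, X_low, lr=0.1, lam=1e-3, epochs=200):
    """Plain subgradient descent on the regularized pairwise hinge loss."""
    w = np.zeros(X_high.shape[1])
    diff = X_high - X_low
    for _ in range(epochs):
        active = (diff @ w) < 1.0          # pairs still violating the margin
        grad = lam * w
        if active.any():
            grad = grad - diff[active].mean(axis=0)
        w -= lr * grad
    return w
```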
Robust Crowd Segmentation and Counting in Indoor Scenes

This paper proposes a fast counting approach to estimate the number of people in indoor scenes. First, a pre-processing step comprising color correlation, image smoothing and contrast stretching is applied to obtain a robust gray image under complex lighting conditions. Second, we extract the foreground region by background edge modeling and contour filling. Finally, after foreground normalization based on camera calibration, we obtain the counting results with template matching. Experimental results show that, compared with the Bayesian counting approach [2], our approach is robust to illumination variation and achieves real-time counting in indoor scenes.

Ren Yang, Huazhong Xu, Jinqiao Wang
Robust Sketch-Based Image Retrieval by Saliency Detection

Sketch-based image retrieval (SBIR) has been studied extensively for decades because sketching is one of the most intuitive ways to describe ideas. However, the large expressional gap between hand-drawn sketches and natural images with small-scale complex structures is the fundamental challenge for SBIR systems. We present a novel framework to efficiently retrieve images with a query sketch based on saliency detection. In order to extract the primary contours of the scene and suppress textures, a hierarchical saliency map is computed for each image, and object contours are extracted from the saliency map instead of the original natural image. Histograms of oriented gradients (HOG) are extracted at multiple scales on a dense gradient field. Using a bag-of-visual-words representation and an inverted index structure, our system efficiently retrieves images by sketches. The experimental results on a dataset of 15 k photographs demonstrate that our method performs well for a wide range of natural scenes.

Xiao Zhang, Xuejin Chen
Image Classification Using Spatial Difference Descriptor Under Spatial Pyramid Matching Framework

The spatial pyramid matching (SPM) model is an extension of the bag-of-visual-words (BoW) model for local feature encoding. It first partitions the image into increasingly fine sub-regions and then concatenates the histograms computed within each sub-region. However, the SPM model does not explicitly consider the differences in spatial information between sub-regions. To make use of this information, we propose a novel descriptor called the spatial difference. To improve image classification performance, this descriptor is used to augment the concatenated BoW histograms under the spatial pyramid matching framework. Finally, we conduct image classification experiments on several public datasets to demonstrate the effectiveness of the proposed scheme.

Yuhui Li, Jiucheng Xu, Yifan Zhang, Chunjie Zhang, Hongsheng Yin, Hanqing Lu
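The underlying SPM pipeline that the descriptor builds on can be sketched as follows: partition the image into increasingly fine grids and concatenate per-cell BoW histograms. The grid depth and per-cell normalization here are illustrative choices, not the paper's exact configuration:

```python
import numpy as np

def spatial_pyramid_histogram(word_ids, xs, ys, width, height, vocab, levels=2):
    """Concatenate BoW histograms over 1x1, 2x2, ..., 2^L x 2^L grids.

    word_ids: visual word index of each local feature;
    xs, ys: integer pixel coordinates of each feature."""
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level
        col = xs * cells // width       # grid column of each feature
        row = ys * cells // height      # grid row of each feature
        for i in range(cells):
            for j in range(cells):
                in_cell = (row == i) & (col == j)
                hist = np.bincount(word_ids[in_cell], minlength=vocab).astype(float)
                feats.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(feats)
```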
Exploring Relationship Between Face and Trustworthy Impression Using Mid-level Facial Features

When people look at a face, they subconsciously build an affective impression of the person, which is very useful information in social contact. Exploring the relationship between facial appearance in portraits and personality impressions is an interesting and challenging issue in the multimedia area. In this paper, a novel method for building the relationship between facial appearance and personality impression is proposed. Low-level visual features are first extracted on face regions defined according to psychology. Then, to alleviate the semantic gap between the low-level features and high-level affective features, a mid-level feature set is built through clustering. Finally, a classification model is trained on our dataset. Comprehensive experiments demonstrate the effectiveness of our method, which improves F1-measure by 26.24 % and recall by 54.28 % under similar precision compared to state-of-the-art works. Evaluation of different mid-level feature combinations further illustrates the promise of the proposed method.

Yan Yan, Jie Nie, Lei Huang, Zhen Li, Qinglei Cao, Zhiqiang Wei
Edit-Based Font Search

This paper presents an interactive font search method for users who have no significant knowledge of fonts. The proposed method suggests font candidates by deforming a displayed font image into a font resembling the user’s desired font. Generally, font category names, font tags regarding human impressions such as “Cute”, and font lists are used for searching for a font. However, there are some issues: knowledge of font category names is required, each user gets a different impression from a font, and searching a font list becomes tedious as the number of candidates increases. We expect the proposed method to solve these problems, because it allows users to search for a font easily.

Ken Ishibashi, Kazunori Miyata
Private Video Foreground Extraction Through Chaotic Mapping Based Encryption in the Cloud

Recently, the storage and processing of large-scale visual media data have been outsourced to Cloud Data Centres (CDCs). However, CDCs are third-party entities, so the privacy of users’ visual media data may be leaked to the public or to unauthorized parties. In this paper we propose a method for privacy-preserving foreground extraction of surveillance video through chaotic mapping based encryption in the cloud. The client captures surveillance videos, which are encrypted by our proposed chaotic mapping based encryption method. The encrypted videos are transmitted to the cloud server, which runs the foreground extraction algorithm directly on them. The results are transmitted back to the client, where they are decrypted to obtain the extraction results for the plain videos. The extraction correctness on encrypted videos is similar to that on plain videos. The proposed method has several advantages: (1) the server only learns the obfuscated extraction results and cannot recognize anything from them; (2) with our encryption method, the original extraction method for plain videos need not be changed; (3) the chaotic mapping ensures a high level of security and resistance to several attacks.

Xin Jin, Kui Guo, Chenggen Song, Xiaodong Li, Geng Zhao, Jing Luo, Yuzhen Li, Yingya Chen, Yan Liu, Huaichao Wang
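The flavor of chaotic-mapping encryption can be illustrated with the logistic map, a standard chaotic keystream generator; the map, its parameters, and the XOR construction below are a generic sketch, not the authors' cipher:

```python
import numpy as np

def logistic_keystream(n, x0=0.63, r=3.99):
    """Keystream from the logistic map x_{k+1} = r x_k (1 - x_k).

    For r near 4 the orbit is chaotic and extremely sensitive to x0 and r,
    which here play the role of the secret key (illustrative values)."""
    xs = np.empty(n)
    x = x0
    for i in range(n):
        x = r * x * (1.0 - x)
        xs[i] = x
    return (xs * 256).astype(np.uint8)   # quantize the orbit to bytes

def encrypt_frame(frame, key_stream):
    """XOR a uint8 video frame with the chaotic keystream.
    Decryption applies the same operation with the same key."""
    return np.bitwise_xor(frame.ravel(), key_stream).reshape(frame.shape)
```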
Evaluating Access Mechanisms for Multimodal Representations of Lifelogs

Lifelogging, the automatic and ambient capture of daily life activities into a digital archive called a lifelog, is an increasingly popular activity with a wide range of application areas, including medical (memory support), behavioural science (analysis of quality of life), work-related (auto-recording of tasks) and more. In this paper we focus on lifelogging where there is sometimes a need to re-find something from one’s past, recent or distant, in the lifelog. To be effective, a lifelog should be accessible across a variety of access devices. In the work reported here we create eight lifelogging interfaces and evaluate their effectiveness on three access devices (laptop, smartphone and e-book reader) for a searching task. Based on tests with 16 users, we identify which of the eight interfaces are most effective on each access device in a known-item search task through the lifelog, both for the lifelog owner and for other searchers. Our results are important in suggesting ways in which personal lifelogs can be most effectively used and accessed.

Zhengwei Qiu, Cathal Gurrin, Alan F. Smeaton
Analysis and Comparison of Inter-Channel Level Difference and Interaural Level Difference

The directional perception of the human ear for sounds in the horizontal plane mainly depends on the binaural cues: Interaural Level Difference (ILD), Interaural Time Difference (ITD) and Interaural Correlation (IC). ILD plays the leading role in localizing sounds with frequencies above 1.5 kHz. In spatial audio applications, the Inter-Channel Level Difference (ICLD) between loudspeaker signals is used to represent the location of phantom sources generated by two loudspeakers. For headphone applications, ILD and ICLD are approximately equal, so the perceptual characteristics of ILD can be used as a replacement for those of ICLD. However, due to the attenuation of the transfer from loudspeakers to the ears, the ICLD between loudspeaker signals is no longer the same as the ILD between the signals arriving at the two ears, and this difference is usually ignored in current spatial audio applications such as the perceptual coding of spatial parameters. In this paper we therefore analyze and compare ICLD and ILD, from their formation to their values under different loudspeaker configurations. Experimental results show that the difference between ILD and ICLD can be up to 55 dB, and this research may serve as an important reference for further work on spatial audio applications such as coding and reconstruction.

Tingzhao Wu, Ruimin Hu, Li Gao, Xiaochen Wang, Shanfa Ke
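Both quantities are level differences in dB; what changes is where they are measured (loudspeaker feeds for ICLD, ear-entrance signals for ILD). A minimal RMS-based computation:

```python
import numpy as np

def level_difference_db(sig_a, sig_b, eps=1e-12):
    """Level difference in dB between two signals, from their RMS levels.

    Applied to two loudspeaker feed signals this gives an ICLD;
    applied to the signals arriving at the two ears it gives an ILD."""
    rms_a = np.sqrt(np.mean(np.square(sig_a)) + eps)
    rms_b = np.sqrt(np.mean(np.square(sig_b)) + eps)
    return 20.0 * np.log10(rms_a / rms_b)
```

Halving one channel's amplitude, for example, changes the level difference by about 6 dB.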
Automatic Scribble Simulation for Interactive Image Segmentation Evaluation

To provide comprehensive evaluation of interactive image segmentation algorithms, we propose an automatic scribble simulation approach. We first analyze the variety of scribbles labelled by different users and its influence on the segmentation result. Then, we describe the consistency and inconsistency of scribbles with a normal distribution at the superpixel and superpixel-group levels, and analyze the effect of scribble connectivity on interactive segmentation evaluation. Based on the above analysis, we simulate scribbles on the foreground and background respectively by randomly selecting superpixel groups and superpixels with previously determined coverage values. The experimental results show that scribbles simulated by the proposed approach obtain evaluation results similar to manually labelled scribbles and avoid serious deviation in precision and recall evaluation.

Bingjie Jiang, Tongwei Ren, Jia Bei
Multi-modal Image Re-ranking with Autoencoders and Click Semantics

Image re-ranking is effective in improving the text-based image retrieval experience. However, constructing an efficacious algorithm to achieve this is limited by two important issues: first, the visual features extracted from images for re-ranking are too superficial to represent the full information contained within images; second, the corresponding text information often mismatches the semantics of the images. In this paper, we utilize autoencoders to extract deeper image features and exploit click data to bridge the semantic gap between query words and image semantics. A graph-based algorithm (MIR-AC) is proposed to adaptively integrate features from autoencoders and click information by constructing two manifolds with updated weights. In particular, MIR-AC completes image re-ranking through an iterative optimization process in which image ranking scores and manifold weights are updated alternately. Experiments are conducted on a real-world dataset, and the results demonstrate that MIR-AC outperforms state-of-the-art methods in image re-ranking.

Chaohui Tang, Qingxin Zhu, Chaoqun Hong, Jun Yu
Sketch-Based Image Retrieval with a Novel BoVW Representation

A novel Bag-of-Visual-Words (BoVW) based approach is developed in this paper to facilitate more effective Sketch-Based Image Retrieval (SBIR). We focus on constructing the visual vocabulary for the BoVW representation from both the spatial distribution and the inter-relationships of descriptors. To optimize sketch-image matching, a weighting quantization is created that integrates both neighbor and spatial feature information to quantize features into visual words. We emphasize an inverted index built by converting an image to a trigram representation of visual words and their spatial information. Our experiments have obtained very positive results.

Cheng Jin, Chenjie Li, Zheming Wang, Yuejie Zhang, Tao Zhang
Symmetry-Aware Human Shape Correspondence Using Skeleton

In this paper, we propose a symmetry-aware human shape correspondence extraction method. We address the symmetric flip problem that arises in establishing correspondences for intrinsically symmetric models, and improve the accuracy of the final corresponding pairs. To achieve this goal, we extend a state-of-the-art approach by using skeleton information to remove symmetrically flipped shape correspondences. Traditional approaches that rely only on surface geometry can hardly discriminate between symmetric surface points. With the advent of inexpensive RGB-D cameras such as the Kinect, skeleton information can easily be obtained along with the mesh. Therefore, after the initial correspondences are obtained, we extend the candidate set for each point on the template and then use the skeleton to remove symmetrically flipped false candidates. From the remaining candidates, final correspondences are chosen as those with minimum geodesic distortion from a base vertex set formed by sampling the mesh. Experiments demonstrate that the proposed method effectively removes all symmetrically flipped candidates, and the final correspondence pairs are more accurate than those of the state of the art.

Zongyi Xu, Qianni Zhang
XTemplate 4.0: Providing Adaptive Layouts and Nested Templates for Hypermedia Documents

A hypermedia composite template defines generic structures of nodes and links that can be reused in different hypermedia compositions. XTemplate is an XML-based language for the definition of hypermedia composite templates. XTemplate can currently be used to create templates for NCL documents, but other hosting languages can also be used. In current versions of hypermedia document template languages, including XTemplate, there is no facility for defining template layouts. This work extends XTemplate, incorporating the concept of adaptive layouts. Adaptive layouts enable the definition of generic presentation characteristics for multimedia documents that are instantiated at processing time and adapted to the number of media objects declared in a given document that uses a template. Another important facility that this work incorporates in XTemplate is hypermedia composite template nesting. Template nesting enables the inclusion of template components inside other hypermedia composite templates, thus making the use of multiple nested templates transparent to the document author who uses templates.

Glauco F. Amorim, Joel A. F. dos Santos, Débora C. Muchaluat-Saade
Level Ratio Based Inter and Intra Channel Prediction with Application to Stereo Audio Frame Loss Concealment

The problem of side-signal frame loss concealment for Mid/Side (M/S) stereo audio is addressed in this paper. The proposed level ratio based inter and intra channel (LRIIC) prediction method is designed to overcome the main challenge posed by the diversified stereophonic nature of audio signals. To identify the stereophonic nature, we employ the time-varying level ratio between the mid (also called monophonic) signal and the side signal. In the first phase, one Wiener filter is designed from the monophonic signal and side signal of the previous frame, and the other is designed from the side signal of the previous two frames. The two filters allow reconstruction of the current lost frame with inter- and intra-channel prediction, respectively. In the second phase, the available current frame of the monophonic signal is used as the input of the inter-channel Wiener filter, while a long-term prediction filter is employed to find the periodic components of previous frames as the input of the intra-channel Wiener filter. Finally, a level ratio based linear combination of the inter- and intra-channel prediction outputs is employed to obtain the current lost side signal. Objective and subjective evaluation results for the proposed LRIIC approach, in comparison with existing techniques, all incorporated within a 3GPP AMR-WB+ decoder, provide evidence for gains across a variety of stereophonic signals.
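The Wiener-filter prediction idea behind LRIIC can be roughly illustrated as follows. This is a sketch, not the paper's actual filter design: the function names, the first-order model, and the decaying test signal are all illustrative assumptions; a practical concealment filter would use a higher order and the level-ratio combination described above.

```python
def autocorr(x, lag):
    """Biased sample autocorrelation of x at the given lag."""
    n = len(x)
    return sum(x[i] * x[i + lag] for i in range(n - lag)) / n

def wiener_order1(reference):
    """First-order Wiener predictor coefficient: a = r(1) / r(0)."""
    r0 = autocorr(reference, 0)
    r1 = autocorr(reference, 1)
    return r1 / r0 if r0 else 0.0

def predict(reference, length):
    """Extrapolate `length` samples past the reference frame."""
    a = wiener_order1(reference)
    out, prev = [], reference[-1]
    for _ in range(length):
        prev = a * prev
        out.append(prev)
    return out

# A geometrically decaying signal: the coefficient recovers the decay rate.
sig = [0.9 ** i for i in range(50)]
pred = predict(sig, 3)
```

For a signal decaying by a factor of 0.9 per sample, the estimated coefficient lands very close to 0.9, so the concealed samples continue the envelope of the lost frame.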

Yuhong Yang, Yanye Wang, Ruimin Hu, Hongjiang Yu, Li Gao, Song Wang
Depth Map Coding by Modeling the Locality and Local Correlation of View Synthesis Distortion in 3-D Video

We propose a depth map coding method by modeling the locality and local correlation of the view synthesis distortion (VSD) in three-dimensional (3D) video. Taking into account the local characteristics of both the depth map quantization error and the color video, we start by dividing the depth map into two kinds of regions: the error sensitive area (ESA), where the depth map quality is sensitive to quantization and the synthesized view quality is sensitive to depth map distortion, and the error resilient area (ERA) otherwise. The locality of the VSD is established by modeling the relationship between synthesized view distortion and depth map distortion for the different regions separately. Then the local correlation of the VSD is exploited in terms of synthesized view distortion propagated from other, non-corresponding regions of the depth map due to the correlation of regions during temporal/spatial prediction in depth map coding. The final VSD model is obtained by considering both the locality and the local correlation, and is implemented in the rate-distortion (RD) optimized mode selection for depth map coding. Experimental results show that our solution can achieve 12.67 % and 11.22 % bit rate savings as well as 1.07 dB and 1.94 dB improvements of synthesized view quality on average at high and low bit rates, respectively, compared with H.264/AVC.

Qiong Xue, Xuguang Lan, Meng Yang
Discriminative Feature Learning with an Optimal Pattern Model for Image Classification

The co-occurrence features learned through pattern mining methods have more discriminative power to separate images from other categories than individual low-level features. However, the “pattern explosion” problem involved in the mining process prevents its application in many visual tasks. In this paper, we propose a novel scheme to learn discriminative features based on a mined optimal pattern model. The proposed method deals with the “pattern explosion” problem from two aspects: (1) it uses selected weak semantic patches instead of grid patches to substantially reduce the database to mine; (2) the adopted optimal pattern model can produce compact and representative patterns, which make the resulting image code more effective and discriminative for classification. In our work, we apply the minimum description length (MDL) principle to mine the optimal pattern model. We evaluate the proposed method on two publicly available datasets (15-Scenes and Oxford-Flowers17) and the experimental results demonstrate its effectiveness.

Lijuan Liu, Yu Bao, Haojie Li, Xin Fan, Zhongxuan Luo
Sign Language Recognition Based on Trajectory Modeling with HMMs

Sign language recognition aims at interpreting and understanding sign language for the convenience of communication between the deaf and hearing people, which has broad social impact. The problem is challenging due to the large variations across different signers and the subtle differences between sign words. In this paper, we propose a new method for isolated sign language recognition based on trajectory modeling with hidden Markov models (HMMs). In our approach, we first normalize and re-sample the raw trajectory data and partition the trajectory into multiple segments. To represent each trajectory segment, we propose a new curve feature descriptor based on shape context. After that, a hidden Markov model is used to model each isolated sign word for recognition. To evaluate the performance of our algorithm, we have built a large isolated Chinese sign language vocabulary with Kinect 2.0. The dataset contains 100 unique isolated sign words, each of which is performed by 50 signers 5 times. Experimental results demonstrate that the proposed method achieves better performance than a conventional coordinate feature with HMMs.
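The trajectory pre-processing step described above (normalizing and re-sampling the raw data before segmentation) can be sketched as below. The function names and the arc-length re-sampling strategy are illustrative assumptions, not the paper's exact procedure:

```python
import math

def normalize(points):
    """Translate and scale a 2-D trajectory into the unit square,
    removing signer-position and body-size variation."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    scale = max(max(xs) - min(xs), max(ys) - min(ys)) or 1.0
    return [((x - min(xs)) / scale, (y - min(ys)) / scale) for x, y in points]

def resample(points, n):
    """Re-sample a polyline to n points evenly spaced by arc length,
    so trajectories of different speeds become comparable."""
    seg = [math.dist(points[i], points[i + 1]) for i in range(len(points) - 1)]
    step = sum(seg) / (n - 1)
    out, d, i = [points[0]], 0.0, 0
    while len(out) < n - 1:
        d += step
        # advance to the segment that contains the target arc length
        while i < len(seg) - 1 and d > sum(seg[:i + 1]):
            i += 1
        t = (d - sum(seg[:i])) / seg[i] if seg[i] else 0.0
        x0, y0 = points[i]
        x1, y1 = points[i + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    out.append(points[-1])
    return out
```

After this step, each fixed-length segment of the re-sampled trajectory can be described by a curve feature and fed to the per-word HMMs.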

Junfu Pu, Wengang Zhou, Jihai Zhang, Houqiang Li
MusicMixer: Automatic DJ System Considering Beat and Latent Topic Similarity

This paper presents MusicMixer, an automatic DJ system that mixes songs in a seamless manner. MusicMixer mixes songs based on audio similarity calculated via beat analysis and latent topic analysis of the chromatic signal in the audio. The topic represents latent semantics about how chromatic sounds are generated. Given a list of songs, a DJ selects a song whose beat and sounds are similar to a specific point of the currently playing song to seamlessly transition between songs. By calculating the similarity of all existing pairs of songs, the proposed system can retrieve the best mixing point from innumerable possibilities. Although it is comparatively easy to calculate beat similarity from audio signals, it has been difficult to consider the semantics of songs as a human DJ does. To consider such semantics, we propose a method to represent audio signals for constructing topic models that acquire the latent semantics of audio. The results of a subjective experiment demonstrate the effectiveness of the proposed latent semantic analysis method.
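The song-matching step can be illustrated with a toy similarity search over latent-topic mixtures. The cosine measure and all song names below are hypothetical stand-ins; MusicMixer's actual similarity combines beat analysis with the topic representation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_transition(current, candidates):
    """Pick the candidate song whose topic mixture is closest to `current`."""
    return max(candidates, key=lambda item: cosine(current, item[1]))

playing = [0.7, 0.2, 0.1]                      # latent-topic mixture of the current song
library = [("song_b", [0.6, 0.3, 0.1]),
           ("song_c", [0.1, 0.1, 0.8])]
name, vec = best_transition(playing, library)  # song_b is semantically closest
```

In the full system this comparison would run over every candidate mixing point of every pair of songs, not whole-song vectors.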

Tatsunori Hirai, Hironori Doi, Shigeo Morishima
Adaptive Synopsis of Non-Human Primates’ Surveillance Video Based on Behavior Classification

Non-human primates (NHPs) play a critical role in biomedical research. Automated monitoring and analysis of NHPs' behaviors through surveillance video can greatly support NHP-related studies. However, little research has been undertaken so far. There are two challenges in analyzing NHP surveillance video: the NHPs' behaviors lack regularity and intention, and serious occlusions are caused by the fences of the cages. In this paper, four typical NHP behaviors are defined based on requirements in pharmaceutical analysis. We design a novel feature set combining contextual attributes and local motion information to overcome the effects of occlusions. A hierarchical linear discriminant analysis (LDA) classifier is proposed to categorize the NHPs' behaviors. Based on the behavior classification, an adaptive synopsis algorithm is further proposed to condense NHP surveillance video, which offers a mechanism to retrieve any NHP behavior information corresponding to specified events or time periods in the surveillance video. Experimental results show the effectiveness of the proposed method in categorizing and condensing NHP surveillance video.

Dongqi Cai, Fei Su, Zhicheng Zhao
A Packet Scheduling Method for Multimedia QoS Provisioning

Size-based scheduling is an appealing solution for managing bottleneck links, as the interactive (short) flows of users are offered almost constant service times, whatever the level of congestion of the link. Size-based schedulers like LAS, Run2C or LARS offer different additional features, for instance the ability to protect low/medium-rate long-lasting multimedia transfers such as VoIP, in the case of LARS. However, all these solutions have a significant memory footprint, as they require keeping one state per flow, or alternatively modifying the TCP implementation of every end host. This constitutes a significant hindrance to the deployment of size-based scheduling in the wild. In this paper, we propose Early Flow Discard (EFD), a new size-based scheduler which keeps the salient properties of state-of-the-art size-based schedulers like LARS with a bounded memory requirement. We demonstrate its efficiency by comparing it with several size-based and size-oblivious schedulers. We further demonstrate that size-based scheduling offers an interesting solution to the so-called bufferbloat phenomenon. This is achieved with a completely different approach than the one advocated in PIE or CoDel, for instance, as EFD does not strive to keep the queue occupancy low but controls per-flow response times, which increase with flow size.
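The core mechanism of a size-based scheduler with per-flow byte counts can be sketched as a two-queue toy model. This is not the actual EFD algorithm: in particular, EFD's key contribution is bounding the memory of the flow table, whereas the dict below grows without bound, and the class name and threshold are illustrative:

```python
from collections import deque

class SizeBasedScheduler:
    """Toy two-queue size-based scheduler: packets of flows that have
    sent fewer than `threshold` bytes go to the high-priority queue,
    so short interactive flows are served ahead of long transfers."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.sent = {}            # per-flow byte count (bounded in real EFD)
        self.high = deque()
        self.low = deque()

    def enqueue(self, flow_id, size):
        total = self.sent.get(flow_id, 0)
        queue = self.high if total < self.threshold else self.low
        queue.append((flow_id, size))
        self.sent[flow_id] = total + size

    def dequeue(self):
        """Serve high-priority packets first; None when both queues are empty."""
        q = self.high if self.high else self.low
        return q.popleft() if q else None
```

A short flow arriving behind a long one is still dequeued before the long flow's later packets, which is exactly the property that keeps interactive response times flat under congestion.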

Jinbang Chen, Zhen Huang, Martin Heusse, Guillaume Urvoy-Keller
Robust Object Tracking Using Valid Fragments Selection

Local features are widely used in visual tracking to improve robustness in cases of partial occlusion, deformation and rotation. This paper proposes a local fragment-based object tracking algorithm. Unlike many existing fragment-based algorithms that allocate weights to each fragment, this method first defines discrimination and uniqueness for each local fragment and builds an automatic pre-selection of useful fragments for tracking. Then, a Harris-SIFT filter is used to choose the currently valid fragments, excluding occluded or highly deformed fragments. Based on those valid fragments, a fragment-based color histogram provides a structured and effective description of the object. Finally, the object is tracked using a valid fragment template combining the displacement constraint and the similarity of each valid fragment. The object template is updated by fusing feature similarity and valid fragments, which is scale-adaptive and robust to partial occlusion. The experimental results show that the proposed algorithm is accurate and robust in challenging scenarios.

Jin Zheng, Bo Li, Peng Tian, Gang Luo

Special Session Poster Papers

Frontmatter
Exploring Discriminative Views for 3D Object Retrieval

View-based 3D object retrieval techniques have become prevalent in various fields, and many ingenious studies have improved retrieval performance from different aspects. In this paper, we focus on the 2D projective views that represent 3D objects and propose a boosting approach that evaluates the discriminative ability of each object's views. Different from previous works on selecting representative views of the query object, we investigate the discriminative information of each view in the dataset. By employing the proposed reverse distance metric, we utilize the discriminative information for many-to-many view set matching. The proposed algorithm is then employed with various features to boost the multi-model graph learning method. We compare our approach with several state-of-the-art methods on the ETH-80 dataset and the National Taiwan University 3D model dataset. The results demonstrate the effectiveness of our method and its excellent boosting performance.

Dong Wang, Bin Wang, Sicheng Zhao, Hongxun Yao, Hong Liu
What Catches Your Eyes as You Move Around? On the Discovery of Interesting Regions in the Street

Interesting regions are defined as parts of street view scenes that can attract people's interest as they move along the road, and they play an important role in various daily-life scenarios. In this paper, we therefore propose a framework for locating interesting regions in the street and explore their potential use in advanced multimedia applications. Based on psychological findings and cognitive processes, we propose and quantify three properties for modeling interesting regions: attractiveness, uniqueness, and familiarity. Also, a spatial-temporal fusion scheme is developed to combine the multiple properties for discovering the presence of interesting regions. We conduct a set of user studies to demonstrate the effectiveness of our approach. The results show that most users agreed with the interesting regions found by the proposed approach. Finally, a novel application based on interesting regions is also presented for offering an improved navigation experience to vehicle drivers.

Heng-Yu Chi, Wen-Huang Cheng, Chuang-Wen You, Ming-Syan Chen
Bag Detection and Retrieval in Street Shots

In recent years, e-commerce has become an important way for people to shop. In particular, clothes and bags are extraordinarily important for customers. However, traditional online shopping modes only allow users to search with keywords. Sometimes users may find it very hard to precisely describe what they want in words. Moreover, even if a user gives a detailed description, it may not agree with the description provided by the seller. Therefore, search-by-image without the help of semantic descriptions has become a research focus in computer vision and multimedia processing. In this paper, we address the problem of object detection and retrieval, focusing particularly on bags in street shots. First, we locate the bag region in an image with a Pairwise Context based Convolutional Neural Network (PC-CNN). After that, we learn high-level descriptions of bag images based on attributes and build a retrieval system that allows searching by image. We test our approach on the publicly available Fashionista Benchmark (FB) and a Pedestrian with Bags (PB) dataset collected by ourselves to demonstrate the effectiveness of the proposed method.

Chong Cao, Yuning Du, Haizhou Ai
TV Commercial Detection Using Success Based Locally Weighted Kernel Combination

Classification problems using multiple kernel learning (MKL) algorithms achieve superior performance by using a weighted combination of base kernels over feature sub-sets. Each of the base kernels is characterized by a similarity measure defined over a feature sub-set. Existing works in MKL have mostly used fixed weights, which are shown to be related to the overall discriminative capability of the corresponding base kernels. We argue that the class discrimination ability of a kernel is a local phenomenon and thus advocate the necessity of using instance-dependent functions for weighting the kernels. We propose a new framework for learning such weighting functions, linked to the ability of kernels to discriminate in local regions of the feature space. During training, we first identify the regions of success in the feature sub-spaces, where the base kernels have a high likelihood of success. These regions are identified by evaluating the performance of support vector machines (SVM) trained using the corresponding (single) base kernels. The weighting functions are then estimated using support vector regression (SVR). The target for the SVRs is set to 1.0 for successfully classified patterns and to 0.0 otherwise. The second contribution of this work is the construction and public release of a commercial detection dataset of 150 hours, acquired from 5 different TV news channels. Empirical results on 8 standard datasets and our own TV commercial detection dataset show the superiority of the proposed multiple kernel learning scheme.

Raghvendra Kannao, Prithwijit Guha
Frame-Wise Continuity-Based Video Summarization and Stretching

This paper describes a method for freely changing the length of a video clip, leaving its content almost unchanged, by removing video frames considering both audio and video transitions. In a video clip that contains many video frames, there are less important frames that only extend the length of the clip. Taking the continuity of audio and video frames into account, the method enables changing the length of a video clip by removing or inserting frames that do not significantly affect the content. Our method can be used to change the length of a clip without changing the playback speed. Subjective experimental results demonstrate the effectiveness of our method in preserving the clip content.

Tatsunori Hirai, Shigeo Morishima
Respiration Motion State Estimation on 4D CT Rib Cage Images

The respiration motion state is an important indicator for disease diagnosis in clinical practice. In this paper, we approach this problem with 4D CT rib cage images and aim at identifying the end-inhalation and end-exhalation phases. Observing that the motion of the rib bones well reflects the respiration motion state, we transform this problem into a rib bone segmentation problem. Firstly, we propose a novel steerable-filter-enhanced level set method for rib bone segmentation. We formulate the level set segmentation problem as a variational optimization problem. To address the blurry edge issue, we enhance the image with the classic steerable filter. After that, by comparing the positions of the rib bones in sequential frames, we present a criterion to determine the end-expiration and end-inspiration phases. We validate our approach with real 4D CT rib cage images and demonstrate its effectiveness.

Chao Xie, Wengang Zhou, Weiping Ding, Houqiang Li, Weiping Li
Location-Aware Image Classification

Currently, the most popular image classification methods are based on global image representations. They face an obvious contradiction between the uncertainty of object position and the global image representation. In this paper, we propose a novel location-aware image classification framework to address this problem. In our framework, an image is classified based on local image representations, and the classifier is learned using iterative multi-instance learning with a latent SVM, i.e., we infer object location using the latent SVM to improve image classification. Our method is very efficient and outperforms the popular spatial pyramid matching (SPM) method and the Region Based Latent SVM (RBLSVM) method [1] on the challenging PASCAL VOC dataset.

Xinggang Wang, Xin Yang, Wenyu Liu, Chen Duan, Longin Jan Latecki
Enhancement for Dust-Sand Storm Images

A novel dust-sand storm image enhancement scheme is proposed. The input degraded color image is first converted into the CIELAB color space. Then the two chromatic components (a* and b*) are combined to perform color cast correction and saturation stretching. Meanwhile, fast Local Laplacian Filtering is applied to the lightness component (L*) to enhance details. Experimental results illustrate that the enhanced images have natural colors, clearer details and better visual effect than the original degraded images.

Jian Wang, Yanwei Pang, Yuqing He, Changshu Liu
Using Instagram Picture Features to Predict Users’ Personality

Instagram is a popular social networking application that allows photo sharing and applying different photo filters to adjust the appearance of a picture. By applying these filters, users are able to create a style that they want to express to their audience. In this study we tried to infer personality traits from the way users manipulate the appearance of their pictures by applying filters to them. To investigate this relationship, we studied the relationship between picture features and personality traits. To collect data, we conducted an online survey where we asked participants to fill in a personality questionnaire and grant us access to their Instagram account through the Instagram API. Among 113 participants and 22,398 extracted Instagram pictures, we found distinct picture features (e.g., relevant to hue, brightness, saturation) that are related to personality traits. Our findings suggest a relationship between personality traits and these picture features. Based on our findings, we also show that personality traits can be accurately predicted. This allows for new ways to extract personality traits from social media trails and new ways to facilitate personalized systems.
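Extracting simple picture features such as mean hue, saturation, and brightness — the kind of low-level features the study correlates with personality traits — might look like the following minimal sketch. The pixel-list input format is an assumption for illustration; the study's actual feature set is richer:

```python
import colorsys

def picture_features(pixels):
    """Mean hue, saturation, and brightness (value) of an image given as
    a list of (r, g, b) tuples in [0, 255]."""
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    n = len(hsv)
    return tuple(sum(c[i] for c in hsv) / n for i in range(3))

# A fully saturated, fully bright red patch: hue 0.0, saturation 1.0, value 1.0.
h, s, v = picture_features([(255, 0, 0)] * 4)
```

Features like these, averaged per user over all their pictures, would then serve as predictors in a regression against the personality questionnaire scores.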

Bruce Ferwerda, Markus Schedl, Marko Tkalcic
Extracting Visual Knowledge from the Internet: Making Sense of Image Data

Recent successes in visual recognition can primarily be attributed to feature representation, learning algorithms, and the ever-increasing size of labeled training data. Extensive research has been devoted to the first two, but much less attention has been paid to the third. Due to the high cost of manual data labeling, the size of recent efforts such as ImageNet is still relatively small with respect to daily applications. In this work, we mainly focus on how to automatically generate image data identifying a given visual concept on a vast scale. With the generated image data, we can train a robust recognition model for the given concept. We evaluate the proposed webly supervised approach on the benchmark Pascal VOC 2007 dataset and the results demonstrate the superiority of our method over many other state-of-the-art methods in image data collection.

Yazhou Yao, Jian Zhang, Xian-Sheng Hua, Fumin Shen, Zhenmin Tang
Ordering of Visual Descriptors in a Classifier Cascade Towards Improved Video Concept Detection

Concept detection for semantic annotation of video fragments (e.g. keyframes) is a popular and challenging problem. A variety of visual features is typically extracted and combined in order to learn the relation between feature-based keyframe representations and semantic concepts. In recent years the available pool of features has increased rapidly, and features based on deep convolutional neural networks in combination with other visual descriptors have significantly contributed to improved concept detection accuracy. This work proposes an algorithm that dynamically selects, orders and combines many base classifiers, trained independently with different feature-based keyframe representations, in a cascade architecture for video concept detection. The proposed cascade is more accurate and computationally more efficient, in terms of classifier evaluations, than state-of-the-art classifier combination approaches.
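A classifier cascade of the general kind described here can be sketched as follows. The confidence thresholds and early-exit rule below are generic illustrative assumptions, not the proposed selection and ordering algorithm; the point is only that cheap confident stages let the cascade skip later classifier evaluations:

```python
def cascade_predict(classifiers, thresholds, x):
    """Evaluate base classifiers in order; stop as soon as one is confident.
    Each classifier maps x to a score in [0, 1]; a score >= t is a
    confident accept, a score <= 1 - t a confident reject."""
    score = 0.5
    for clf, t in zip(classifiers, thresholds):
        score = clf(x)
        if score >= t:
            return True, score       # confident accept: stop the cascade
        if score <= 1 - t:
            return False, score      # confident reject: stop the cascade
    return score >= 0.5, score       # fell through: decide on the last score

# Toy stages: a weak first stage defers, a stronger second stage accepts.
accept, score = cascade_predict([lambda x: 0.2, lambda x: 0.95],
                                [0.9, 0.9], "keyframe")
```

Efficiency comes from ordering: placing classifiers that are both cheap and frequently confident early means most keyframes never reach the expensive stages.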

Foteini Markatopoulou, Vasileios Mezaris, Ioannis Patras
Spatial Constrained Fine-Grained Color Name for Person Re-identification

Person re-identification is a key technique for matching different persons observed in non-overlapping camera views. It is a challenging problem due to the huge intra-class variations caused by illumination, poses, viewpoints, occlusions and so on. To address these issues, researchers have proposed many visual descriptors. However, these visual features may be unstable in complicated environments. In contrast, semantic features can be a good supplement to visual feature descriptors owing to their robustness. As a kind of representative semantic feature, the color name is utilized in this paper. A color name is a semantic description of an image and shows good robustness to photometric variations. Traditional color name based methods are limited in discriminative power due to the finite number of color categories, only 11 or 16 kinds. In this paper, a new fine-grained color name approach based on the bag-of-words model is proposed. Moreover, spatial information, with its advantage in strengthening constraints among features in varying environments, is further applied to optimize our method. Extensive experiments conducted on benchmark datasets have shown the clear superiority of the proposed method.

Yang Yang, Yuhong Yang, Mang Ye, Wenxin Huang, Zheng Wang, Chao Liang, Lei Yao, Chunjie Zhang
Dealing with Ambiguous Queries in Multimodal Video Retrieval

Dealing with ambiguous queries is an important challenge in information retrieval (IR). While this problem is well understood in text retrieval, this is not the case in video retrieval, especially when multimodal queries have to be considered, as for instance in Query-by-Example or Query-by-Sketch. Systems supporting such query types usually consider dedicated features for the different modalities. These can be intrinsic object features like color, edge, or texture for the visual modality, or motion for the kinesthetic modality. Sketch-based queries are naturally inclined to be ambiguous as they lack specification in some information channels. In this case, the IR system has to deal with the lack of information in a query, as it cannot deduce whether this information should be absent in the result or whether it has simply not been specified, and it needs to properly select the features to be considered. In this paper, we present an approach that deals with such ambiguous queries in sketch-based multimodal video retrieval. This approach anticipates the intent(s) of a user based on the information specified in a query and accordingly selects the features to be considered for query execution. We have evaluated our approach based on Cineast, a sketch-based video retrieval system. The evaluation results show that disregarding certain features based on the anticipated query intent(s) can lead to an increase in retrieval quality of more than 25 % over a generic query execution strategy.

Luca Rossetto, Claudiu Tănase, Heiko Schuldt
Collaborative Q-Learning Based Routing Control in Unstructured P2P Networks

Query routing among peers while locating required resources is still an acute issue in P2P networking, especially in unstructured P2P networks. The issue becomes worse when peers frequently join and leave the network, and also with node failures. We propose a new method to assure alternative routing paths that balance query loads among peers under high network churn. The proposed collaborative Q-learning method learns network parameters such as processing capacity, number of connections, and number of resources at the peers, along with their state of congestion. With this technique, peers avoid forwarding queries to congested peers. Our simulation results show that the required resources are located more quickly and query loads across the whole network are also balanced. Moreover, our proposed protocol exhibits more robustness and adaptability under high network churn and heavy workloads than the random walk method.
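The learning core of such routing can be illustrated with the standard Q-learning update; the reward definition, the epsilon-greedy forwarding choice, and all peer names below are generic assumptions for illustration, not the paper's exact protocol:

```python
import random

def update_q(q, peer, neighbor, reward, next_best, alpha=0.5, gamma=0.9):
    """Standard Q-learning step for one routing decision:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Here s is the current peer and a the neighbor the query was forwarded to."""
    key = (peer, neighbor)
    old = q.get(key, 0.0)
    q[key] = old + alpha * (reward + gamma * next_best - old)

def choose_neighbor(q, peer, neighbors, epsilon=0.1):
    """Epsilon-greedy forwarding: usually pick the highest-valued neighbor,
    occasionally explore, so congested peers gradually lose traffic."""
    if random.random() < epsilon:
        return random.choice(neighbors)
    return max(neighbors, key=lambda n: q.get((peer, n), 0.0))

q = {}
# A query forwarded to p2 succeeded (reward 1.0) at a terminal hop.
update_q(q, "p1", "p2", reward=1.0, next_best=0.0)
```

In a collaborative variant, the reward would additionally fold in the neighbor's reported capacity and congestion, steering queries away from overloaded peers.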

Xiang-Jun Shen, Qing Chang, Jian-Ping Gou, Qi-Rong Mao, Zheng-Jun Zha, Ke Lu
Backmatter
Metadata
Title
MultiMedia Modeling
edited by
Qi Tian
Nicu Sebe
Guo-Jun Qi
Benoit Huet
Richang Hong
Xueliang Liu
Copyright Year
2016
Electronic ISBN
978-3-319-27671-7
Print ISBN
978-3-319-27670-0
DOI
https://doi.org/10.1007/978-3-319-27671-7