
About this book

The two-volume set LNCS 8935 and 8936 constitutes the thoroughly refereed proceedings of the 21st International Conference on Multimedia Modeling, MMM 2015, held in Sydney, Australia, in January 2015. The 49 revised regular papers and 24 poster presentations were carefully reviewed and selected from 189 submissions. For the three special sessions, a total of 18 papers were accepted for MMM 2015. The three special sessions are Personal (Big) Data Modeling for Information Access and Retrieval, Social Geo-Media Analytics and Retrieval, and Image or Video Processing, Semantic Analysis and Understanding. In addition, 9 demonstrations and 9 video showcase papers were accepted for MMM 2015. The accepted contributions included in these two volumes represent the state of the art in multimedia modeling research and cover a diverse range of topics, including image and video processing, multimedia encoding and streaming, applications of multimedia modeling, and 3D and augmented reality.

Table of Contents


Image and Video Processing

An Efficient Hybrid Steganography Method Based on Edge Adaptive and Tree Based Parity Check

A major requirement for any steganography method is to minimize the changes that the data embedding process introduces to the cover image. Since the Human Visual System (HVS) is less sensitive to changes in sharp regions than in smooth regions, edge adaptive methods have been proposed to discover edge regions, enhancing the quality of the stego image as well as improving the embedding capacity. However, edge adaptive methods do not apply any coding scheme, and hence their embedding efficiency may not be optimal. In this paper, we propose a method that enhances edge adaptive embedding by incorporating the Tree-Based Parity Check (TBPC) algorithm, a well-established coding-based steganography method. This combination not only identifies potential pixels for embedding but also enhances the embedding efficiency through an efficient coding mechanism. More specifically, the method identifies embedding locations according to the difference value between every two adjacent pixels that form a block in the cover image, and the number of bits embedded in each block is determined by the difference between its two pixels. The incorporation of TBPC minimizes modifications to the cover image, as it changes no more than two bits out of seven pixel bits when embedding four secret bits. Experimental results show that the proposed scheme achieves both a large embedding payload and high embedding efficiency.
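The block-based selection step described above can be sketched as follows; the threshold and per-block bit capacities below are illustrative assumptions for the edge-adaptive idea, not the paper's actual parameters.

```python
def embedding_capacity(pixel_a, pixel_b, threshold=16):
    """Toy edge-adaptive rule: blocks with a larger inter-pixel difference
    (edge regions) may carry more secret bits. The threshold and the
    bit counts are illustrative choices, not from the paper."""
    diff = abs(int(pixel_a) - int(pixel_b))
    return 4 if diff >= threshold else 2

def select_blocks(row, threshold=16):
    """Pair adjacent pixels of a cover-image row into blocks and report
    the number of secret bits each block could carry."""
    return [((row[i], row[i + 1]),
             embedding_capacity(row[i], row[i + 1], threshold))
            for i in range(0, len(row) - 1, 2)]
```

A block straddling an edge (large difference) is assigned a higher capacity than a block in a smooth region.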

Hayat Al-Dmour, Noman Ali, Ahmed Al-Ani

Secure Client Side Watermarking with Limited Key Size

In this paper, a novel secure client-side watermarking scheme is proposed. We illustrate why a traditional watermarking scheme may not be suitable for a large digital-content broadcasting service and how the proposed approach fixes the associated problems. Since the newly proposed scheme supports both visible and invisible watermarks, it is applicable to cloud-based digital-content broadcasting services such as the Amazon and Google bookstores, Apple's iTunes, and Netflix's streaming media.

Jia-Hao Sun, Yu-Hsun Lin, Ja-Ling Wu

Orderless and Blurred Visual Tracking via Spatio-temporal Context

In this paper, a novel and robust method that exploits the spatio-temporal context for orderless and blurred visual tracking is presented. This lets the tracker adapt online to both rigid and deformable objects, even when the image is blurred. We observe that an RGB vector of an image resized to a small fixed size retains enough useful information. Based on this observation, and for computational reasons, we propose to resize the windows of both the template and candidate target images to 2×2 and use the Euclidean distance between the two resulting RGB image vectors for preliminary screening. We then apply spatio-temporal context in a Bayesian framework to compute a confidence map and obtain the best target location. Experimental results on challenging video sequences, implemented in MATLAB without code optimization, show that the proposed tracking method outperforms eight state-of-the-art methods.
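The preliminary screening step lends itself to a short sketch; the block-averaging resize below is one plausible implementation of the 2×2 downsizing the abstract describes, not the authors' exact code.

```python
import numpy as np

def to_small_vector(image, size=2):
    """Resize an RGB image to size x size by block averaging, then flatten."""
    h, w, c = image.shape
    bh, bw = h // size, w // size
    cropped = image[:bh * size, :bw * size].astype(float)
    small = cropped.reshape(size, bh, size, bw, c).mean(axis=(1, 3))
    return small.reshape(-1)

def screening_distance(template, candidate):
    """Euclidean distance between the 2x2 RGB vectors of two image windows."""
    return float(np.linalg.norm(to_small_vector(template) - to_small_vector(candidate)))
```

Candidate windows whose distance to the template exceeds some threshold would be discarded before the more expensive confidence-map computation.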

Manna Dai, Peijie Lin, Lijun Wu, Zhicong Chen, Songlin Lai, Jie Zhang, Shuying Cheng, Xiangjian He

Coupled Discriminant Multi-Manifold Analysis with Application to Low-Resolution Face Recognition

The problem of matching a low-resolution (LR) face image to a gallery of high-resolution (HR) face images is addressed in this paper. Previous research has focused on introducing a learning-based super-resolution (LBSR) method before matching, or on transforming LR and HR faces into a unified feature space (UFS) for matching. To identify LR faces, we present a method called coupled discriminant multi-manifold analysis (CDMMA). In CDMMA, we first explore the neighborhood information as well as the local geometric structure of the multi-manifold space spanned by the samples. We then explicitly learn two mappings that project LR and HR faces to a unified discriminative feature space (UDFS) in a supervised manner, where the discriminative information is maximized for classification. After that, a conventional classification method is applied for final identification. Experimental results on two standard face recognition databases demonstrate the superiority of the proposed CDMMA.

Junjun Jiang, Ruimin Hu, Zhen Han, Liang Chen, Jun Chen

Text Detection in Natural Images Using Localized Stroke Width Transform

How to effectively and efficiently detect text in natural scene images is a challenging problem. This paper presents a novel text detection method using a localized stroke width transform. Owing to an adaptive image binarization approach and the application of the stroke width transform in local regions, our method markedly reduces the required contrast between text and background and is considerably more robust to imperfect edge detection results. Experiments on the dataset of the ICDAR 2013 robust reading competition demonstrate that the proposed method outperforms other state-of-the-art approaches for text detection in natural scene images.

Wenyan Dong, Zhouhui Lian, Yingmin Tang, Jianguo Xiao

Moving Object Tracking with Structure Complexity Coefficients

Target appearance change during tracking is a persistent challenge for visual object tracking. In this paper, we present a novel visual object tracking algorithm based on Structure Complexity Coefficients (SCC) that addresses the motion-related appearance change problem at its root. Our analysis shows that motion-related appearance change is closely tied to the SCC of the target surface: under target motion, the appearance of complex structural regions changes more readily than that of smooth structural regions. With the proposed SCC, an SCC-GL distance is defined to address both appearance change and occlusion during tracking. Moreover, an Observation Dependent Hidden Markov Model (OD-HMM) framework is designed in which the observation dependency between neighboring frames is considered, in contrast to the standard HMM-based tracking framework; this observation dependency is computed with the proposed SCC. We also present a novel outlier removal method for appearance model updating that avoids error accumulation. Experimental results on various challenging video sequences demonstrate that the proposed observation dependent tracker achieves better performance than existing related tracking algorithms.

Yuan Yuan, Yuming Fang, Lin Weisi

Real-Time People Counting across Spatially Adjacent Non-overlapping Camera Views

Counting the number of people traveling across non-overlapping camera views generally requires every person exiting a camera view to be re-identified upon re-entering one of its spatially adjacent camera views. For accurate re-identification, the correspondence among the exits and entries of all persons should be established so that their total correspondence confidence is maximized. To realize real-time people counting, we propose to find the shortest time window in which both the exits and entries of all persons traveling within that window can be observed, adapting it to the current people traffic flow. Further, since closely related people often travel together, re-identification can be performed on foreground regions to re-identify groups of people. Since groups of people can split or merge outside the camera views, the proposed method establishes a weighted correspondence among the exits and entries of the foreground regions based on their correspondence confidence. Experimental results show that the adaptively determined time window is effective in terms of both accuracy and delay in people counting, and that the weighted correspondence is effective in terms of accuracy, especially when traffic becomes congested and groups of people split or merge outside the camera views.
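Maximizing total correspondence confidence over exits and entries is an assignment problem; a brute-force toy version, with an invented confidence matrix, might look like this (practical systems would use a proper assignment solver).

```python
from itertools import permutations

def best_correspondence(conf):
    """Brute-force the exit-to-entry assignment that maximizes total
    correspondence confidence (fine for small numbers of people)."""
    n = len(conf)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(conf[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_perm, best_total

# Hypothetical confidence matrix: rows are exits, columns are entries.
confidence = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.3, 0.7],
]
```

Here exit i is matched to entry perm[i] so that the summed confidence is greatest.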

Ryota Akai, Naoko Nitta, Noboru Babaguchi

Binary Code Learning via Iterative Distance Adjustment

Binary code learning techniques have recently been actively studied for hashing-based nearest neighbor search in computer vision applications because they improve hashing performance. Current hashing-based methods can obtain good binary codes, but some data points may still be mapped to inappropriate Hamming codes. To address this issue, this paper proposes a novel binary code learning method based on iterative distance adjustment that improves traditional hashing methods: we utilize very short additional binary bits to correct the spatial relationships among data points and thus enhance the similarity-preserving power of the binary codes. We carry out image retrieval experiments on well-recognized benchmark datasets to validate the proposed method. The experimental results show that the proposed method achieves better hashing performance than state-of-the-art binary code learning methods.
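The idea of appending a few extra bits to separate points that collide in Hamming space can be illustrated as follows; the base codes and correction bits are invented for illustration, not learned as in the paper.

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.count_nonzero(np.asarray(a) != np.asarray(b)))

# Invented example: two distinct data points mapped to the same 8-bit
# base code ("inappropriate" Hamming codes) are separated by appending
# two per-point correction bits.
base_a = [1, 0, 1, 1, 0, 0, 1, 0]
base_b = [1, 0, 1, 1, 0, 0, 1, 0]
code_a = base_a + [0, 0]
code_b = base_b + [1, 1]
```

After the extension, the two points are no longer indistinguishable under the Hamming metric, at the cost of only two additional bits.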

Zhen-fei Ju, Xiao-jiao Mao, Ning Li, Yu-bin Yang

What Image Classifiers Really See – Visualizing Bag-of-Visual Words Models

Bag-of-Visual-Words (BoVW) features, which quantize and count local gradient distributions in images much like counting words in texts, have proven to be powerful image representations. In combination with supervised machine learning approaches, models for nearly every visual concept can be learned. BoVW feature extraction, however, cascades multiple stages of local feature detection and extraction, vector quantization, and nearest neighbor assignment, which makes interpretation of the obtained image features, and thus of the overall classification results, very difficult. In this work, we present an approach that provides an intuitive heat map-like visualization of the influence each image pixel has on the overall classification result. We compare three different classifiers (AdaBoost, Random Forest, and linear SVM) trained on the Caltech-101 benchmark dataset, based on their individual classification performance and the generated model visualizations. The obtained visualizations not only allow intuitive interpretation of the classification results but also help to identify sources of misclassification due to badly chosen training examples.

Christian Hentschel, Harald Sack

Coupled-View Based Ranking Optimization for Person Re-identification

Person re-identification aims to match persons observed in non-overlapping camera views. Researchers have proposed many person descriptors based on global or local descriptions; although both achieve satisfactory matching results, their ranking lists often vary considerably for the same query person. This motivates us to investigate an approach that aggregates them to optimize the original matching results. In this paper, we propose a coupled-view based ranking optimization method that revises the original ranking lists through cross-KNN rank aggregation and graph-based re-ranking. Its core assumption is that images of the same person should share similar visual appearance in both the global and local views. Extensive experiments on two datasets show the superiority of our proposed method, with an average improvement of 20-30% over state-of-the-art methods at CMC@1.

Mang Ye, Jun Chen, Qingming Leng, Chao Liang, Zheng Wang, Kaimin Sun

Wireless Video Surveillance System Based on Incremental Learning Face Detection

As an important supplement to wired video in video surveillance applications, wireless video has attracted increasing attention and has been extensively applied in projects such as “Safe City”. Whether deployed in taxis, buses, emergency command vehicles, or temporary monitoring points, such systems produce massive amounts of surveillance video. To retrieve and browse these videos efficiently, key-frame video browsing systems based on face detection have been adopted and promoted. Face detection is widely studied and used in many practical applications; however, the varying characteristics of experimental and application settings, such as orientation, pose, and illumination, pose challenges that obstruct practical usage. To support a wireless video browsing system in practice, this paper proposes an incrementally learned face detection method based on automatically captured samples. Experiments demonstrate that our proposed incremental learning algorithm achieves favorable face detection performance and works well in the proposed system.

Wenjuan Liao, Dingheng Zeng, Liguo Zhou, Shizheng Wang, Huicai Zhong

An Automatic Rib Segmentation Method on X-Ray Radiographs

In this paper, an automatic rib recognition method based on image processing and data mining is presented. First, multiple template matching and graph-based methods are used to detect the rib center line; then, a support vector machine is used to build a rib relative-position model and identify erroneous recognition results; finally, decision trees are employed to refine the center line recognition result. The JSRT database is used to test our method. Rib recognition achieves over 92% sensitivity and 98% specificity.
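For reference, the reported rates follow the standard definitions of sensitivity and specificity; the counts below are illustrative, chosen only to reproduce the reported 92% and 98% figures, and are not from the paper.

```python
def sensitivity(true_pos, false_neg):
    """True positive rate: fraction of actual ribs correctly recognized."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """True negative rate: fraction of non-rib regions correctly rejected."""
    return true_neg / (true_neg + false_pos)

# Illustrative counts: 92 of 100 ribs detected, 98 of 100 non-rib
# regions rejected, matching the reported rates.
```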

Xuechen Li, Suhuai Luo, Qingmao Hu

Content-Based Discovery of Multiple Structures from Episodes of Recurrent TV Programs Based on Grammatical Inference

TV program structuring is essential for program indexing and retrieval. In practice, the variety of program types leads to a diversity of program structures; moreover, several episodes of a recurrent program might exhibit different structures. Previous work mostly relies on supervised approaches that adopt prior knowledge about program structures. In this paper, we address the problem of unsupervised program structuring with minimal domain knowledge. We propose an approach to identify multiple structures and infer structural grammars for recurrent TV programs of different types. It involves three sub-problems: i) we determine the structural elements contained in programs with minimal knowledge about which types of elements may be present; ii) we identify multiple structures for the programs, if any, and model those structures; iii) we generate the structural grammar for each corresponding structure. Finally, we conduct use cases on real recurrent programs of three different types to demonstrate the effectiveness of the proposed approach.

Bingqing Qu, Félicien Vallet, Jean Carrive, Guillaume Gravier

FOCUSING PATCH: Automatic Photorealistic Deblurring for Facial Images by Patch-Based Color Transfer

Facial image synthesis creates blurred facial images almost without high-frequency components, resulting in flat edges. Moreover, the synthesis process results in inconsistent facial images, such as the conditions where the white part of the eye is tinged with the color of the iris and the nasal cavity is tinged with the skin color. Therefore, we propose a method that can deblur an inconsistent synthesized facial image, including strong blurs created by common image morphing methods, and synthesize photographic quality facial images as clear as an image captured by a camera. Our system uses two original algorithms: patch color transfer and patch-optimized visio-lization. Patch color transfer can normalize facial luminance values with high precision, and patch-optimized visio-lization can synthesize a deblurred, photographic quality facial image. The advantages of our method are that it enables the reconstruction of the high-frequency components (concavo-convex) of human skin and removes strong blurs by employing only the input images used for original image morphing.

Masahide Kawai, Shigeo Morishima

Efficient Compression of Hyperspectral Images Using Optimal Compression Cube and Image Plane


Hyperspectral (HS) images (HSI) provide a vast amount of spatial and spectral information based on the high dimensionality of the pixels over a wide range of wavelengths. A HS image usually requires massive storage capacity, which demands high compression rates to save space while preserving data integrity. A HS image can be viewed as a three-dimensional data cube in which the wavelengths form the third dimension alongside the two spatial dimensions. To obtain a better compression result, the redundancy of HS images can be exploited using different coders along the spatial or spectral direction. This article focuses on taking maximum advantage of HS image redundancy by rearranging the HS image into different 3D data cubes, and proposes a directionlet-based compression scheme built on the optimal compression plane (OCP) for adaptive best approximation of the geometric matrix. The OCP, calculated from the spectral correlation, is used to predict and determine which reconstructed plane can reach higher compression rates while minimizing loss of hyperspectral data. Moreover, we also rearrange the 3D data cube into different 2D image planes and investigate the compression ratio using different coders. The scheme can be used for both lossless and lossy compression. Our experimental results show that the new framework optimizes compression performance for a number of coding methods (including lossless/lossy HEVC, Motion JPEG, JPEG 2000, and JPEG) on HSIs with different visual content.

Rui Xiao, Manoranjan Paul

Automatic Chinese Personality Recognition Based on Prosodic Features

Many studies based on English, French, and German have examined the relationship between personality and speech and drawn relevant conclusions. Because Chinese differs from these languages in its acoustic and pronunciation characteristics, and Chinese personalities differ from Western ones, we investigate personality prediction for Chinese speakers. In this study, we collected 1936 speech samples and corresponding Big Five questionnaires from 78 Chinese participants, and built models for males and females using prosodic features such as pitch, intensity, formants, and speaking rate. The experimental results show that: (1) the third formant is as effective as the first two in predicting personality; (2) combining pitch, intensity, formants, and speaking rate as classification parameters achieves higher classification accuracy (more than 80%) than any single prosodic feature.

Huan Zhao, Zeying Yang, Zuo Chen, Xixiang Zhang

Robust Attribute-Based Visual Recognition Using Discriminative Latent Representation

Recent work in visual recognition has addressed attribute-based classification. However, semantic attributes that are designed and labeled by humans generally contain some noise, are weakly learnable by classifiers, and discriminate poorly between categories. As a complement to semantic attributes, data-driven attributes learned from training data are ineffective for novel-category classification with no or few samples. In this paper, we introduce the Discriminative Latent Attribute (DLA) as a mid-level representation that connects both visual low-level features and semantic attributes through matrix factorization. Furthermore, we propose a novel unified formulation to efficiently train the category-DLA matrix and the attribute classifiers together, which makes DLA more learnable and more discriminative between categories. Our experiments show the effectiveness and robustness of our approach, which outperforms the state-of-the-art approach on the zero-shot learning task.

Yuqi Wang, Yunfei Gong, Qiang Liu

An Analysis of Time Drift in Hand-Held Recording Devices

Automatic synchronization of audio and video recordings of events like music concerts, sports, or speeches, gathered from heterogeneous sources like smartphones and digital cameras, is an interesting topic with many promising use cases. Many methods have been published, but unfortunately none of them takes time drift into account. Time drift is inherent in every recording device, resulting from random and systematic errors in oscillators. This effect causes audio and video sampling rates to deviate from their nominal rates, effectively leading to different playback speeds of parallel recordings, with deltas measured up to 60 ms/min. In this paper, we present experiments and measurements showing that time drift is a real problem that cannot be ignored when good-quality results are demanded. It therefore needs to be accounted for in future synchronization methods and algorithms.
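The practical impact of the reported worst-case drift of 60 ms/min is easy to quantify; the nominal sampling rate below is an assumption for illustration, not a figure from the paper.

```python
NOMINAL_RATE = 44100.0      # Hz; assumed nominal audio sampling rate
DRIFT_MS_PER_MIN = 60.0     # worst-case delta reported in the abstract

# 60 ms of drift per 60 s of recording is a 0.1% playback-speed difference.
drift_ratio = (DRIFT_MS_PER_MIN / 1000.0) / 60.0
effective_rate = NOMINAL_RATE * (1.0 + drift_ratio)  # deviated sampling rate

def offset_after(minutes, ms_per_min=DRIFT_MS_PER_MIN):
    """Accumulated offset in milliseconds between two parallel recordings."""
    return minutes * ms_per_min
```

After a ten-minute recording, two such devices can already be off by well over half a second, far beyond lip-sync tolerance.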

Mario Guggenberger, Mathias Lux, Laszlo Böszörmenyi

A Real-Time People Counting Approach in Indoor Environment

Due to complex background information, shadows, and occlusions, it is difficult to count people accurately. In this paper, we propose a fast and robust people counting approach for indoor spaces. First, we use foreground object extraction to remove background information; to capture both moving and stationary people, we design a block-updating strategy for the background model. Second, we train a multi-view head-shoulder model to find candidate people, and an improved k-means clustering is proposed to locate the position of each person. Finally, a temporal filter with frame differencing is used to refine the counting results and detect noise such as double counting and random disturbances. An indoor people dataset was recorded in a classroom at our university. Experiments and comparisons show the promise of the proposed approach.

Jun Luo, Jinqiao Wang, Huazhong Xu, Hanqing Lu

Multi-instance Feature Learning Based on Sparse Representation for Facial Expression Recognition

Sparse representation is usually adopted to learn the intrinsic structure of label spaces for recognition tasks. In this paper, we propose a feature learning scheme based on sparse representation and validate its effectiveness by treating facial expression recognition as a multi-instance learning problem. By introducing a sparsity-inducing regularization constraint, the proposed model learns instance-specific features based on label variance information. We propose two schemes for denoting the label variance in multi-instance facial expression recognition. Experimental analysis shows that the sparse constraint is useful for feature learning when the label variance is properly expressed and utilized. With the proposed multi-instance feature learning based on sparse representation, we successfully obtain stable structure in the feature spaces.

Yuchun Fang, Lu Chang

Object Detection in Low-Resolution Image via Sparse Representation

We propose a novel object detection framework for extreme low-resolution (LR) images via sparse representation. Object detection in extreme LR images is important for applications such as abnormal event detection and automatic criminal investigation from surveillance videos. Object detection has made much progress in computer vision, but it remains challenging in LR images because discriminative features available at high resolution usually disappear at low resolution, and the precision of a detector drops by a large margin in LR. Our model uses sparse coding of part filters to represent each filter as a sparse linear combination of shared dictionary elements. The main contributions of this paper are: 1) an object detection framework for extreme LR that detects objects in the reconstructed HR image; 2) a mapping function from LR patches to high-resolution (HR) patches learned by a local regression algorithm called sparse support regression, which can be constructed from the support bases of the LR-HR dictionary; 3) a novel feature extraction method that accelerates detection by extracting visual features from HR dictionary atoms. Our approach produces better object detection performance than state-of-the-art methods. Tests on the INRIA and PASCAL VOC 2007 datasets reveal similar improvements, suggesting that our approach is suitable for general object detection applications.

Wenhua Fang, Jun Chen, Chao Liang, Xiao Wang, Yuanyuan Nan, Ruimin Hu

A Novel Fast Full Frame Video Stabilization via Three-Layer Model

Video stabilization is an important video enhancement technology that aims to remove undesired shaking from input videos. A challenging task in stabilization is to inpaint the missing pixels of undefined areas in the motion-compensated frames. This paper describes a new video stabilization method. It adopts a multi-layer model to improve the efficiency of video stabilization, so that undefined areas can be inpainted in real time. Compared with traditional methods, our proposed algorithm needs to maintain only a single, continuously updated mosaic image for video completion, whereas previous methods must store all neighboring frames and register them with the current frame. The experimental results demonstrate the effectiveness of the proposed approach.

Wei Long, Jie Yang, Dacheng Song, Xiaogang Chen, Xiangjian He

Multimedia Mining and Retrieval

Cross-Modal Self-Taught Learning for Image Retrieval

In recent years, cross-modal methods have been extensively studied in the multimedia literature. Many existing cross-modal methods rely on labeled training data, which is difficult to collect. In this paper we propose a cross-modal self-taught learning (CMSTL) algorithm that is learned from unlabeled multi-modal data. CMSTL adopts a two-stage self-taught scheme. In the multi-modal topic learning stage, both intra-modal similarity and multi-modal correlation are preserved, and different modalities are given different weights in learning the multi-modal topics. In the projection stage, soft assignment is used to learn projection functions. Experimental results on Wikipedia articles and NUS-WIDE show the effectiveness of CMSTL in both cross-modal retrieval and image hashing.

Liang Xie, Peng Pan, Yansheng Lu, Sheng Jiang

Multimedia Social Event Detection in Microblog

Event detection in social media platforms has become an important task. It facilitates the exploration and browsing of events with early plans for preventive measures. The main challenges in event detection lie in the characteristics of social media data, which are short/conversational, heterogeneous, and live. Most existing methods rely only on textual information, ignoring the visual content as well as the intrinsic correlation among the heterogeneous social media data. In this paper, we propose an event detection method that generates an intermediate semantic entity, named the microblog clique (MC), to exploit the highly correlated information among the noisy and short microblogs. The heterogeneous social media data is formulated as a hypergraph, and highly correlated items are grouped to generate the MCs. Based on these MCs, a bipartite graph is constructed and partitioned to detect social events.

Yue Gao, Sicheng Zhao, Yang Yang, Tat-Seng Chua

A Study on the Use of a Binary Local Descriptor and Color Extensions of Local Descriptors for Video Concept Detection

In this work we deal with the problem of how different local descriptors can be extended, used and combined for improving the effectiveness of video concept detection. The main contributions of this work are: 1) We examine how effectively a binary local descriptor, namely ORB, which was originally proposed for similarity matching between local image patches, can be used in the task of video concept detection. 2) Based on a previously proposed paradigm for introducing color extensions of SIFT, we define in the same way color extensions for two other non-binary or binary local descriptors (SURF, ORB), and we experimentally show that this is a generally applicable paradigm. 3) In order to enable the efficient use and combination of these color extensions within a state-of-the-art concept detection methodology (VLAD), we study and compare two possible approaches for reducing the color descriptor’s dimensionality using PCA. We evaluate the proposed techniques on the dataset of the 2013 Semantic Indexing Task of TRECVID.

Foteini Markatopoulou, Nikiforos Pittaras, Olga Papadopoulou, Vasileios Mezaris, Ioannis Patras

Content-Based Image Retrieval with Gaussian Mixture Models

Among the various approaches of content-based image modeling, generative models have become prominent due to their ability of approximating feature distributions with arbitrary accuracy. A frequently encountered generative model for the purpose of content-based image retrieval is the Gaussian mixture model which facilitates the application of various dissimilarity measures. The question of which dissimilarity measure provides the highest retrieval performance in terms of accuracy and efficiency is still an open research question. In this paper, we propose an empirical investigation of dissimilarity measures for Gaussian mixture models based on high-dimensional local feature descriptors. To this end, we include a unifying overview of state-of-the-art dissimilarity measures applicable to Gaussian mixture models along with an extensive performance analysis on a multitude of local feature descriptors. Our findings will help to guide further research in the field of content-based image modeling with Gaussian mixture models.
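One building block of many dissimilarity measures for Gaussian mixture models is the divergence between individual Gaussian components; for the univariate case the Kullback-Leibler divergence has the well-known closed form below (a standard formula, not a result specific to this paper).

```python
import math

def kl_gauss(mu1, var1, mu2, var2):
    """Closed-form KL divergence KL(N(mu1, var1) || N(mu2, var2)) for
    univariate Gaussians, a common ingredient of GMM dissimilarities."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
```

Component-level divergences like this are then combined (e.g. by matching or weighting components) to compare full mixtures.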

Christian Beecks, Merih Seran Uysal, Thomas Seidl

Improving Interactive Known-Item Search in Video with the Keyframe Navigation Tree

In this paper we propose the Keyframe Navigation Tree (KNT) as a navigational aid for interactive search in video. The KNT is a hierarchical visualization of keyframes that can compactly represent the content of a video at different levels of detail. It can be used as an alternative, or in addition, to the common seeker-bar of a video player. Through a user study with 20 participants we show that the proposed navigation approach not only allows significantly faster interactive search in video than a common video player, but also requires significantly less effort (including less mental and physical load) and is much more enjoyable to use.

Marco A. Hudelist, Klaus Schoeffmann, Qing Xu

Large-Scale Image Mining with Flickr Groups

The availability of large annotated visual resources, such as ImageNet, recently led to important advances in image mining tasks. However, the manual annotation of such resources is cumbersome. Exploiting Web datasets as a substitute or complement is an interesting but challenging alternative. The main problems to solve are the choice of the initial dataset and the noisy character of Web text-image associations. This article presents an approach which first leverages Flickr groups to automatically build a comprehensive visual resource and then exploits it for image retrieval. Flickr groups are an interesting candidate dataset because they cover a wide range of user interests. To reduce initial noise, we introduce innovative and scalable image reranking methods. Then, we learn individual visual models for 38,500 groups using a low-level image representation. We exploit off-the-shelf linear models to ensure scalability of the learning and prediction steps. Finally, image descriptions are obtained by concatenating the prediction scores of the individual models and retaining only the most salient responses. To provide a comparison with a manually created resource, a similar pipeline is applied to ImageNet. Experimental validation is conducted on the ImageCLEF Wikipedia Retrieval 2010 benchmark, showing competitive results that demonstrate the validity of our approach.

Alexandru Lucian Ginsca, Adrian Popescu, Hervé Le Borgne, Nicolas Ballas, Phong Vo, Ioannis Kanellos

FISIR: A Flexible Framework for Interactive Search in Image Retrieval Systems

This paper presents a flexible framework for interactive search in image retrieval systems. Our approach allows the visual structure to change in order to produce coherent layouts, which highlight the most relevant results according to user needs. The framework is flexible in the sense that it supports the dynamic creation of several hybrid visual designs based on the combination of different visualization strategies. Results from a subjective evaluation demonstrate that the dynamic hybrid layouts created by the proposed framework provide end-users with an effective user interface for an intuitive browsing and searching experience.

Sheila M. Pinto-Cáceres, Jurandy Almeida, M. Cecília C. Baranauskas, Ricardo da S. Torres

Auditory Scene Classification with Deep Belief Network

Effective modeling and analysis of an auditory scene is crucial to many context-aware and content-based multimedia applications. In this paper, we explore the effectiveness of a multiple-layer generative deep neural network model in discovering the underlying higher-level and highly non-linear probabilistic representations from acoustic data of unstructured auditory scenes. We first create a more compact and representative description of the input audio clip by focusing on the salient regions of the data and modeling their contextual correlations. Next, we exploit a deep belief network (DBN) to discover, in an unsupervised manner, high-level descriptions of the scene audio as the activations of units on the higher hidden layers of the trained DBN model, which are finally classified into a scene category by either the discriminative output layer of the DBN or a separate classifier such as a support vector machine (SVM). The experiments reveal the effectiveness of the proposed DBN-based classification approach for auditory scenes.

Like Xue, Feng Su

An Improved Content-Based Music Recommending Method with Weighted Tags

Content-based filtering is widely used in the music recommendation field. However, the performance of existing content-based methods is unsatisfactory, because those methods simply divide the listened songs into like and unlike sets and ignore the user's degree of preference. In this paper, an enhanced content-based music recommending method is proposed that quantifies the user's preference degree for songs with weighted tags. Firstly, each listened song is classified into the like or unlike set according to the user's playing behaviors, such as skipping and repeating. Secondly, the songs' social tags are collected from the LastFm website and weighted according to their frequency in the collected tags. Finally, the user's preference degree for each song is quantified with the weighted tags, and candidate songs with high preference degrees are recommended to the user. On the LastFm dataset, the experimental results demonstrate that the proposed method outperforms traditional content-based methods in both rating and ranking prediction.
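The three steps above — splitting the listening history by playing behavior, frequency-weighting the tags, and scoring candidates — can be sketched as follows. All data here (the log, the songs, the tags) is hypothetical, and the exact weighting in the paper may differ:

```python
from collections import Counter

# Hypothetical listening log: (song, action); "skip" -> unlike, "repeat"/"full" -> like
log = [("s1", "repeat"), ("s2", "skip"), ("s3", "full"), ("s1", "full")]
liked = {s for s, a in log if a in ("repeat", "full")}

# Hypothetical social tags per song (as collected from a site like LastFm)
tags = {"s1": ["rock", "indie"], "s2": ["pop"], "s3": ["rock"], "s4": ["rock", "pop"]}

# Weight each tag by its frequency across the liked songs' tags
freq = Counter(t for s in liked for t in tags[s])
total = sum(freq.values())
weight = {t: c / total for t, c in freq.items()}

def preference(song):
    """User preference degree for a song: sum of its weighted-tag scores."""
    return sum(weight.get(t, 0.0) for t in tags[song])

# Rank unheard candidates by preference degree
heard = {s for s, _ in log}
candidates = sorted((s for s in tags if s not in heard), key=preference, reverse=True)
print(candidates, preference("s4"))
```

A real system would also use the unlike set (e.g. to penalize tags of skipped songs); this sketch only shows the positive weighting path.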

Lu Ding, Ning Zheng, Jiang Xu, Ming Xu

A Unified Model for Socially Interconnected Multimedia-Enriched Objects

Enabling effective multimedia information processing, analysis, and access applications in online social multimedia settings requires data representation models that capture a broad range of the characteristics of such environments and ensure interoperability. We propose a flexible model for describing Socially Interconnected MultiMedia-enriched Objects (SIMMO) that integrates in a unified manner the representation of multimedia and social features in online environments. Its specification is based on a set of identified requirements and its expressive power is illustrated using several diverse examples. Finally, a comparison of SIMMO with existing approaches demonstrates its unique features.

Theodora Tsikrika, Katerina Andreadou, Anastasia Moumtzidou, Emmanouil Schinas, Symeon Papadopoulos, Stefanos Vrochidis, Ioannis Kompatsiaris

Concept-Based Multimodal Learning for Topic Generation

In this paper, we propose a concept-based multimodal learning model (CMLM) for generating document topics by modeling textual and visual data. Our model considers cross-modal concept similarity and unlabeled image concepts, and it is capable of processing documents with missing modalities. The model can extract semantic concepts from unlabeled images and combine them with the text modality to generate document topics. Our comparison experiments on news document topic generation show that, in the multimodal scenario, CMLM can generate more representative topics for a given document than latent Dirichlet allocation (LDA) based topics.

Cheng Wang, Haojin Yang, Xiaoyin Che, Christoph Meinel

Audio Secret Management Scheme Using Shamir’s Secret Sharing

Audio Secret Sharing (ASS) is a technique used to protect audio data from tampering and disclosure by dividing it into shares such that qualified sets of shares can reconstruct the original audio data. Existing ASS schemes encrypt binary secret messages and rely on the human auditory system for decryption by simultaneously playing authorized shares. This decryption approach tends to overburden the human auditory system as the number of shares used to reconstruct the secret increases [3]. Furthermore, it leaves no room for further analysis or computation on the reconstructed secret, since decryption ends at the human auditory system. Additionally, the schemes in [2], [3], [4], [6] do not extend to the general (k, n) threshold. In this paper we propose an ASS scheme based on Shamir's secret sharing, which is (k, n)-threshold, ideal, and information-theoretically secure, and which provides computationally efficient decryption.
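Shamir's (k, n) scheme itself is standard: the secret becomes the constant term of a random degree k−1 polynomial over a prime field, shares are evaluations of that polynomial, and any k shares recover the constant term by Lagrange interpolation at zero. A minimal sketch for a single integer secret (the prime and parameters are illustrative; a real ASS scheme would share audio samples):

```python
import random

PRIME = 2**31 - 1  # field modulus; any prime larger than the secret works

def make_shares(secret, k, n, seed=42):
    """Split secret into n shares; any k of them reconstruct it."""
    rng = random.Random(seed)
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]
    def poly(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % PRIME
                den = den * (xj - xm) % PRIME
        # pow(den, PRIME - 2, PRIME) is the modular inverse (Fermat's little theorem)
        secret = (secret + yj * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

shares = make_shares(123456, k=3, n=5)
print(reconstruct(shares[:3]))  # any 3 of the 5 shares recover 123456
```

Because reconstruction is arithmetic rather than auditory, the recovered data stays available for further computation, which is the property the abstract emphasizes.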

M. Abukari Yakubu, Namunu C. Maddage, Pradeep K. Atrey

Live Version Identification with Audio Scene Detection

This paper presents a live version music identification system by modifying the conventional cover song identification system. The proposed system includes two stages: a live version identification phase and an audio scene-detection phase. We improve the accuracy of the system by weighting similarity scores in the live version identification phase and discriminating scenes by using RMS, pulse clarity and similarity scores. Results show that the proposed method performs better than the previous method. The final algorithm achieves 70% accuracy on average.

Kazumasa Ishikura, Aiko Uemura, Jiro Katto

Community Detection Based on Links and Node Features in Social Networks

Community detection is a significant but challenging task in the field of social network analysis. Many effective methods have been proposed to solve this problem. However, most of them are based mainly on the topological structure or on node attributes alone. In this paper, building on SPAEM [1], we propose a joint probabilistic model for community detection that combines node attributes and topological structure. In our model, we create a novel feature-based weighted network, in which each edge weight is given by the node-feature similarity between the two endpoints of the edge. We then fuse the original network and the created network with a mixing parameter and employ the expectation-maximization (EM) algorithm to identify communities. Experiments on a diverse set of data collected from Facebook and Twitter demonstrate that our algorithm achieves promising results compared with other algorithms.
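The fusion step can be sketched concretely: build a similarity matrix S from node features, then mix it with the adjacency matrix A using a parameter alpha before running EM. The toy graph, features, cosine similarity, and alpha value below are all assumptions for illustration:

```python
import math

# Toy graph: adjacency matrix and per-node feature vectors (hypothetical data)
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
F = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Feature-based weighted network: each pair weighted by endpoint similarity
n = len(F)
S = [[cosine(F[i], F[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]

# Fuse topology and features with a mixing parameter alpha
alpha = 0.6
W = [[alpha * A[i][j] + (1 - alpha) * S[i][j] for j in range(n)] for i in range(n)]
print(W[0][1], W[0][3])  # the edge between feature-similar nodes is stronger
```

The fused matrix W then plays the role of the weighted network handed to the EM-based community model; the EM step itself is beyond this sketch.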

Fengli Zhang, Jun Li, Feng Li, Min Xu, Richard Xu, Xiangjian He

Multimedia Encoding and Streaming

Scaling and Cropping of Wavelet-Based Compressed Images in Hidden Domain

With the rapid advancement of cloud computing, the use of third-party cloud datacenters for storing and processing (e.g., scaling and cropping) personal and critical images is becoming more common. For storage and bandwidth efficiency, the images are almost always compressed. Although cloud-based imaging has many advantages, security and privacy remain major issues. One way to address these two issues is to use Shamir's (k, n) secret sharing-based secret image sharing schemes, which can distribute the secret image among n participants in such a way that no fewer than k (k ≤ n) participants can know the image content. Existing secret image sharing schemes do not allow processing of a compressed image in the hidden domain. In this paper, we propose a scheme that can scale and crop a CDF (Cohen-Daubechies-Feauveau) wavelet-based compressed image (such as JPEG2000) in the encrypted domain by smartly applying secret sharing on the wavelet coefficients. Results and analyses show that our scheme is highly secure and has acceptable computational and data overheads.

Kshitij Kansal, Manoranjan Mohanty, Pradeep K. Atrey

MAP: Microblogging Assisted Profiling of TV Shows

Online microblogging services, which are increasingly used by people to share and exchange information, have emerged as a promising way to profile multimedia contents, in the sense of providing users a socialized abstraction and understanding of these contents. In this paper, we propose a microblogging profiling framework to provide a social demonstration of TV shows. Challenges for this study lie in two folds: first, most TV shows are not originally from the Internet, and we need to create a connection between these TV shows and online microblogging services; second, contents in a microblogging service are extremely noisy for video profiling, and we need to strategically retrieve the most related information for TV show profiling. To address these challenges, we propose MAP, a microblogging-assisted profiling framework, with contributions as follows: i) we propose a joint user and content retrieval scheme, which uses information about both actors and topics of a TV show to retrieve related microblogs; ii) we propose a social-aware profiling strategy, which profiles a video according to not only its content, but also the social relationships of its microblogging users and its propagation in the social network; iii) we present some interesting analysis, based on our framework, of real-world TV shows.

Xiahong Lin, Zhi Wang, Lifeng Sun

Improved Rate-Distortion Optimization Algorithms for HEVC Lossless Coding

To avoid distortion, quantization is not applied to residues in the HEVC lossless mode. As a result, the conventional lambda model in Rate-Distortion Optimization (RDO), where lambda is derived from the quantization parameter (QP), is unsuitable for lossless coding. This paper first demonstrates the role that the lambda value plays in the rough RDO of HEVC lossless coding, and a Simulated Annealing based approach is proposed to obtain the most appropriate lambda for each largest coding unit. Considering the computational complexity, a second, simplified method using least-squares error prediction is proposed to improve the rough RDO process. Experimental results reveal that, on top of the lossless coding mode, the improved method offers a 1.0%, 1.3% and 1.1% bit-rate reduction on average for the Random-access Main, Low-delay B Main and Low-delay P Main configurations, respectively, while bringing a negligible increase in encoder computational complexity.
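The simulated-annealing search over lambda can be sketched generically: perturb the current lambda, accept worse candidates with a probability that shrinks as the temperature cools, and keep the best value seen. The cost function below is a stand-in parabola — a real encoder would measure the bits produced by each candidate mode decision, not evaluate a formula:

```python
import math
import random

def simulated_annealing(cost, x0, lo, hi, steps=500, t0=1.0, seed=7):
    """Generic SA over a scalar: accept worse moves with temperature-scaled odds."""
    rng = random.Random(seed)
    x, best = x0, x0
    for i in range(steps):
        t = t0 * (1 - i / steps) + 1e-9               # linear cooling schedule
        cand = min(hi, max(lo, x + rng.gauss(0, 0.5)))  # perturb within bounds
        delta = cost(cand) - cost(x)
        if delta < 0 or rng.random() < math.exp(-delta / t):
            x = cand
        if cost(x) < cost(best):
            best = x
    return best

# Hypothetical rough-RDO cost as a function of lambda for one largest coding unit
cost = lambda lam: (lam - 3.2) ** 2 + 1.0
best = simulated_annealing(cost, x0=0.0, lo=0.0, hi=10.0)
print(best)  # converges near the minimizer of the stand-in cost
```

The paper's simplified alternative replaces this search with a least-squares prediction of lambda, trading a little accuracy for much lower complexity.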

Fangdong Chen, Houqiang Li

A Novel Error Concealment Algorithm for H.264/AVC

To benefit network transmission, the bit stream of a whole frame coded by H.264/AVC is usually grouped into one packet. However, packet loss during transmission leads to distortion of the reconstructed video and to error propagation. To deal with this problem, error concealment (EC) strategies are widely used to recover lost frames and to weaken the effect of error propagation. In this paper, we propose a new EC method which uses both forward and backward motion information to recover the motion information of the current lost frame. In addition, for pixels whose motion information differs greatly from that of their neighboring pixels, we propose to use spatial correlation to fill in the pixels by minimizing the total variation (TV) norm. With the proposed algorithm, we can obtain the motion vectors of the lost frames and improve the accuracy of the motion vectors derived by the unidirectional recovery method. Moreover, the optimization strategy, minimizing the TV-norm for pixels with markedly different motion vectors, helps the decoder recover the reconstructed frames more accurately. Experimental results show that the proposed algorithm achieves better objective and subjective performance compared to well-known schemes.

Jinlei Zhang, Houqiang Li

Edge Direction-Based Fast Coding Unit Partition for HEVC Screen Content Coding

High Efficiency Video Coding (HEVC) is a new video coding standard that presents numerous advantages over previous video coding standards. However, Rate-Distortion Optimization (RDO) complexity is extremely high for screen content coding, which prevents real-time performance. This paper proposes a fast and efficient algorithm that partitions coding units (CUs) based on their relationship with the edge direction. The Sobel operator is used to determine the edge direction over the whole image before intra prediction. The key point of this algorithm is to determine the relationship between the edge direction and the CUs. Experimental results show that the proposed edge direction-based CU partition algorithm reduces screen content coding processing time by up to 39%, with a small increase in bit-rate (0.7% on average) and a negligible reduction in the PSNR value.
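The Sobel step is standard: convolve the luma samples with horizontal and vertical kernels and take the gradient angle as the edge direction. A minimal sketch on a tiny synthetic block (the image and the decision rules driven by the angle are illustrative, not the paper's full partition logic):

```python
import math

# Sobel kernels for horizontal (SX) and vertical (SY) intensity change
SX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel(img, y, x):
    """Gradient components and edge-direction angle (degrees) at (y, x)."""
    gx = sum(SX[j][i] * img[y + j - 1][x + i - 1] for j in range(3) for i in range(3))
    gy = sum(SY[j][i] * img[y + j - 1][x + i - 1] for j in range(3) for i in range(3))
    return gx, gy, math.degrees(math.atan2(gy, gx))

# Synthetic block with a vertical edge: intensity jumps along x
img = [[0, 0, 255, 255]] * 4
gx, gy, angle = sobel(img, 1, 1)
print(gx, gy, angle)  # strong gx, zero gy: gradient along x, i.e. a vertical edge
```

In the proposed scheme this per-pixel direction, aggregated per CU, decides whether a CU is split further, which is what removes most of the RDO work.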

Mengmeng Zhang, Yangxiao Ou

Signal-Aware Parametric Quality Model for Audio and Speech over IP Networks

This paper proposes a new signal-aware parametric quality assessment model for audio and speech over IP. The perceptual importance of the reproduced packets and their relevant neighborhoods is included in the model. The model is developed from a purpose-built audio and speech quality assessment framework using Artificial Neural Networks (ANN). The overall quality is evaluated by combining the signal-aware parameters with network parameters using a large set of audio and speech samples. It is shown that the signal-aware approach achieves higher correlation with the improved PEAQ outputs than other parametric methods such as the E-model.

SongBo Xie, Yuhong Yang, Ruimin Hu, Yanye Wang, Hongjiang Yu, ShaoLong Dong, Li Gao, Cheng Yang

3D and Augmented Reality

Patch-Based Disparity Remapping for Stereoscopic Images

Post-production and processing for stereoscopic 3D have attracted a lot of attention in recent years. In particular, the acquired disparity in most situations requires further manipulation to suit different viewing conditions. This paper proposes a novel method to address the issue of disparity remapping for stereoscopic images. We present a nonlinear disparity mapping model to adjust the depth range of the whole image as well as of specific visually important regions. To implement this model, our method computes saliency maps for the stereoscopic images. We then extend the PatchMatch algorithm to search for proper patches in both the left and the right images under combined visual constraints, and use them to iteratively refine the images to meet the target depth range. Our method is capable of minimizing the distortion of the images and ensuring correct stereo consistency after disparity remapping. The experimental results demonstrate that the proposed approach can adjust the depth range to improve the stereoscopic effect while preserving the naturalness of the scene.

Dawei Lu, Huadong Ma, Liang Liu, Huiyuan Fu

3D Depth Perception from Single Monocular Images

Depth perception from single monocular images is a challenging problem in computer vision. Since a single image lacks contextual features, all cues must be found within the image itself. This paper presents a novel method for 3D depth perception from a single monocular image containing the ground, which estimates absolute depth maps more accurately. Unlike previous methods, our method first generates a ground-plane depth coordinate system from the single monocular image using the image-forming principle, and then locates the objects in the image within that coordinate system using their geometric characteristics. Finally, we provide a method to estimate accurate depth maps. The experiments show that our method outperforms state-of-the-art single-image depth perception methods in both relative and absolute depth perception.
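The ground-plane geometry behind such methods follows from the standard pinhole model: for a point on a flat ground, depth falls out of the camera height h, the focal length f (in pixels), and the pixel row v relative to the horizon row v0, as Z = f·h / (v − v0). The numbers below are hypothetical, and the paper's exact formulation may differ:

```python
def ground_depth(v, v0=240.0, f=500.0, h=1.5):
    """Depth (meters) of a ground-plane pixel at image row v.

    Pinhole-camera ground geometry: Z = f * h / (v - v0), where v0 is the
    horizon row, f the focal length in pixels, and h the camera height.
    """
    assert v > v0, "pixel must lie below the horizon to intersect the ground"
    return f * h / (v - v0)

print(ground_depth(340.0), ground_depth(440.0))  # lower rows map to closer depths
```

Anchoring object bases to this ground coordinate system is what turns relative depth ordering into absolute depth estimates.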

Hang Xu, Kan Li, FuYu Lv, JianMeng Pei

Muscular Movement Model Based Automatic 3D Facial Expression Recognition

Facial expression is the most important channel of human nonverbal communication. This paper presents a novel and effective approach to automatic 3D Facial Expression Recognition (FER) based on the Muscular Movement Model (MMM). In contrast to most existing methods, MMM approaches the problem from the viewpoint of anatomy. It first automatically segments the input face by localizing the corresponding points around each muscular region of the reference face using the Iterative Closest Normal Pattern (ICNP) method. A set of shape features based on multiple differential quantities, including coordinates, normals and shape index values, is then extracted to describe the geometric deformation of each segmented region. MMM thus combines the advantages of model-based techniques with those of feature-based ones. Meanwhile, we analyze the importance of these muscular areas, and a score-level fusion strategy which optimizes the weights of the muscular areas using a Genetic Algorithm (GA) is proposed in the learning step. The muscular areas with their optimal weights are finally combined to predict the expression label. The experiments are carried out on the BU-3DFE database, and the results clearly demonstrate the effectiveness of the proposed method.

Qingkai Zhen, Di Huang, Yunhong Wang, Liming Chen

Azimuthal Perceptual Resolution Model Based Adaptive 3D Spatial Parameter Coding

The spatial perceptual characteristics of human ears play an important role in the quantization of spatial parameters. Human hearing is most sensitive to sound from the frontal direction, less sensitive to the rear, and least sensitive to the lateral sides. The traditional quantization method for the spatial parameters of a frontal channel pair in stereo spatial audio coding is therefore not suitable for channel pairs in other directions, such as lateral or rear directions, in multichannel audio coding. An adaptive spatial parameter coding scheme for 3D multichannel audio signals, based on an azimuthal perceptual resolution model, is proposed in this paper. Based on the omnidirectional and non-uniform azimuthal perceptual resolution model of human ears for sound sources in different directions, quantization values of spatial parameters can be estimated adaptively according to the location and configuration of channel pairs in arbitrary directions. The density of the quantization steps for the spatial parameters of channel pairs in arbitrary directions corresponds to the non-uniform azimuthal perceptual resolution of human ears, so quantization noise can be effectively kept below the directional perceptual threshold, improving the reproduced spatial sound quality.

Li Gao, Ruimin Hu, Yuhong Yang, Xiaocheng Wang, Weiping Tu, Tingzhao Wu

Flat3D: Browsing Stereo Images on a Conventional Screen

Expensive and cumbersome 3D equipment currently limits the popularization of emerging stereo media on the Internet. In particular, for stereo images, a major kind of stereo media widespread on the Internet, there is not yet a good solution for showing stereoscopy on conventional displays. By investigating the principles of the human visual system (HVS), this paper proposes a method, called Flat3D, for conveying stereoscopy on a conventional (2D) screen based on motion parallax. The images are exhibited by dynamically transforming consistent views from left to right and then playing them back in reverse. The relative motion impresses viewers with a strong depth perception. We investigate the factors that affect the viewing experience in Flat3D and find that a reasonable fixation point and structure-preserving view transitions contribute the most. Based on these findings, we develop an adaptive fixation acquisition approach combining color and depth cues, and employ probability-based view synthesis to generate the view sequences. Experiments comparing configurations with and without the above factors show that our approach is a more convenient, effective and automatic alternative for browsing stereo images on common flat screens.

Wenjing Geng, Ran Ju, Xiangyang Xu, Tongwei Ren, Gangshan Wu

Online 3D Shape Segmentation by Blended Learning

This paper presents a novel online 3D shape segmentation framework which blends two learning methods: an unsupervised clustering-based method and a supervised progressive learning method. The features of this framework lie in four aspects. Firstly, we use weighted online learning to train a segmentation model, realizing the blended learning framework. Secondly, we perform co-segmentation based on unsupervised clustering to analyze the shape set and initialize the segmentation model. Thirdly, based on this segmentation model, users can segment new shapes using the supervised progressive learning method, and the segmentation model can be incrementally updated by weighted online learning during progressive segmentation. Finally, the segmentation of shapes in the initial set can be corrected based on the updated segmentation model. Experimental results demonstrate the effectiveness of our approach.

Feiqian Zhang, Zhengxing Sun, Mofei Song, Xufeng Lang

Factorizing Time-Aware Multi-way Tensors for Enhancing Semantic Wearable Sensing

Automatic concept detection is a crucial aspect of automatically indexing unstructured multimedia archives. However, the currently prevalent one-per-class detectors neglect inherent concept relationships and operate in isolation. This is insufficient when analyzing content gathered from wearable visual sensing, in which concepts occur with high diversity and with context-dependent correlation. This paper presents a method to enhance concept detection results by constructing and factorizing a multi-way concept detection tensor in a time-aware manner. We derive a weighted non-negative tensor factorization algorithm, apply it to model concepts' temporal occurrence patterns, and show how it boosts overall detection performance. The potential of our method is demonstrated on lifelog datasets with varying levels of original concept detection accuracy.
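The weighting idea can be illustrated in the simpler matrix case: factorize a concept-by-time score matrix under per-entry confidence weights, so unreliable detections influence the factors less. The multiplicative-update rules below are the standard weighted NMF updates, shown here as a matrix analogue of the paper's tensor factorization; the toy matrix, rank, and weights are assumptions:

```python
import numpy as np

def weighted_nmf(V, W, r, iters=200, seed=0):
    """Weighted non-negative factorization: minimize ||W * (V - A @ B)||_F
    with multiplicative updates, where W holds per-entry confidences."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    A = rng.random((m, r)) + 0.1
    B = rng.random((r, n)) + 0.1
    for _ in range(iters):
        WV = W * V
        # Standard weighted multiplicative updates (small epsilon avoids div-by-zero)
        A *= (WV @ B.T) / ((W * (A @ B)) @ B.T + 1e-9)
        B *= (A.T @ WV) / (A.T @ (W * (A @ B)) + 1e-9)
    return A, B

# Toy concept-by-time detection-score matrix; one entry treated as unreliable
V = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
W = np.ones_like(V)
W[0, 2] = 0.1  # low confidence for the detection at (concept 0, time 2)
A, B = weighted_nmf(V, W, r=2)
print(np.round(A @ B, 2))  # low-rank reconstruction close to V
```

In the paper this generalizes to a multi-way tensor with a time mode, so that co-occurring concepts reinforce each other's detection scores over time.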

Peng Wang, Alan F. Smeaton, Cathal Gurrin

