About this book

The two-volume set LNCS 8935 and 8936 constitutes the thoroughly refereed proceedings of the 21st International Conference on Multimedia Modeling, MMM 2015, held in Sydney, Australia, in January 2015. The 49 revised regular papers and 24 poster presentations were carefully reviewed and selected from 189 submissions. For the three special sessions, a total of 18 papers were accepted for MMM 2015. The three special sessions are Personal (Big) Data Modeling for Information Access and Retrieval; Social Geo-Media Analytics and Retrieval; and Image or Video Processing, Semantic Analysis and Understanding. In addition, 9 demonstrations and 9 video showcase papers were accepted for MMM 2015. The accepted contributions included in these two volumes represent the state of the art in multimedia modeling research and cover a diverse range of topics including image and video processing, multimedia encoding and streaming, applications of multimedia modelling, and 3D and augmented reality.

Table of Contents



A Proxemic Multimedia Interaction over the Internet of Things

With the rapid growth of online devices, a new concept, the Internet of Things (IoT), is emerging in which everyday devices will be connected to the Internet. As the number of devices in the IoT increases, so does the complexity of the interactions between the user and the devices. There is a need to design intelligent user interfaces that can assist users in interacting with multiple devices. The present study proposes a proximity-based user interface for multimedia devices over the IoT. The proposed method employs a cloud-based decision engine to help the user choose and interact with the most appropriate device, relieving the user of the burden of enumerating available devices manually. The decision engine observes the multimedia content and device properties, learns user preferences adaptively, and automatically recommends the most appropriate device to interact with. The system evaluation shows that users agree with the proposed interaction 70% of the time.

Ali Danesh, Mukesh Saini, Abdulmotaleb El Saddik

Outdoor Air Quality Inference from Single Image

Along with rapid urbanization and industrialization, many developing countries are suffering from air pollution. Air quality varies non-linearly, and the effective range of an air quality monitoring station is limited. Because cities have few air quality monitoring stations, it is difficult to know the exact air quality at every location. How to obtain the air quality quickly and conveniently is a problem that will attract much attention. In this paper, we present an approach that infers air quality from a single image using an air quality index (AQI) decision tree. We first extract several features such as medium transmission, power spectrum slope, contrast, and saturation from the single image. Then we construct a decision tree of AQI values according to the distances between the extracted features. For each non-leaf node of the decision tree, we use five classifiers to choose the next node. We collect a dataset of high-quality registered and calibrated images named the Outdoor Air Quality Image Set (OAQIS). The dataset covers a wide range of daylight illumination and air pollution conditions. We evaluate our approach on the dataset, and the results show the effectiveness of our method.

Zheng Zhang, Huadong Ma, Huiyuan Fu, Xinpeng Wang
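
Two of the image features named in the abstract above, contrast and saturation, can be illustrated with a minimal sketch. This is not the authors' feature extractor: the RMS definition of contrast, the HSV definition of saturation, the function name `image_features`, and the sample pixel lists are all assumptions chosen for illustration.

```python
import colorsys
import math

def image_features(pixels):
    """Return (rms_contrast, mean_saturation) for (r, g, b) pixels in 0..255."""
    # Luminance with Rec. 601 weights, scaled to 0..1.
    lum = [(0.299 * r + 0.587 * g + 0.114 * b) / 255.0 for r, g, b in pixels]
    mean_lum = sum(lum) / len(lum)
    # RMS contrast: standard deviation of the luminance values.
    rms_contrast = math.sqrt(sum((v - mean_lum) ** 2 for v in lum) / len(lum))
    # Saturation: the S channel of the HSV colour model.
    sat = [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)[1]
           for r, g, b in pixels]
    return rms_contrast, sum(sat) / len(sat)

# Hypothetical pixel samples: a vivid clear-day scene and a washed-out hazy one.
clear = [(30, 90, 200), (240, 240, 240), (20, 120, 40), (200, 60, 30)]
hazy = [(150, 155, 160), (170, 170, 172), (140, 146, 150), (160, 158, 155)]

clear_contrast, clear_saturation = image_features(clear)
hazy_contrast, hazy_saturation = image_features(hazy)
```

On these made-up samples the hazy pixels yield lower contrast and lower saturation than the clear ones, which is the kind of signal a classifier at a decision tree node could use.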

Multimodal Music Mood Classification by Fusion of Audio and Lyrics

Mood analysis from music data has attracted increasing attention in both research and applications in recent years. In this paper, we propose a novel multimodal approach for music mood classification incorporating audio and lyric information, which consists of three key components: 1) lyric feature extraction with a recursive hierarchical deep learning model, preceded by lyric filtering with discriminative reduction of vocabulary and synonymous lyric expansion; 2) saliency-based audio feature extraction; 3) a Hough forest based fusion and classification scheme that fuses the two modalities at the more fine-grained sentence level, utilizing the time alignment across modalities. The effectiveness of the proposed model is verified by experiments on a real dataset containing more than 3000 minutes of music.

Hao Xue, Like Xue, Feng Su

Multidimensional Context Awareness in Mobile Devices

With the increase of mobile computation ability and the development of wireless network transmission technology, mobile devices are not only important tools of personal life (e.g., education and entertainment), but have also emerged as the indispensable "secretary" of business activities (e.g., email and phone calls). However, since mobile devices may work under complex and dynamic local and network conditions, they are vulnerable to local and remote security attacks. In real applications, different kinds of data protection are required by various local contexts. To provide appropriate protection, we propose a multidimensional context scheme to comprehensively model and characterize the scene and activity of mobile users. Further, based on the scheme and RBAC (Role-based Access Control), we also develop a novel access control system. Our experimental results indicate that it achieves promising performance compared with traditional RBAC.

Zhuo Wei, Robert H. Deng, Jialie Shen, Jixiang Zhu, Kun Ouyang, Yongdong Wu

AttRel: An Approach to Person Re-Identification by Exploiting Attribute Relationships

Person Re-Identification refers to recognizing people across cameras with non-overlapping capture areas. To recognize people, their images must be represented by feature vectors for matching. Recent state-of-the-art approaches employ semantic features, also known as attributes (e.g. wearing-bags, jeans, skirt), for representation. However, such representations are sensitive to attribute detection results, which can be irrelevant due to noise. In this paper, we propose an approach that exploits relationships between attributes to refine attribute detection results. Experimental results on benchmark datasets (VIPeR and PRID) demonstrate the effectiveness of our proposed approach.

Ngoc-Bao Nguyen, Vu-Hoang Nguyen, Thanh Ngo Duc, Duy-Dinh Le, Duc Anh Duong

Sparsity-Based Occlusion Handling Method for Person Re-identification

Person re-identification, which refers to recognizing people across non-overlapping surveillance cameras, has recently attracted a lot of research interest. However, person re-identification is a very challenging task due to variations in illumination, viewpoint and occlusion. Existing methods address these difficulties by designing robust feature representations or learning a proper distance metric. Although these methods have achieved satisfactory performance under illumination and viewpoint changes, few of them can genuinely handle the occlusion problem that frequently occurs in real scenes. This paper proposes a sparsity-based patch matching method to handle the occlusion problem in person re-identification. Its core idea is to use a sparse representation model to determine the occlusion state of each image patch, which is then used to adjust the weight of patch pairs in the feature matching process. Extensive comparative experiments conducted on two widely used datasets have shown the effectiveness of the proposed method.

Bingyue Huang, Jun Chen, Yimin Wang, Chao Liang, Zheng Wang, Kaimin Sun

Visual Attention Driven by Auditory Cues: Selecting Visual Features in Synchronization with Attracting Auditory Events

Human visual attention can be modulated not only by visual stimuli but also by stimuli from other modalities such as audition. Hence, incorporating auditory information into a human visual attention model is a key issue for building more sophisticated models. However, how to integrate multiple pieces of information arising from the audio-visual domains remains a challenging problem. This paper proposes a novel computational model of human visual attention driven by auditory cues. Founded on the Bayesian surprise model, which is considered promising in the literature, our model uses surprising auditory events as a clue for selecting synchronized visual features and then emphasizes the selected features to form the final surprise map. Our approach to audio-visual integration focuses on using only the effective visual features, rather than all available features, for simulating visual attention with the help of auditory information. Experiments using several video clips show that our proposed model simulates the eye movements of human subjects better than other existing models, even though our model uses a smaller number of visual features.

Jiro Nakajima, Akisato Kimura, Akihiro Sugimoto, Kunio Kashino

A Synchronization Ground Truth for the Jiku Mobile Video Dataset

This paper introduces and describes a manually generated synchronization ground truth, accurate to the level of the audio sample, for the Jiku Mobile Video Dataset, a dataset containing hundreds of videos recorded by mobile users at different events with drama, dancing and singing performances. It aims at encouraging researchers to evaluate the performance of their audio, video, or multimodal synchronization methods on a publicly available dataset, to facilitate easy benchmarking, and to ease the development of mobile video processing methods like audio and video quality enhancement, analytics and summary generation that depend on an accurately synchronized dataset.

Mario Guggenberger, Mathias Lux, Laszlo Böszörmenyi

Mobile Image Analysis: Android vs. iOS

Currently, computer vision applications are becoming more common on mobile devices due to the constant increase in raw processing power coupled with extended battery life. The OpenCV framework is a popular choice when developing such applications on desktop computers as well as on mobile devices, but there are few comparative performance studies available. We know of only one such study that evaluates a set of typical OpenCV operations on iOS devices. In this paper we look at the same operations, spanning from simple image manipulation like grayscaling and blurring to keypoint detection and descriptor extraction but on flagship Android devices as well as on iOS devices and with different image resolutions. We compare the results of the same tests running on the two platforms on the same datasets and provide extended measurements on completion time and battery usage.

Claudiu Cobârzan, Marco A. Hudelist, Klaus Schoeffmann, Manfred Jürgen Primus

Dynamic User Authentication Based on Mouse Movements Curves

In this paper we describe a behavioural biometric approach to authenticate users dynamically based on mouse movements only and using regular mouse devices. Unlike most of the previous approaches in this domain, we focus here on the properties of the curves generated from the consecutive mouse positions during typical mouse movements. Our underlying hypothesis is that these curves have enough discriminative information to recognize users. We conducted an experiment to test and validate our model in which ten participants are involved. A back propagation neural network is used as a classifier. Our experimental results show that behavioural information with discriminating features is revealed during normal mouse usage, which can be employed for user modeling for various reasons, such as information asset protection.

Zaher Hinbarji, Rami Albatal, Cathal Gurrin
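
To make the idea of curve properties derived from consecutive mouse positions concrete, here is a minimal sketch of three generic descriptors. The abstract above does not list the features the authors feed to their neural network, so the descriptors, the function name `curve_features`, and the sample point lists are illustrative assumptions.

```python
import math

def curve_features(points):
    """Return (path_length, straightness, total_turning) for a list of (x, y) points.

    straightness is chord length divided by path length (1.0 for a straight line);
    total_turning is the sum of absolute heading changes in radians.
    """
    path = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    chord = math.dist(points[0], points[-1])
    turning = 0.0
    for a, b, c in zip(points, points[1:], points[2:]):
        heading_in = math.atan2(b[1] - a[1], b[0] - a[0])
        heading_out = math.atan2(c[1] - b[1], c[0] - b[0])
        # Wrap the heading change into [-pi, pi) before taking its magnitude.
        delta = (heading_out - heading_in + math.pi) % (2 * math.pi) - math.pi
        turning += abs(delta)
    straightness = chord / path if path > 0 else 1.0
    return path, straightness, turning

# Hypothetical movements: a straight stroke and a right-angle stroke.
straight = curve_features([(0, 0), (1, 0), (2, 0)])
corner = curve_features([(0, 0), (1, 0), (1, 1)])
```

A vector of such descriptors, computed per movement, is the kind of input a back propagation neural network classifier could be trained on.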

Sliders Versus Storyboards – Investigating Interaction Design for Mobile Video Browsing

We present a comparative study of two different interfaces for mobile video browsing on tablet devices following two basic concepts - storyboard designs representing a video’s content in a grid-like arrangement of static images extracted from the file, and slider interfaces enabling users to interactively skim a video’s content at random speed and direction along the timeline. Our results confirm the usefulness and usability of both designs but do not suggest a clear benefit of either of them in the direct comparison, recommending – among other identified design issues – an interface integrating both concepts.

Wolfgang Hürst, Miklas Hoet

Performance Evaluation of Students Using Multimodal Learning Systems

Multimodal learning, as an effective method for helping students to understand complex concepts, has attracted much research interest recently. Using more than one medium in the learning process typically makes the study material easier to grasp. In the current study, students annotate linguistic and visual elements in multimodal texts by using geometric shapes and assigning attributes. However, evaluating student performance effectively is a challenge. This work proposes to make use of a vector space model to process student-generated multimodal data, with a view to evaluating student performance based on the annotation data. The vector model consists of fuzzy membership functions that model performance on the various annotation criteria. These vectors are then used as the input to a multi-criteria ranking framework to rank the students.

Subhasree Basu, Roger Zimmermann, Kay L. O’Halloran, Sabine Tan, Marissa K.L.E.
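
The pipeline sketched in the abstract above (fuzzy membership per criterion, then multi-criteria ranking) can be illustrated as follows. This is a sketch under stated assumptions, not the authors' model: the linear ramp membership function, the weighted sum used in place of their ranking framework, the criterion names, bounds, weights, and student data are all hypothetical.

```python
def ramp(x, low, high):
    """Fuzzy membership in 'good performance': 0 at or below low, 1 at or above high."""
    return min(1.0, max(0.0, (x - low) / (high - low)))

def rank_students(raw_scores, bounds, weights):
    """Rank student names by a weighted sum of per-criterion memberships."""
    totals = {}
    for name, scores in raw_scores.items():
        # One membership value per criterion forms the student's vector.
        vector = {c: ramp(scores[c], *bounds[c]) for c in bounds}
        totals[name] = sum(weights[c] * vector[c] for c in bounds)
    return sorted(totals, key=totals.get, reverse=True)

# Hypothetical criteria: annotation accuracy (fraction) and number of annotations.
bounds = {"accuracy": (0.5, 1.0), "coverage": (0, 20)}
weights = {"accuracy": 0.6, "coverage": 0.4}
raw_scores = {
    "student_a": {"accuracy": 0.90, "coverage": 10},
    "student_b": {"accuracy": 0.70, "coverage": 18},
    "student_c": {"accuracy": 0.55, "coverage": 5},
}
ranking = rank_students(raw_scores, bounds, weights)
```

A weighted sum is only the simplest way to combine criteria; other multi-criteria methods would consume the same membership vectors.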

Is Your First Impression Reliable? Trustworthy Analysis Using Facial Traits in Portraits

As a basic human quality, trustworthiness plays an important role in social communications. In this paper, we propose a novel concept to predict people’s trustworthiness at first sight using facial traits. Firstly, personality-toward traits were designed from psychology, including permanent traits and transient traits. Then, a mixture of feature descriptors consisting of Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP) and geometrical descriptions was adopted to describe personality traits. Finally, we trained the personality traits with LibSVM to determine the trustworthiness of a person from a portrait. Experiments demonstrated the effectiveness of our method, which improved precision by 33.60%, recall by 20.33% and F1-measure by 25.63% compared with a baseline method when determining whether a person is trustworthy or not. Feature contribution analysis was applied to further unveil the correspondence between features and personality. A demonstration showed visual patterns in portrait collages of trustworthy people that further supported the effectiveness of our method.

Yan Yan, Jie Nie, Lei Huang, Zhen Li, Qinglei Cao, Zhiqiang Wei
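
Of the descriptors named in the abstract above, Local Binary Patterns are simple enough to sketch in a few lines. The following shows the basic 3x3 LBP operator and a histogram over an image; the clockwise bit ordering starting at the top-left neighbour is one common convention and is an assumption here, as are the function names. It is not the authors' feature pipeline.

```python
def lbp_code(patch):
    """8-bit LBP code for the centre pixel of a 3x3 grayscale patch (3 rows of 3)."""
    center = patch[1][1]
    # Neighbours visited clockwise, starting at the top-left corner.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if patch[r][c] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(image):
    """256-bin histogram of LBP codes over the interior pixels of a 2D list."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist

flat_hist = lbp_histogram([[7] * 4 for _ in range(4)])
```

The histogram, often computed per face region and concatenated, is the kind of vector that can be passed to an SVM.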

Wifbs: A Web-Based Image Feature Benchmark System

Automatic analysis of image data is of high importance for many applications. Given an image classification problem one needs three things: (i) training data and tools to extract (ii) relevant visual information (usually image features) that can be used by (iii) classification algorithms. For given (i), a multitude of candidates present themselves for (ii) and (iii). Model selection becomes the main issue. We present a web-based feature benchmark system enabling system designers to streamline tool-chains to specific needs using available implementations of candidate tools. Our system features a modular architecture, remote and parallel computing, extensibility and, from a user’s standpoint, platform independence due to its web-based nature. Using Wifbs, image features can be subjected to a sophisticated and unbiased model selection procedure to compose optimized pipelines for given image classification problems.

Marcel Spehr, Sebastian Grottel, Stefan Gumhold

Personality Modeling Based Image Recommendation

With the increasing proliferation of data production technologies (like cameras) and consumption avenues (like social media) multimedia has become an interaction channel among users today. Images and videos are being used by the users to convey innate preferences and tastes. This has led to the possibility of using multimedia as a source for user-modeling, thereby contributing to the field of personalization, recommender systems, content generation systems and so on. This work investigates approaches for modeling personality traits (based on the Five Factor Modeling approach) of users based on a collection of images they tag as ‘favorite’ on Flickr. It presents several insights for improving the personality estimation performance by proposing better features and modeling approaches. The efficacy of the improved personality modeling approach is demonstrated by its use in an image recommendation system with promising results.

Sharath Chandra Guntuku, Sujoy Roy, Lin Weisi

Aesthetic QR Codes Based on Two-Stage Image Blending

With the popularization of smart phones, the Quick Response (QR) code has become one of the most widely used two-dimensional barcodes. A standard QR code consists of black and white squares (called modules), and its noise-like appearance can seriously disrupt the aesthetic appeal of its carrier, e.g., a poster. To make QR codes more aesthetic, this paper proposes an automatic approach for blending a visually unattractive QR code with a background image. This approach consists of two stages: module-based blending and pixel-based blending. At the first stage, a binary aesthetic QR code is generated module by module. At the second stage, a color aesthetic QR code is further generated pixel by pixel. The advantages of our approach are: 1) greatly enhancing the aesthetic appearance of the original QR code, 2) maintaining the error correction capacity of the standard QR code, 3) allowing full-area blending of various photographs, drawings and graphics. Experimental results demonstrate that our approach produces high quality QR codes in terms of both visual appearance and readability.

Yongtai Zhang, Shihong Deng, Zhihong Liu, Yongtao Wang

Person Re-identification Using Data-Driven Metric Adaptation

Person re-identification, which aims to identify images of the same person from various cameras placed in different locations, has attracted plenty of attention in the multimedia community. In the person re-identification procedure, choosing a proper distance metric is a crucial aspect [2]. Traditional methods always utilize a uniform learned metric, which ignores two specific constraints of the re-identification task: the learned metric is highly prone to over-fitting [21], and each person's unique characteristics bring inconsistency. Therefore, it is inappropriate to merely employ a uniform metric. In this paper, we propose a data-driven metric adaptation method to improve the uniform metric. The key novelty of the approach is that we re-exploit the training data with cross-view consistency to adaptively adjust the metric. Experiments conducted on two standard data sets have validated the effectiveness of the proposed method, with a significant improvement over baseline methods.

Zheng Wang, Ruimin Hu, Chao Liang, Junjun Jiang, Kaimin Sun, Qingming Leng, Bingyue Huang

A Novel Optimized Watermark Embedding Scheme for Digital Images

Many researchers in image processing are seeking an efficient way to protect digital multimedia. Although standards and criteria are still under development, watermarking, which embeds a mark picture and extracts it with the help of the original image, has been identified as a major technology for ownership and copyright protection. This paper aims to find a more efficient way to embed a watermark into a gray-scale original image, using the Artificial Bee Colony algorithm to optimize pixel-by-pixel embedding at different frequency levels (sub-bands) of the Discrete Wavelet Transform (DWT) in order to enhance the security, invisibility to the human visual system, and robustness of image watermarking. The proposed scheme targets higher levels of DWT decomposition, which provide better robustness but lower watermarked image quality, and achieves better watermarked image quality and a more visible watermark than random embedding. The proposed embedding method has been tested against most types of image modifications, in different frequency domains and at different DWT levels, and provides both high-quality watermarked images and superior robustness.

Feng Sha, Felix Lo, Yuk Ying Chung, Xiaoming Chen, Wei-Chang Yeh
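
The general idea of embedding a watermark in a DWT sub-band and extracting it with the help of the original can be sketched in one dimension. This is a deliberately simplified illustration, not the authors' scheme: it uses a single-level Haar transform on one row of pixels, a fixed additive strength, and no Artificial Bee Colony optimization. All function names and the sample row are assumptions.

```python
import math

SQRT2 = math.sqrt(2)

def haar_forward(signal):
    """One-level Haar DWT of an even-length sequence: (approximation, detail)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / SQRT2 for a, b in pairs]
    detail = [(a - b) / SQRT2 for a, b in pairs]
    return approx, detail

def haar_inverse(approx, detail):
    """Inverse of haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) / SQRT2, (a - d) / SQRT2]
    return out

def embed(signal, bits, strength=2.0):
    """Add +strength (bit 1) or -strength (bit 0) to the leading detail coefficients."""
    approx, detail = haar_forward(signal)
    marked = list(detail)
    for i, bit in enumerate(bits):
        marked[i] += strength if bit else -strength
    return haar_inverse(approx, marked)

def extract(marked_signal, original, n_bits):
    """Non-blind extraction: compare detail coefficients with the original's."""
    _, d_marked = haar_forward(marked_signal)
    _, d_orig = haar_forward(original)
    return [1 if m - o > 0 else 0 for m, o in zip(d_marked[:n_bits], d_orig[:n_bits])]

row = [52, 55, 61, 59, 79, 61, 76, 41]   # hypothetical gray-scale pixel row
bits = [1, 0, 1, 1]
marked_row = embed(row, bits)
recovered = extract(marked_row, row, len(bits))
```

The strength parameter trades invisibility against robustness; choosing where and how strongly to embed is the part an optimizer such as Artificial Bee Colony would search over.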

User-Centred Evaluation to Interface Design of E-Books

A good interface design of e-books is convenient for users, whereas a poor design can disorientate them. In this study, we conduct an experiment on four test scenarios: the Zinio iPad version, Zinio PC version, MagV iPad version, and MagV PC version. The study includes 48 subjects (12 men and 36 women), the majority being long-term Internet users. We use performance measurement, retrospective testing, and a semistructured questionnaire to analyse the operation time of five experimental tasks, error frequency, and subjective satisfaction. The results show that the Zinio interface design is user-friendly, and that the iPad version (using a touchscreen) facilitates engaging in the experimental tasks compared with the PC version (with a mouse and/or a keyboard). In addition, most participants prefer to combine two or three options to complete the experimental tasks. The results can provide designers with useful insight into designing a proper user model that best meets user requirements.

Yang-Cheng Lin

A New Image Decomposition and Reconstruction Approach – Adaptive Fourier Decomposition

Fourier analysis has been a powerful mathematical tool for representing a signal as an expression consisting of sines and cosines. Recently, Prof. Tao Qian proposed a new signal decomposition theory named Adaptive Fourier Decomposition (AFD), which has an advantage in time-frequency analysis over Fourier decomposition and avoids the fixed window size problem of methods such as the short-time Fourier transform. Studies show that AFD can quickly decompose signals into positive-frequency functions with good analytical properties. In this paper we apply AFD to image decomposition and reconstruction for the first time in the literature; the approach shows promising results and lays a foundation for image compression.

Can He, Liming Zhang, Xiangjian He, Wenjing Jia

Video Showcase

Graph-Based Browsing for Large Video Collections

We present a graph-based browsing system for visually searching video clips in large collections. It is an extension of a previously proposed system which allows visual browsing in millions of images using a hierarchical pyramid structure of images sorted by their similarities. Image subsets can be explored through a viewport at different pyramid levels. However, due to the underlying 2D organization, the high-dimensional relationships between all images could not be represented. In order to preserve the complex inter-image relationships we propose to use a hierarchical graph where edges connect related images. By traversing this graph the users may navigate to other similar images. Different visualization and navigation modes are available. Various filters and search tools such as search by example, color, or sketch may be applied. These tools help to narrow down the number of video frames to be inspected or to direct the view to regions of the graph where matching frames are located.

Kai Uwe Barthel, Nico Hezel, Radek Mackowiak

Enhanced Signature-Based Video Browser

The success of our Signature-Based Video Browser presented last year at Video Browser Showdown 2014 (now renamed Video Search Showcase) was mainly based on effective filtering using position-color feature signatures, while browsing in the results comprising matched keyframes was based on just a simple sequential search approach. Since the results can consist of highly similar keyframes (e.g., news studio scenes), which makes browsing more difficult, we have enhanced our tool with more advanced browsing techniques that also handle homogeneous result sets obtained after the filtering phase. Furthermore, we have utilized improved search models based on feature signatures to make the filtering phase more effective.

Adam Blažek, Jakub Lokoč, Filip Matzner, Tomáš Skopal

VERGE: A Multimodal Interactive Video Search Engine

This paper presents the VERGE interactive video retrieval engine, which is capable of searching video content. The system integrates several content-based analysis and retrieval modules such as video shot boundary detection, concept detection, clustering and visual similarity search.

Anastasia Moumtzidou, Konstantinos Avgerinakis, Evlampios Apostolidis, Fotini Markatopoulou, Konstantinos Apostolidis, Theodoros Mironidis, Stefanos Vrochidis, Vasileios Mezaris, Ioannis Kompatsiaris, Ioannis Patras

IMOTION — A Content-Based Video Retrieval Engine

This paper introduces the IMOTION system, a sketch-based video retrieval engine supporting multiple query paradigms. For vector space retrieval, the IMOTION system exploits a large variety of low-level image and video features, as well as high-level spatial and temporal features that can all be jointly used in any combination. In addition, it supports dedicated motion features to allow for the specification of motion within a video sequence. For query specification, the IMOTION system supports query-by-sketch interactions (users provide sketches of video frames), motion queries (users specify motion across frames via partial flow fields), query-by-example (based on images) and any combination of these, and provides support for relevance feedback.

Luca Rossetto, Ivan Giangreco, Heiko Schuldt, Stéphane Dupont, Omar Seddati, Metin Sezgin, Yusuf Sahillioğlu

A Storyboard-Based Interface for Mobile Video Browsing

We present an interface design for video browsing on mobile devices such as tablets that is based on storyboards and optimized with respect to content visualization and interaction design. In particular, we consider scientific results from our previous studies on mobile visualization (e.g., about optimum image sizes) and interaction (e.g., human perception and classification performance for different scrolling gestures) in order to create an interface for intuitive and efficient video content access. Our work aims at verifying if and to what degree optimized small screen designs utilizing touch screen gestures can compete with browsing methods on desktop PCs featuring significantly larger screen estate as well as more sophisticated input devices and interaction modes.

Wolfgang Hürst, Rob van de Werken, Miklas Hoet

Collaborative Browsing and Search in Video Archives with Mobile Clients

A system for collaborative browsing and search within video archives is proposed. It comprises mobile clients and a back-end server. The server is responsible for inter-client communication as well as for archive partitioning according to the population of active clients. The participating clients employ a GUI designed and optimized for Nexus 7 tablets.

Claudiu Cobârzan, Manfred Del Fabro, Klaus Schoeffmann

The Multi-stripe Video Browser for Tablets

We present a prototype for video search and browsing in large video collections optimized for tablets. The content of the videos is organized into sub-shots, which are visualized by frame stripes of different configurations. Moreover, all videos can be filtered by color layout and motion patterns to reduce search effort. An additional overview mode enables the parallel inspection of multiple filtered or unfiltered videos at once. This mode should be both easy to use and still efficient, and therefore well-suited for novice users.

Marco A. Hudelist, Qing Xu

NII-UIT Browser: A Multimodal Video Search System

We introduce an interactive system for searching a known scene in a video database. The key idea is to enable multimodal search. As the database being searched grows larger, individual modalities may not be powerful enough to discriminate a scene from other near duplicates. In our system, a known scene can be described and searched by its visual cues or audio genres. Templates are given for users to rapidly and exactly describe the scene. Moreover, search results are updated instantly as users change the description. As a result, users can generate a large number of possible queries to find the matched scene in a short time.

Thanh Duc Ngo, Vinh-Tiep Nguyen, Vu Hoang Nguyen, Duy-Dinh Le, Duc Anh Duong, Shin’ichi Satoh

Interactive Known-Item Search Using Semantic Textual and Colour Modalities

In this paper, we present an interactive video browser tool for our participation in the fourth video search showcase event. Learning from previous experience, this year we focused on building an advanced interactive interface which allows users to quickly generate and combine different styles of query to find relevant video segments. The system offers the user a comprehensive search interface which has as key features: keyword search, color-region search and human face filtering.

Zhenxing Zhang, Rami Albatal, Cathal Gurrin, Alan F. Smeaton


ImageMap - Visually Browsing Millions of Images

In this paper we showcase ImageMap - an image browsing system to visually explore and search millions of images from stock photo agencies and the like. As with map services like Google Maps, users may navigate through multiple image layers by zooming and dragging. Zooming in (or out) shows more (or less) similar images from lower (or higher) levels. Dragging the view shows related images from the same level. Layers are organized as an image pyramid which is built using image sorting and clustering techniques. Easy image navigation is achieved because the placement of the images in the pyramid is based on an improved fused similarity calculation using visual and semantic image information. Our system also allows users to perform searches. After starting an image search the user is automatically directed to a region with suitable results. This paper describes how to efficiently construct an easily navigable image pyramid even if the total number of images is huge.

Kai Uwe Barthel, Nico Hezel, Radek Mackowiak

Dynamic Hierarchical Visualization of Keyframes in Endoscopic Video

The after-inspection of endoscopic surgeries can be a tedious and time consuming task. Physicians have to search for important segments in the video recording of an intervention, which may have a duration of several hours. Automatically selected keyframes can support physicians in this task. The problem is that either too few keyframes are selected, missing some important information, or too many keyframes are selected, which overwhelms the user. Furthermore, keyframes of endoscopic videos typically show highly similar content. It is hence difficult to keep track of the temporal context of selected keyframes if they are presented in a grid view. To overcome these limitations, we present a dynamic hierarchical browsing technique for large sets of keyframes that preserves the temporal context in the visualization of the frames.

Jakub Lokoč, Klaus Schoeffmann, Manfred del Fabro

Facial Aging Simulator by Data-Driven Component-Based Texture Cloning

Facial aging and rejuvenation simulation is a challenging topic because keeping personal characteristics at every age is a difficult problem. In this demonstration, we simulate facial aging and rejuvenation from only a single photo. Our system alters an input face image into an aged face by reconstructing every facial component using a face database for the target age. Appropriate facial component images are selected by a special similarity measurement between the current age and the target age to keep personal characteristics as much as possible. Our system successfully generated aged/rejuvenated faces with age-related features such as spots, wrinkles, and sagging while keeping personal characteristics throughout all ages.

Daiki Kuwahara, Akinobu Maejima, Shigeo Morishima

Affective Music Recommendation System Based on the Mood of Input Video

We present an affective music recommendation system that fits music to an input video without textual information. Music that matches our current environmental mood can deepen an impression. However, it is not easy to know which music in a huge music database best matches our present mood, so we often select a well-known popular song repeatedly regardless of the present mood. In this paper, we analyze a video sequence that represents the current mood and recommend appropriate music that affects the current mood. Our system matches an input video with music using the valence-arousal plane, which is an emotional plane.

Shoto Sasaki, Tatsunori Hirai, Hayato Ohya, Shigeo Morishima
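
As a rough illustration of matching on the valence-arousal plane, the sketch below picks the track whose mood point lies closest to the video's mood point. The Euclidean distance, the coordinate range, the track names and all values are assumptions made for this example; the abstract does not state how the authors compute the match.

```python
import math

def closest_track(video_mood, tracks):
    """Return the name of the track whose (valence, arousal) point lies
    nearest to the video's point on the valence-arousal plane."""
    return min(tracks, key=lambda name: math.dist(video_mood, tracks[name]))

# Hypothetical mood coordinates in [-1, 1] x [-1, 1]; not taken from the paper.
tracks = {
    "calm_piano": (0.4, -0.6),
    "upbeat_pop": (0.8, 0.7),
    "dark_ambient": (-0.7, -0.3),
}
video_mood = (0.6, 0.5)  # assumed output of some video mood estimator
best = closest_track(video_mood, tracks)
```

Here `best` is the track nearest to the video in both valence and arousal; a real system would first have to estimate both coordinates from the video and audio signals.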

MemLog, an Enhanced Lifelog Annotation and Search Tool

As of very recently, we have observed a convergence of technologies that have led to the emergence of lifelogging as a potentially pervasive technology with many real-world use cases. While it is becoming easier to gather massive lifelog data archives with wearable cameras and sensors, there are still challenges in developing effective retrieval systems. One such challenge is in gathering annotations to support user access or machine learning tasks in an effective and efficient manner. In this work, we demonstrate a web-based annotation system for sensory and visual lifelog data and show it in operation on a large archive of nearly 1 million lifelog images and 27 semantic concepts in 4 categories.

Lijuan Marissa Zhou, Brian Moynagh, Liting Zhou, TengQi Ye, Cathal Gurrin

Software Solution for HEVC Encoding and Decoding

In this demonstration, we showcase a complete software encoding and decoding solution for the new High Efficiency Video Coding (HEVC) standard. The encoder is optimized for x86 processors using SSE instruction set extension and multi-thread technology, and achieves high efficiency at a significantly reduced computation load. We have integrated the encoder library into the widely-used media framework FFmpeg and developed transcoding and recording applications for HEVC. The decoder is highly optimized for both x86 and ARM architecture. With novel single-instruction-multiple-data (SIMD) algorithms and a frame-based parallel framework for multi-core CPUs, decoding speed of 46FPS for 1080p videos on ARM Cortex-A9 1.5GHz dual-core processor and 75FPS for 4K (3840x2160) videos on Intel i7-2600 3.4GHz quad-core processor can be achieved. We have also integrated the decoder library into FFmpeg and made an Android video player based on that. The software solution can well meet the demand of producing and watching HEVC videos on existing devices, showing promising future of HEVC applications.

Shengbin Meng, Jun Sun, Zongming Guo

A Surveillance Video Index and Browsing System Based on Object Flags and Video Synopsis

This paper demonstrates a novel retrieval and browsing system based on moving objects for surveillance video. With the spread of digital video surveillance, massive and ever-increasing volumes of data are involved. How to employ surveillance videos effectively and efficiently is strategically important in practical applications. In order to improve the availability of videos, intelligent applications include object extraction, video indexing, video retrieval, and fast browsing. Specifically, this system includes two retrieval browsing sub-systems: (1) retrieval browsing based on moving objects, which achieves “browsing with object storage” and “browsing with object classification”; (2) retrieval browsing based on video synopsis, which achieves “browsing with playback synopsis” and “browsing with customized synopsis”. As shown in the demos, video index and synopsis browsing can be flexibly and efficiently realized in this system.

Gensheng Ye, Wenjuan Liao, Jichao Dong, Dingheng Zeng, Huicai Zhong

A Web Portal for Effective Multi-model Exploration

During the last decades, various similarity models have emerged that are suitable for specific similarity search tasks. In this paper, we present a web-based portal that combines two popular similarity models (based on feature signatures and SURF descriptors) in order to improve the recall of multimedia exploration. Compared to a single-model approach, we demonstrate in a game-like fashion that a multi-model approach could provide users with more diverse and still relevant results.

Tomáš Grošup, Přemysl Čech, Jakub Lokoč, Tomáš Skopal

Wearable Cameras for Real-Time Activity Annotation

Google Glass has the potential to be a real-time data capture and annotation tool. With professional sports as a use case, we present a platform which helps a football coach capture and annotate interesting events using Google Glass. In our implementation, an interesting event is indicated by a predefined hand gesture or motion, and our platform can automatically detect these gestures in a video without training any classifier. Three event detectors are examined and our experiment shows that the detector with combined edgeness and color moment features gives the best detection performance.

Jiang Zhou, Aaron Duane, Rami Albatal, Cathal Gurrin, Dag Johansen

Personal (Big) Data Modeling for Information Access & Retrieval

Making Lifelogging Usable: Design Guidelines for Activity Trackers

Of all lifelogging tools, activity trackers are probably among the most widely used and receive the most public attention. However, when used on a long-term basis, e.g. for prevention and wellbeing, the devices’ acceptance by the user and their usability become critical issues. In a user study we explored how activity trackers are used and experienced in daily life. We identified critical issues with regard not just to the HCI topics wearability, appearance of the device, and display and interaction, but also to aspects of modeling and describing the measured and presented data. We suggest four guidelines for the design of future activity trackers. Ideally, activity tracking would be fulfilled by a modular concept of building blocks for sensing, interaction and feedback that the user can freely combine, distribute and wear according to personal preferences and situations.

Jochen Meyer, Jutta Fortmann, Merlin Wasmann, Wilko Heuten

Towards Consent-Based Lifelogging in Sport Analytic

Lifelogging is becoming widely deployed outside the scope of solipsistic self quantification. In elite sport, the ability to utilize these digital footprints of athletes for sport analytic has already become a game changer. This raises privacy concerns regarding both the individual lifelogger and the bystanders inadvertently captured by increasingly ubiquitous sensing devices. This paper describes a lifelogging model for consented use of personal data for sport analytic. The proposed model is a stepping stone towards understanding how privacy-preserving lifelogging frameworks and run-time systems can be constructed.

Håvard Johansen, Cathal Gurrin, Dag Johansen

A Multi-Dimensional Data Model for Personal Photo Browsing

Digital photo collections, whether personal, professional, or social, have been growing ever larger, leaving users overwhelmed. It is therefore increasingly important to provide effective browsing tools for photo collections. Learning from the resounding success of multi-dimensional analysis (MDA) in the business intelligence community for On-Line Analytical Processing (OLAP) applications, we propose a multi-dimensional model for media browsing, called M³, that combines MDA concepts with concepts from faceted browsing. We present the data model and describe preliminary evaluations, made using server and client prototypes, which indicate that users find the model useful and easy to use.

Björn Þór Jónsson, Grímur Tómasson, Hlynur Sigurþórsson, Áslaug Eiríksdóttir, Laurent Amsaleg, Marta Kristín Lárusdóttir

Discriminative Regions: A Substrate for Analyzing Life-Logging Image Sequences

Life-logging devices are becoming ubiquitous, yet processing and extracting information from the vast amount of data being captured remains a very challenging task. We propose a method to find discriminative regions, which we define as regions that are salient, consistent, repetitive and discriminative. We explain our fast and novel algorithm to discover the discriminative regions and show different applications for discriminative regions such as summarization, classification and image search. Our experiments show that our algorithm is able to find discriminative regions and discriminative patches in a short time and achieves strong results on our life-logging SenseCam dataset.

Mohammad Moghimi, Jacqueline Kerr, Eileen Johnson, Suneeta Godbole, Serge Belongie

Fast Human Activity Recognition in Lifelogging

This paper addresses the problem of fast Human Activity Recognition (HAR) in visual lifelogging. We identify the importance of visual features related to HAR and we specifically evaluate the HAR discrimination potential of Colour Histograms and Histogram of Oriented Gradients. In our evaluation we show that colour can be a low-cost and effective means of HAR when performing single-user classification. It is also noted that, while much more efficient, global image descriptors perform as well as or better than local descriptors in our HAR experiments. We believe that both of these findings are due to the fact that a user’s lifelog is rich in reoccurring scenes and environments.

Stefan Terziyski, Rami Albatal, Cathal Gurrin
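
To make the idea of a low-cost global colour descriptor concrete, the following sketch builds a coarse RGB histogram per image and assigns the label of the most similar labelled image. The bin count, the histogram-intersection similarity, the nearest-neighbour rule and the toy pixel data are all assumptions chosen for illustration; the abstract does not specify the classifier the authors used.

```python
from collections import Counter

def colour_histogram(pixels, bins=4):
    """Quantise each 8-bit RGB channel into `bins` levels and return a
    normalised histogram over the resulting colour cells."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {cell: n / total for cell, n in counts.items()}

def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint."""
    return sum(min(v, h2.get(cell, 0.0)) for cell, v in h1.items())

def classify(query_pixels, labelled_images):
    """Return the label of the labelled image most similar to the query."""
    q = colour_histogram(query_pixels)
    best = max(labelled_images,
               key=lambda item: intersection(q, colour_histogram(item[1])))
    return best[0]

# Toy "images" given as flat lists of RGB pixels (hypothetical data).
labelled = [
    ("cooking", [(200, 50, 40)] * 8 + [(90, 90, 90)] * 2),
    ("driving", [(30, 30, 200)] * 10),
]
query = [(210, 40, 30)] * 9 + [(20, 20, 220)]
label = classify(query, labelled)
```

Because a single user tends to revisit the same scenes, even such a coarse whole-image descriptor can separate activities that take place in visually distinct environments.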

Social Geo-Media Analytics and Retrieval

Iron Maiden While Jogging, Debussy for Dinner? An Analysis of Music Listening Behavior in Context

Contextual information of the listener is only slowly being integrated into music retrieval and recommendation systems. Given the enormous rise in mobile music consumption and the many sensors integrated into today’s smart-phones, at the same time, an unprecedented source for user context data of different kinds is becoming available.

Equipped with a smart-phone application, which had been developed to monitor contextual aspects of users when listening to music, we collected contextual data of listening events for 48 users. About 100 different user features, in addition to music meta-data have been recorded.

In this paper, we analyze the relationship between aspects of the user context and music listening preference. The goals are to assess (i) whether user context factors allow predicting the song, artist, mood, or genre of a listened track, and (ii) which contextual aspects are most promising for an accurate prediction. To this end, we investigate various classifiers to learn relations between user context aspects and music meta-data. We show that the user context allows predicting artist and genre to some extent, but can hardly be used for song or mood prediction. Our study further reveals that the level of listening activity has little influence on the accuracy of predictions.

Michael Gillhofer, Markus Schedl

Travel Recommendation via Author Topic Model Based Collaborative Filtering

While automatic travel recommendation has attracted a lot of attention, the existing approaches generally suffer from different kinds of weaknesses. For example, the sparsity problem can significantly degrade the performance of traditional collaborative filtering (CF). If a user visits only very few locations, accurately identifying similar users becomes very challenging due to the lack of sufficient information. Motivated by this concern, we propose an Author Topic Collaborative Filtering (ATCF) method to facilitate comprehensive Points of Interest (POIs) recommendation for social media users. In our approach, the topics about user preference (e.g., cultural, cityscape, or landmark) are extracted from the textual description of photos by an author topic model instead of from GPS (geo-tag). Consequently, unlike CF based approaches, even without GPS records, similar users can still be identified accurately according to the similarity of users’ topic preferences. In addition, ATCF doesn’t pre-define the category of travel topics. The category and user topic preference can be elicited simultaneously. Experimental results with a large test collection demonstrate various advantages of our approach.

Shuhui Jiang, Xueming Qian, Jialie Shen, Tao Mei
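
The idea of finding similar users from topic preferences rather than from shared GPS records can be sketched as follows. The topic vectors, user names and visited POIs are hypothetical, and cosine similarity is an assumed choice of measure; the author topic model that would produce such vectors is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-preference vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(target, topics, visited, k=1):
    """Suggest POIs visited by the k users most similar to `target` in topic
    preference, excluding places the target has already visited."""
    others = [u for u in topics if u != target]
    neighbours = sorted(others,
                        key=lambda u: cosine(topics[target], topics[u]),
                        reverse=True)[:k]
    suggestions = set()
    for u in neighbours:
        suggestions |= visited[u] - visited[target]
    return sorted(suggestions)

# Hypothetical preferences over (cultural, cityscape, landmark) topics.
topics = {
    "ann": [0.7, 0.2, 0.1],
    "bob": [0.6, 0.3, 0.1],
    "cat": [0.1, 0.1, 0.8],
}
visited = {
    "ann": {"museum"},
    "bob": {"museum", "cathedral"},
    "cat": {"skyline"},
}
suggested = recommend("ann", topics, visited)
```

Note that the similarity is computed without looking at `visited` at all, which is why a user with very few recorded locations can still be matched to like-minded users.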

Robust User Community-Aware Landmark Photo Retrieval

Given a query photo characterizing a location-aware landmark shot by a user, landmark retrieval is about returning a set of photos ordered by their similarity to the query photo. Existing studies on landmark retrieval focus on exploiting location-aware visual features or attributes to conduct a matching process between candidate images and a query image. However, these approaches are based on the hypothesis that a landmark of interest is well-captured and distinctive enough to be distinguished from others. In fact, distinctive landmarks may be captured in a biased way due to bad viewpoints or angles. This degrades the recognition results if a biased query photo is issued. In this paper, we present a novel approach towards landmark retrieval by exploiting the dimension of user community. Our approach consists of three steps. First, we extract communities based on user interest, which can characterize a group of users in terms of their social media activities such as user-generated contents/comments. Then, a group of photos that are recommended by the community to which the query user belongs, together with the query photo, can constitute a set of multiple queries. Finally, a pattern mining algorithm is presented to discover regular landmark-specific patterns from this multi-query set. These patterns can faithfully represent the characteristics of a landmark of interest. Experiments conducted on benchmarks show the effectiveness of our approach.

Lin Wu, John Shepherd, Xiaodi Huang, Chunzhi Hu

Cross-Domain Concept Detection with Dictionary Coherence by Leveraging Web Images

We propose a novel scheme to address video concept learning by leveraging social media, one that includes the selection of web training data and the transfer of subspace learning within a unified framework. Due to the existence of cross-domain incoherence resulting from the mismatch of data distributions, how to select sufficient positive training samples from scattered and diffused social media resources is a challenging problem in the training of effective concept detectors. In this paper, given a concept, the coherent positive samples from web images for further concept learning are selected based on the degree of image coherence. Then, by exploiting both the selected dataset and video keyframes, we train a robust concept classifier by means of a transfer subspace learning method. Experiment results demonstrate that the proposed approach can achieve constant overall improvement despite cross-domain incoherence.

Yongqing Sun, Kyoko Sudo, Yukinobu Taniguchi

Semantic Correlation Mining between Images and Texts with Global Semantics and Local Mapping

This paper proposes a novel approach for the modeling of semantic correlation between web images and texts. Our approach contains two processes of semantic correlation computing. One is to find the local media objects (LMOs), the components composing text (or image) documents, that match the global semantics of a given image (or text) document based on probabilistic latent semantic analysis (PLSA); the other is to make a direct mapping among LMOs with graph-based learning, with the LMOs obtained from PLSA as a part of the inputs. The two cooperating processes consider both dominant semantics and local subordinate parts of heterogeneous data. Finally, we compute the similarity between the obtained LMOs and a whole document of the same modality and then get the semantic correlation between textual and visual documents. Experimental results demonstrate the effectiveness of the proposed approach.

Jiao Xue, Youtian Du, Hanbing Shui

Image Taken Place Estimation via Geometric Constrained Spatial Layer Matching

In recent years, estimating the locations of images has received a lot of attention and plays a role in application scenarios for large geo-tagged image corpora. For images that are not geographically tagged, we can estimate their locations with the help of a large geo-tagged image set using a visual mining based approach. In this paper, we propose an image location estimation approach based on global feature clustering and local feature refinement. Firstly, global feature clustering is utilized. We further treat each cluster as a single observation. Next, we mine the relationship between each image cluster and locations offline. By cluster selection online, several refined locations likely to be related to an input image are pre-selected. Secondly, we localize the input image by local feature matching, which utilizes the “SIFT” descriptor extracted from the refined images. In this process, “spatial layers of visual word” (SLW) is built as an extension of the unorganized bag-of-words image representation. Experiments show the effectiveness of our proposed approach.

Yisi Zhao, Xueming Qian, Tingting Mu

Image or Video Processing, Semantic Analysis, and Understanding

Recognition of Meaningful Human Actions for Video Annotation Using EEG Based User Responses

To provide interesting videos, it is important to generate relevant tags and annotations that describe the whole video or its segments efficiently. Because generating annotations and tags is a time-consuming process, it is essential to analyze videos without human intervention. Although there have been many studies of implicit human-centered tagging using bio-signals, most of them focus on affective tagging and tag relevance assessment. This paper proposes binary and unary classification models that recognize actions meaningful to users in videos, for example jumps in the figure skating program, using EEG features of band power (BP) values and asymmetry scores (AS). As a result, the binary and unary classification models achieved the best balanced accuracies of 52.86% and 50.06% respectively. The binary classification models showed high specificity on non-jump actions and the unary classification models showed high sensitivity on jump actions.

Jinyoung Moon, Yongjin Kwon, Kyuchang Kang, Changseok Bae, Wan Chul Yoon
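
For readers unfamiliar with the metric, balanced accuracy is the mean of sensitivity and specificity, so a score near 50% is close to chance for a two-class problem even when plain accuracy looks high. The sketch below uses the standard definition; the confusion-matrix counts are invented for illustration and are not results from the paper.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on positives) and specificity
    (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def accuracy(tp, fn, tn, fp):
    """Plain accuracy, which can be inflated by the majority class."""
    return (tp + tn) / (tp + fn + tn + fp)

# Hypothetical counts: few "jump" segments (positives), many "non-jump" ones.
counts = dict(tp=5, fn=5, tn=80, fp=10)
bal = balanced_accuracy(**counts)
acc = accuracy(**counts)
```

With these made-up counts the plain accuracy is noticeably higher than the balanced accuracy, which is why the balanced figure is the more informative one for rare events such as jumps.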

Challenging Issues in Visual Information Understanding Researches

Visual information understanding is known as one of the most difficult and challenging problems in the realization of machine intelligence. This paper presents research issues and an overview of the current state of the art in the general flow of visual information understanding. In general, the first stage of visual understanding starts from object segmentation. Using a saliency map based on a human visual attention model is one of the most promising methods for object segmentation. The next step is scene understanding by analyzing semantics between objects in a scene. This stage finds a description of image data as formatted text. The third step requires space understanding and context awareness using multi-view analysis. This step helps to solve the general occlusion problem very easily. The final stage is time series analysis of scenes and a space. After this stage, we can obtain visual information from a scene, a series of scenes, and space variations. Various technologies for visual understanding have already been tried and some of them are mature. Therefore, we need to leverage and integrate those techniques properly from the perspective of higher visual information understanding.

Kyuchang Kang, Yongjin Kwon, Jinyoung Moon, Changseok Bae

Emotional Tone-Based Audio Continuous Emotion Recognition

Understanding human emotions in natural communication is still a challenging problem to be solved in human-computer interaction. The emotional tone that people feel within a period of time can affect the way they communicate with others or with the environment. In this paper, a new emotional tone-based two-stage algorithm for continuous emotion recognition from audio signals is presented. Gaussian mixture models of hidden Markov models (GMM-HMMs) are employed to infer the dimensional emotional tone and affect labels. Two emotional tones, positive or negative, which represent the overall emotion state over an audio clip, are first obtained. Then, based on that emotional tone, the corresponding positive or negative GMM-HMM classifier is refined to finish the continuous emotion recognition. The experimental results show that our method outperforms the GMM-HMM and SVR baselines for the Audio-Visual Emotion Challenge (AVEC 2014) database [1].

Mengmeng Liu, Hui Chen, Yang Li, Fengjun Zhang

A Computationally Efficient Algorithm for Large Scale Near-Duplicate Video Detection

Large scale near-duplicate video detection is very desirable for web video processing, and computational efficiency is essential for practical applications. In this paper, we present a computationally efficient algorithm based on multi-layer video content analysis. Local features are extracted from key frames of videos and indexed by a novel adaptive locality sensitive hashing scheme. By learning several parameters, fast retrieval in the new hashing structure is performed without high dimensional distance computations and achieves better real-time retrieval performance compared with other state-of-the-art approaches. Then a descriptor filtering method and a two-level matching scheme are applied to generate a relevance score for detection. Experiments on near-duplicate video detection tasks including various transformed videos demonstrate the efficiency gains of the proposed algorithm.

Dawei Liu, Zhihua Yu
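
The sketch below shows plain random-hyperplane locality sensitive hashing, not the adaptive scheme proposed in the paper, to illustrate why hashing avoids high-dimensional distance computations at query time: a lookup is a handful of dot products followed by a dictionary access. The function names, the descriptor values and the number of bits are assumptions made for this example.

```python
import random

def make_hasher(dim, n_bits, seed=0):
    """Build a signature function from `n_bits` random hyperplanes."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def signature(vec):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(int(sum(p * x for p, x in zip(plane, vec)) >= 0)
                     for plane in planes)
    return signature

def build_index(descriptors, signature):
    """Group descriptor ids into buckets keyed by their bit signature."""
    index = {}
    for key, vec in descriptors.items():
        index.setdefault(signature(vec), []).append(key)
    return index

def candidates(query, index, signature):
    """Return ids sharing the query's bucket; no distances are computed."""
    return index.get(signature(query), [])

# Hypothetical 4-dimensional key-frame descriptors.
descriptors = {
    "frame_a": [1.0, 0.5, -0.2, 0.3],
    "frame_b": [-1.0, -0.5, 0.2, -0.3],
}
signature = make_hasher(dim=4, n_bits=8)
index = build_index(descriptors, signature)
# A brightness-scaled copy of frame_a points in the same direction.
hits = candidates([2.0, 1.0, -0.4, 0.6], index, signature)
```

Vectors pointing in similar directions tend to share signatures, so only the few candidates in the matching bucket need a detailed comparison afterwards.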

SLOREV: Using Classical CAD Techniques for 3D Object Extraction from Single Photo

While the perception of an object from a single image is hard for machines, it is a much easier task for humans since humans often have prior knowledge about the underlying nature of the object. Considerable work has recently been done on the combination of human perception with machines’ computational capability to solve some ill-posed problems such as 3D reconstruction from single image. In this work we present SLOREV (Sweep-Loft-Revolve), a novel method for modeling 3D objects using 2D shape snapping and traditional computer-aided design techniques. The user assists recognition and reconstruction by choosing, drawing and placing specific 2D shapes. The machine then snaps the shapes to the automatically detected contour lines, calculates their orientations in 3D space, and constructs the original 3D objects following classical CAD methods.

Pan Hu, Hongming Cai, Fenglin Bu

Hessian Regularized Sparse Coding for Human Action Recognition

With the rapid increase of online videos, recognition and search in videos have become a new trend in multimedia computing. Action recognition in videos has thus drawn intensive research interest recently. Meanwhile, sparse representation has become a state-of-the-art solution in computer vision because it has several advantages for data representation, including easy interpretation, quick indexing and considerable connection with biological vision. One prominent sparse representation algorithm is Laplacian regularized sparse coding (LaplacianSC). However, LaplacianSC biases the results toward a constant and thus results in poor generalization. In this paper, we propose Hessian regularized sparse coding (HessianSC) for action recognition. In contrast to LaplacianSC, HessianSC can well preserve the local geometry and steer the sparse coding to vary linearly along the manifold of the data distribution. We also present a fast iterative shrinkage-thresholding algorithm (FISTA) for HessianSC. Extensive experiments on the human motion database (HMDB51) demonstrate that HessianSC significantly outperforms LaplacianSC and the traditional sparse coding algorithm for action recognition.

Weifeng Liu, Zhen Wang, Dapeng Tao, Jun Yu

Robust Multi-label Image Classification with Semi-Supervised Learning and Active Learning

Most existing work on multi-label learning has focused on supervised learning, which requires manually annotated samples, a process that is labor-intensive, time-consuming and costly. To address this problem, we present a novel method that incorporates active learning into semi-supervised learning for multi-label image classification. Moreover, to address the curse of dimensionality in high-dimensional data, we explore a dimensionality reduction technique with a non-negative sparseness constraint to extract a group of features that can completely describe the data and hence make the learning model more efficient. Experimental results on common data sets validate that the proposed algorithm is relatively effective in improving the performance of the learner in multi-label classification, and that the obtained learner is reliable and robust after dimensionality reduction using NNS-DR (Non-Negative Sparseness for Dimensionality Reduction).

Fuming Sun, Meixiang Xu, Xiaojun Jiang

Photo Quality Assessment with DCNN that Understands Image Well

Photo quality assessment from the view of human aesthetics, which tries to classify images into the categories of good and bad, has drawn a lot of attention in the computer vision field. Up to now, experts have proposed many methods to deal with this problem. Most of those methods are based on the design of hand-crafted features. However, due to the complexity and subjectivity of human aesthetic activities, it is difficult to describe and model all the factors that affect photo aesthetic quality. Therefore those methods achieve only limited success. On the other hand, deep convolutional neural networks have proved to be effective in many computer vision problems and do not need human effort in the design of features. In this paper, we adopt a deep convolutional neural network that “understands” images well to conduct photo aesthetic quality assessment. Firstly, we implement a deep convolutional neural network which has eight layers and millions of parameters. Then, to “teach” this network enough knowledge about images, we train it on ImageNet, which is one of the largest available image databases. Next, for each given image, we take the activations of the last layer of the neural network as its aesthetic feature. The experimental results on two large and reliable image aesthetic quality assessment datasets demonstrate the effectiveness of our method.

Zhe Dong, Xu Shen, Houqiang Li, Xinmei Tian

Non-negative Low-Rank and Group-Sparse Matrix Factorization

Non-negative matrix factorization (NMF) has been a popular data analysis tool and has been widely applied in computer vision. However, conventional NMF methods cannot adaptively learn grouping structure from a dataset. This paper proposes a non-negative low-rank and group-sparse matrix factorization (NLRGS) method to overcome this deficiency. Particularly, NLRGS captures the relationships among examples by constraining the rank of the coefficients, while identifying the grouping structure via group sparsity regularization. With both constraints, NLRGS boosts NMF in both classification and clustering. However, NLRGS is difficult to optimize because it must deal with the low-rank constraint. To relax this hard constraint, we approximate the low-rank constraint with the nuclear norm and then develop an optimization algorithm for NLRGS in the framework of the augmented Lagrangian method (ALM). Experimental results of both face recognition and clustering on four popular face datasets demonstrate the effectiveness of NLRGS quantitatively.

Shuyi Wu, Xiang Zhang, Naiyang Guan, Dacheng Tao, Xuhui Huang, Zhigang Luo

Two-Dimensional Euler PCA for Face Recognition

Principal component analysis (PCA) projects data on the directions with maximal variances. Since PCA is quite effective in dimension reduction, it has been widely used in computer vision. However, conventional PCA suffers from the following deficiencies: 1) it incurs high computational cost when handling high-dimensional data, and 2) it cannot reveal the nonlinear relationship among different features of data. To overcome these deficiencies, this paper proposes an efficient two-dimensional Euler PCA (2D-ePCA) algorithm. Particularly, 2D-ePCA learns a projection matrix on the 2D pixel matrix of each image without reshaping it into a long 1D vector, and uncovers nonlinear relationships among features by mapping data onto a complex representation. Since such a 2D complex representation induces a much smaller kernel matrix and principal subspaces, 2D-ePCA incurs much less computational overhead than Euler PCA on large-scale datasets. Experimental results on popular face datasets show that 2D-ePCA outperforms the representative algorithms in terms of accuracy, computational overhead, and robustness.

Huibin Tan, Xiang Zhang, Naiyang Guan, Dacheng Tao, Xuhui Huang, Zhigang Luo
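
The "complex representation" can be illustrated with the Euler mapping commonly described for Euler PCA, which sends a pixel intensity x in [0, 1] to z = exp(i·α·π·x)/√2. The abstract itself does not give this formula, so the sketch below is an assumption about the mapping, and the value of `alpha` and the function names are illustrative only. Under this mapping the squared distance between two mapped pixels equals 1 − cos(α·π·(x − y)), which stays bounded however large the intensity difference is.

```python
import cmath
import math

def euler_map(x, alpha=1.9):
    """Map an intensity x in [0, 1] to a point on a circle in the complex
    plane (assumed form of the Euler representation)."""
    return cmath.exp(1j * alpha * math.pi * x) / math.sqrt(2)

def euler_dissimilarity(x, y, alpha=1.9):
    """Squared distance between two mapped intensities."""
    return abs(euler_map(x, alpha) - euler_map(y, alpha)) ** 2

def cosine_form(x, y, alpha=1.9):
    """Closed form of the same quantity: 1 - cos(alpha * pi * (x - y))."""
    return 1.0 - math.cos(alpha * math.pi * (x - y))

d_small = euler_dissimilarity(0.50, 0.55)   # similar pixels
d_large = euler_dissimilarity(0.20, 0.90)   # very different pixels
```

Every mapped pixel has modulus 1/√2, so a single corrupted pixel can contribute at most 2 to the squared distance between two images. This bound is generally cited as the reason such representations are robust to outliers.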

Multiclass Boosting Framework for Multimodal Data Analysis

A large number of multimedia documents containing texts and images have appeared on the internet; hence cross-modal retrieval, in which the modality of a query is different from that of the retrieved results, is becoming an interesting search paradigm. In this paper, a multimodal multiclass boosting framework (MMB) is proposed to capture intra-modal semantic information and inter-modal semantic correlation. Unlike traditional boosting methods, which are confined to two classes or a single modality, MMB can simultaneously deal with multimodal data. The empirical risk, which takes both intra-modal and inter-modal losses into account, is designed and then minimized by gradient descent in the multidimensional functional spaces. More specifically, the optimization problem is solved in turn for each modality. A semantic space can be naturally attained by applying the sigmoid function to the quasi-margins. Extensive experiments on the Wiki and NUS-WIDE datasets show that our method significantly outperforms existing approaches for cross-modal retrieval.

Shixun Wang, Peng Pan, Yansheng Lu, Sheng Jiang

