
2014 | Book

MultiMedia Modeling

20th Anniversary International Conference, MMM 2014, Dublin, Ireland, January 6-10, 2014, Proceedings, Part I

Edited by: Cathal Gurrin, Frank Hopfgartner, Wolfgang Hurst, Håvard Johansen, Hyowon Lee, Noel O’Connor

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

The two-volume set LNCS 8325 and 8326 constitutes the thoroughly refereed proceedings of the 20th Anniversary International Conference on Multimedia Modeling, MMM 2014, held in Dublin, Ireland, in January 2014. The 46 revised regular papers, 11 short papers and 9 demonstration papers were carefully reviewed and selected from 176 submissions. 28 special session papers and 6 papers from the Video Browser Showdown workshop are also included in the proceedings. The papers in these two volumes cover a diverse range of topics, including: applications of multimedia modelling, interactive retrieval, image and video collections, 3D and augmented reality, temporal analysis of multimedia content, and compression and streaming. Special session papers cover the following topics: Mediadrom: artful post-TV scenarios, MM analysis for surveillance video and security applications, 3D multimedia computing and modeling, social geo-media analytics and retrieval, and multimedia hyperlinking and retrieval.

Table of Contents

Frontmatter

Interactive Indexing and Retrieval

A Comparative Study on the Use of Multi-label Classification Techniques for Concept-Based Video Indexing and Annotation

Exploiting concept correlations is a promising way for boosting the performance of concept detection systems, aiming at concept-based video indexing or annotation. Stacking approaches, which can model the correlation information, appear to be the most commonly used techniques to this end. This paper performs a comparative study and proposes an improved way of employing stacked models, by using multi-label classification methods in the last level of the stack. The experimental results on the TRECVID 2011 and 2012 semantic indexing task datasets show the effectiveness of the proposed framework compared to existing works. In addition to this, as part of our comparative study, we investigate whether the evaluation of concept detection results at the level of individual concepts, as is typically the case in the literature, is appropriate for assessing the usefulness of concept detection results in both video indexing applications and in the somewhat different problem of video annotation.

Fotini Markatopoulou, Vasileios Mezaris, Ioannis Kompatsiaris
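
To illustrate the general stacking idea described above (not the authors' exact pipeline), a second-level multi-label learner can be trained on the score vectors produced by independent per-concept detectors. The sketch below uses scikit-learn's classifier chains as one example of a multi-label method at the last level of the stack; all names and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

def train_stacked_detector(first_level_scores, labels):
    """Stack a multi-label model on top of per-concept detector outputs.

    first_level_scores : (n_shots, n_concepts) scores from independent
                         concept detectors (the first level of the stack)
    labels             : (n_shots, n_concepts) binary ground-truth matrix
    """
    # The chain models concept correlations by feeding earlier concept
    # predictions in as extra features for later concepts.
    model = ClassifierChain(LogisticRegression(max_iter=1000))
    model.fit(first_level_scores, labels)
    return model
```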
Coherence Analysis of Metrics in LBP Space for Interactive Face Retrieval

Interactive retrieval is a useful model for multimedia retrieval applications in which no query target is available. The goodness of such a model relies on high coherence between human and machine cognition of the retrieval task at hand. In this paper, we perform a coherence analysis for interactive face retrieval and explore the influence of metrics on human and machine face recognition in the Local Binary Pattern (LBP) feature space. Using collected real user feedback, we draw several new conclusions about the unbalanced coherence distribution model and propose an improved correntropy metric that leads to improved coherence and fast retrieval.

Yuchun Fang, Ying Tan, Chanjuan Yu
A Hybrid Machine-Crowd Approach to Photo Retrieval Result Diversification

In this paper we address the issue of optimizing current social photo retrieval technology in terms of users’ requirements. Typical users are interested in obtaining images that are accurately relevant to the query and non-redundant, so they can build a correct and exhaustive perception of the query. We propose to tackle this issue by combining two approaches previously considered non-overlapping: machine image analysis for a pre-filtering of the initial query results, followed by crowd-sourcing for a final refinement. In this mechanism, the machine part plays the role of reducing time and resource consumption, allowing better crowd-sourcing results. The machine technique ensures representativeness by re-ranking all images according to the most common image in the initial noisy set; additionally, diversity is ensured by clustering the images and selecting the best ranked images among the most representative in each cluster. Further, the crowd-sourcing part enforces both representativeness and diversity in images, objectives that are, to a certain extent, out of reach of the automated machine technique alone. The mechanism was validated on more than 25,000 photos retrieved from several common social media platforms, proving the efficiency of this approach.

Anca-Livia Radu, Bogdan Ionescu, María Menéndez, Julian Stöttinger, Fausto Giunchiglia, Antonella De Angeli
Visual Saliency Weighting and Cross-Domain Manifold Ranking for Sketch-Based Image Retrieval

A Sketch-Based Image Retrieval (SBIR) algorithm compares a line-drawing sketch with images. The comparison is made difficult by image background clutter: a query sketch includes only an object of interest, while database images also contain background clutter. In addition, the variability of hand-drawn sketches, due to “stroke noise” such as disconnected and/or wobbly lines, also makes the comparison difficult. Our proposed SBIR algorithm compares edges detected in an image with lines in a sketch. To emphasize the presumed object of interest and disregard the background, we employ Visual Saliency Weighting (VSW) of edges in the database image. To effectively compare a sketch containing stroke noise with database images, we employ Cross-Domain Manifold Ranking (CDMR), a manifold-based distance metric learning algorithm. Our experimental evaluation using two SBIR benchmarks showed that the combination of VSW and CDMR significantly improves retrieval accuracy.

Takahiko Furuya, Ryutarou Ohbuchi
A Novel Approach for Semantics-Enabled Search of Multimedia Documents on the Web

We present an analysis of a large corpus of multimedia documents obtained from the web. From this corpus, we have extracted the media assets and the relation information between the assets. In order to conduct our analysis, the assets and relations are represented using a formal ontology. The ontology allows not only for representing the structure of multimedia documents but also for connecting with arbitrary background knowledge on the web. The ontology as well as the analysis serve as the basis for implementing a novel search engine for multimedia documents on the web.

Lydia Weiland, Ansgar Scherp
Video to Article Hyperlinking by Multiple Tag Property Exploration

Showing a video and an article on the same page, as done by official web agencies such as CNN.com and Yahoo!, provides a practical way for convenient information digestion. However, in the absence of accompanying articles, this layout is infeasible for mainstream web video repositories like YouTube. This paper investigates the problem of hyperlinking web videos to relevant articles available on the Web. Given a video, the task is accomplished by firstly identifying its contextual tags (e.g., who is doing what, where and when) and then employing a search based association to relevant articles. Specifically, we propose a multiple tag property exploration (mTagPE) approach to identify contextual tags, where tag relevance, tag clarity and tag correlation are defined and measured by leveraging visual duplicate analyses, online knowledge bases and tag co-occurrence. Then, the identification task is formulated as a random walk along a tag relation graph that smoothly integrates the three properties. The random walk aims at picking up relevant, clear and correlated tags as a set of contextual tags, which is further treated as a query issued to commercial search engines to obtain relevant articles. We have conducted experiments on a large-scale web video dataset. Both objective performance evaluations and subjective user studies show the effectiveness of the proposed hyperlinking. It produces more accurate contextual tags and thus a larger number of relevant articles than other approaches.

Zhineng Chen, Bailan Feng, Hongtao Xie, Rong Zheng, Bo Xu
Rebuilding Visual Vocabulary via Spatial-temporal Context Similarity for Video Retrieval

The Bag-of-visual-Words (BovW) model is one of the most popular visual content representation methods for large-scale content-based video retrieval. The visual words are quantized according to a visual vocabulary, which is generated by clustering visual features (e.g. K-means, GMM, etc.). In principle, two types of errors can occur in the quantization process, referred to as the UnderQuantize and OverQuantize problems. The former causes ambiguities and often leads to false visual content matches, while the latter generates synonyms and may lead to missing true matches. Unlike most state-of-the-art research, which concentrates on enhancing the BovW model by disambiguating the visual words, in this paper we aim to address the OverQuantize problem by incorporating the similarity of the spatial-temporal contexts associated with pair-wise visual words. Visual words with similar context and appearance are assumed to be synonyms. These synonyms in the initial visual vocabulary are then merged to rebuild a more compact and descriptive vocabulary. Our approach was evaluated on the TRECVID2002 and CC_WEB_VIDEO datasets for two typical Query-By-Example (QBE) video retrieval applications. Experimental results demonstrate substantial improvements in retrieval performance over the initial visual vocabulary generated by the BovW model. We also show that our approach can be combined with the state-of-the-art disambiguation method to further improve QBE video retrieval performance.

Lei Wang, Eyad Elyan, Dawei Song
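
As a rough sketch of the synonym test the abstract above describes (the paper's actual context model and merging criterion are not reproduced here), one could flag visual-word pairs whose appearance and spatial-temporal context vectors are both sufficiently similar; the thresholds are illustrative assumptions.

```python
import numpy as np

def find_synonym_candidates(centroids, contexts, t_app=0.9, t_ctx=0.9):
    """Flag visual-word pairs whose cluster centroids (appearance) and
    spatial-temporal context vectors are both similar; such pairs are
    candidates for merging into a single vocabulary entry."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    pairs = []
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if (cosine(centroids[i], centroids[j]) > t_app and
                    cosine(contexts[i], contexts[j]) > t_ctx):
                pairs.append((i, j))
    return pairs
```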
Approximating the Signature Quadratic Form Distance Using Scalable Feature Signatures

The feature signatures in connection with the signature quadratic form distance have become a respected similarity model for effective multimedia retrieval. However, the efficiency of the model is still a challenging task because the signature quadratic form distance has quadratic time complexity in the number of tuples in the feature signatures. In order to reduce the number of tuples, we introduce scalable feature signatures, a new formal framework based on hierarchical clustering that enables the definition of various feature signature reduction techniques. As an example, we use the framework to define a new feature signature reduction technique based on joining of tuples. We experimentally demonstrate that our new feature signature reduction technique can be used to implement more efficient yet effective filter distances approximating the original signature quadratic form distance. We also show that the filter distances using our new feature signature reduction technique significantly outperform the filter distances based on the related maximal component feature signatures.

Jakub Lokoč
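
For reference, the (unreduced) signature quadratic form distance itself is compact; a minimal NumPy sketch, assuming a Gaussian similarity kernel (one common choice), follows.

```python
import numpy as np

def sqfd(c1, w1, c2, w2, alpha=1.0):
    """Signature Quadratic Form Distance between signatures {(c1, w1)}
    and {(c2, w2)}: sqrt(w A w^T) with w = (w1, -w2) and A the pairwise
    similarity matrix over all centroids. Quadratic in the tuple count,
    which is exactly what the reduction techniques above attack."""
    c = np.vstack([c1, c2])
    w = np.concatenate([np.asarray(w1), -np.asarray(w2)])
    sq_dists = np.sum((c[:, None, :] - c[None, :, :]) ** 2, axis=-1)
    A = np.exp(-alpha * sq_dists)          # Gaussian similarity kernel
    return np.sqrt(max(float(w @ A @ w), 0.0))
```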
A Novel Human Action Representation via Convolution of Shape-Motion Histograms

Robust solutions to vision-based human action recognition require effective representations of body shapes and their dynamics. Combining multiple cues in the input space can improve the recognition task. Although conventional methods such as concatenation of feature vectors are straightforward, they may not sufficiently encapsulate the characteristics of an action. Inspired by the success of convolution-based reverb in digital signal processing, we propose a novel method to synergistically combine shape and motion histograms via the convolution operation. The objective is to synthesize an output (action representation) that carries the characteristics of both source inputs (shape and motion). Analysis and experimental results on the Weizmann and KTH datasets show that the resultant feature is more efficient than other hybrid features. Compared to other recent works, the feature we use has a much lower dimension. In addition, our method avoids the need to determine weights manually during feature concatenation.

Teck Wee Chua, Karianto Leman
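
The convolution-based fusion idea can be illustrated in a few lines; this sketch shows only the combination step, with the paper's exact histogram construction and normalization left out.

```python
import numpy as np

def convolve_shape_motion(shape_hist, motion_hist):
    """Fuse a shape histogram and a motion histogram by discrete
    convolution, so the result carries characteristics of both inputs.
    Output length is len(shape) + len(motion) - 1, far smaller than the
    outer-product alternative of len(shape) * len(motion)."""
    shape_hist = shape_hist / (shape_hist.sum() + 1e-12)
    motion_hist = motion_hist / (motion_hist.sum() + 1e-12)
    return np.convolve(shape_hist, motion_hist)
```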
How Do Users Search with Basic HTML5 Video Players?

When searching within a video for a specific scene, most non-expert users employ a basic video player. The main advantage of such a player over more advanced retrieval tools lies in its ease of use and familiar controls and mode of operation. This means that the available navigation controls (play, fast forward, fast reverse, seeker-bar) will be used for interactive search and browsing. We compare the search behavior, by type of interaction and speed of interactive search, of two groups of users, each numbering 17 participants. Both groups performed the same tasks using an HTML5 video player but in different setups: the first group performed Known Item Search tasks, while the second performed Description Based Search tasks. The goal of this study is twofold: first, to better understand the way users search with a basic video player, so that useful insights can be taken into consideration when designing professional video browsing and search tools; second, to evaluate the impact of the different setups (Known Item Search vs. Description Based Search tasks).

Claudiu Cobârzan, Klaus Schoeffmann

Multimedia Collections

Visual Recognition by Exploiting Latent Social Links in Image Collections

Social network study has become an important topic in many research fields. Early works on social network analysis focused on real-world social interactions in either human society or the animal world. With the explosion of Internet data, social network researchers have started to pay more attention to the tremendous amount of online social network data. There is ample space for exploring social network research on large-scale online visual content. In this paper, we focus on the multi-label collective classification problem and develop a model that can harness the mutually beneficial information among the visual appearance, related semantic content and the social network structure simultaneously. Our algorithm is tested on CelebrityNet, a social network constructed by inferring implicit relationships of people based on online multimedia content. We apply our model to a few important multimedia applications such as image annotation and community classification. We demonstrate that our algorithm significantly outperforms traditional methods on community classification and image annotation.

Li-Jia Li, Xiangnan Kong, Philip S. Yu
Collections for Automatic Image Annotation and Photo Tag Recommendation

This paper highlights a number of problems which exist in the evaluation of existing image annotation and tag recommendation methods. Crucially, the collections used by these state-of-the-art methods contain a number of biases which may be exploited by, or detrimental to, their evaluation, resulting in misleading results. In total we highlight seven issues for three popular annotation evaluation collections, i.e. Corel5k, ESP Game and IAPR, as well as three issues with collections used in two state-of-the-art photo tag recommendation methods. The result of this paper is two freely available Flickr image collections designed for the fair evaluation of image annotation and tag recommendation methods, called Flickr-AIA and Flickr-PTR respectively. We show through experimentation and demonstration that these collections are ultimately fairer benchmarks than existing collections.

Philip J. McParlane, Yashar Moshfeghi, Joemon M. Jose
Graph-Based Multimodal Clustering for Social Event Detection in Large Collections of Images

A common approach to the problem of Social Event Detection (SED) in collections of multimedia relies on the use of clustering methods. Due to the heterogeneity of features associated with multimedia items in such collections, such a clustering task is very challenging and special multimodal clustering approaches need to be deployed. In this paper, we present a scalable graph-based multimodal clustering approach for SED in large collections of multimedia. The proposed approach utilizes example relevant clusterings to learn a model of the “same event” relationship between two items in the multimodal domain and subsequently to organize the items in a graph. Two variants of the approach are presented: the first based on a batch and the second on an incremental community detection algorithm. Experimental results indicate that both variants provide excellent clustering performance.

Georgios Petkos, Symeon Papadopoulos, Emmanouil Schinas, Yiannis Kompatsiaris
Tag Relatedness Using Laplacian Score Feature Selection and Adapted Jensen-Shannon Divergence

Folksonomies (networks of users, resources, and tags) allow users to easily retrieve, organize and browse web content. However, their advantages are still limited by the noisiness of user-provided tags. To overcome this problem, we propose an approach for identifying related tags in folksonomies. The approach uses tag co-occurrence statistics and Laplacian score feature selection to create a probability distribution for each tag. Consequently, related tags are determined according to the distance between their distributions. In this regard, we propose a distance metric based on the Jensen-Shannon Divergence. The new metric, named AJSD, deals with noise in the measurements due to statistical fluctuations in tag co-occurrences. We experimentally evaluated our approach using WordNet and compared it to a common tag relatedness approach based on cosine similarity. The results show the effectiveness of our approach and its advantage over the competing method.

Hatem Mousselly-Sergieh, Mario Döller, Elöd Egyed-Zsigmond, Gabriele Gianini, Harald Kosch, Jean-Marie Pinon
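
The underlying divergence is standard; a minimal sketch of the plain Jensen-Shannon divergence between two tag distributions is below (AJSD adds the paper's noise adaptation, which is not reproduced here).

```python
import numpy as np

def jensen_shannon(p, q, eps=1e-12):
    """Plain Jensen-Shannon divergence between two tag probability
    distributions p and q (the basis that AJSD adapts)."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)                       # mixture distribution
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```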
User Intentions in Digital Photo Production: A Test Data Set

Taking a photo with a digital camera or camera phone is a process triggered by a certain motivation. Some people want, for instance, to document the progress of a task; others want to preserve a moment of joy. In this contribution we present an openly available data set with 1,309 photos along with annotations specifying the intentions of the photographers. This data set is the result of a large survey on Flickr and shall provide a common basis for joint research on user intentions in photo production. The survey data was validated using Amazon Mechanical Turk. Besides discussing the process of creating the data set, we also describe its structure and give statistics on the data.

Mathias Lux, Desara Xhura, Alexander Kopper
Personal Media Reunion: Re-collecting Media Content Scattered over Smart Devices and Social Networks

With the rapidly growing use of smart phones, it becomes easier for people to document social events. Part of the multimedia content is shared on social networks like Facebook, while the rest stays private on the users’ phones. Friends also add content of the same event to Facebook. The media content of a social event thus becomes scattered over different accounts and increasingly difficult for the user to re-collect, especially given the absence of EXIF headers on Facebook. In this paper, we introduce an approach that supports users in automatically detecting distributed event content after selecting very few photos and videos on the phone. We use the metadata of the content on the phone to increase the initial seeds, and then exploit a novel face recognition approach using the social context, together with a probabilistic fusion model, to assign the media content distributed between several social contacts on Facebook to the respective events. Our event content detection significantly outperforms other approaches, which give poor results in metadata-unfriendly environments such as Facebook.

Mohamad Rabbath, Susanne Boll
Summarised Presentation of Personal Photo Sets

People produce an increasing amount of digital photos to document events. Searching for a specific event can return more photos than people can handle, making it difficult to judge their relevance. This paper presents a new algorithm that summarises a set of photos described by attributes at different concept levels. It addresses the well-known human weakness in dealing with large collections of distinct items by presenting a low-cardinality partition set. Each group yields a compact, yet distinct, description. The evaluation, including user tests, shows the algorithm outperforms others in context separation and informative power about the set being summarised.

Nuno Datia, João Moura-Pires, Nuno Correia

Applications

MOSRO: Enabling Mobile Sensing for Real-Scene Objects with Grid Based Structured Output Learning

Visual objects in mobile photos are usually captured in uncontrolled conditions, with various viewpoints, positions, scales, and background clutter. In this paper, we therefore develop MOSRO, a MObile Sensing framework for robust Real-scene Object recognition and localization. By extending conventional structured output learning with the proposed grid based representation as the output structure, MOSRO is not only able to locate visual objects precisely but also achieves real-time performance. The experimental results show that the proposed framework outperforms state-of-the-art methods on public real-scene image datasets. Further, to demonstrate its effectiveness for practical applications, the proposed MOSRO framework was implemented on Android mobile platforms as a prototype system for sensing various business signs on the street and instantly retrieving relevant information about the recognized businesses.

Heng-Yu Chi, Wen-Huang Cheng, Ming-Syan Chen, Arvin Wen Tsui
TravelBuddy: Interactive Travel Route Recommendation with a Visual Scene Interface

In this work, we propose a convenient system for trip planning that aims to change the behavior of trip planners from exhaustively searching for information to receiving useful travel recommendations. Given the essential and optional user inputs, our system automatically recommends a route that suits the traveler based on a real-time route planning algorithm and allows the user to make adjustments according to their preferences. We construct a traveling database by collecting photos taken around famous attractions and analyzing these photos to extract each attraction’s travel information, including popularity, typical stay time, available visiting time in a day, and visual scenes at different times. All the extracted travel information is presented to the user to help him/her efficiently learn more about different attractions, so that he/she can modify the inputs to obtain a more favorable travel route. The experimental results show that our system can effectively help the user plan a journey.

Cheng-Yao Fu, Min-Chun Hu, Jui-Hsin Lai, Hsuan Wang, Ja-Ling Wu
Who’s the Best Charades Player? Mining Iconic Movement of Semantic Concepts

Charades is a guessing game in which one player acts out a semantic concept (i.e. a word or phrase) for the other players to guess. An observation from playing charades is that people’s cognition of the iconic movements associated with a semantic concept is often inconsistent, a fact that has long been ignored in multimedia research. Therefore, the novelty of this work is to propose an automated method for mining the most representative videos for each semantic concept, as its iconic movements, from a large set of related videos containing various human actions. The discovered iconic movements can be further employed to benefit a broad range of tasks, such as human action recognition and retrieval. For our purpose, a new video benchmark is also presented, and the experiments demonstrate our approach’s potential for human action based applications.

Yung-Huan Hsieh, Shintami C. Hidayati, Wen-Huang Cheng, Min-Chun Hu, Kai-Lung Hua
Tell Me about TV Commercials of This Product

TV commercial archives, which once recorded the fashion and technology of our society, contain a large amount of information deserving deep analysis, for instance, discovery of hot products, exploration of the relationship between the air times and market sales of a product, analysis and prediction of market trends, and so on. Leveraging a new text-to-features transformation and integrating many state-of-the-art video search techniques, we have built an interactive system on top of video retrieval in a large collection of three years of five-channel TV commercial videos. To the best of our knowledge, this is the largest commercial data set used for retrieval so far. To interact with the system, users can either use a keyboard to type keywords or use their mobile devices to snap a picture of a product of interest, and the system returns relevant commercials in real time. Users are further able to browse videos and access their air patterns, such as air time and air frequency. These patterns usually reflect the social behavior of viewers, i.e. which social groups (young or adult, male or female) are the targets of the product, and when is the peak time for viewers to watch this commercial category.

Cai-Zhi Zhu, Siriwat Kasamwattanarote, Xiaomeng Wu, Shin’ichi Satoh
A Data-Driven Personalized Digital Ink for Chinese Characters

In this paper, we propose a novel data-driven digital ink generation method for Chinese characters. When users write on tablets, they can select a specific rendering style from the predefined style database to generate character images in the selected style. Our method is able to learn personalized style information from a small set of Chinese characters in a given style and then generate the stroke style database using a specially-designed stroke segmentation model. The model divides a stroke into three kinds of segments: head, corner and middle segments. For each type of stroke segment, we employ a corresponding method to process the style information it contains. Experimental results demonstrate that our method not only works well for Chinese characters but is also effective for other kinds of shapes.

Tianyang Yi, Zhouhui Lian, Yingmin Tang, Jianguo Xiao
Local Segmentation for Pedestrian Tracking in Dense Crowds

People tracking in dense crowds is challenging due to the high levels of inter-pedestrian occlusion occurring continuously. After each successive occlusion, the surface of the tracked object that has never been hidden shrinks. If not corrected, this shrinking problem eventually causes the system to stop as the area to track becomes too small. In this paper we investigate how hidden parts of one target object can be recovered after occlusions and propose challenging data to evaluate such segmentation-tracking techniques in dense crowds. The segmentation/tracking problem is particularly difficult to solve for non-rigid objects. Here, we focus on pedestrians, whose limbs and lower body parts often get occluded in crowded scenes. We first investigate the unmet challenges of pedestrian tracking in crowds and propose a challenging video to evaluate segmentation-tracking robustness to inter-pedestrian occlusions. We then detail a fast segmentation-based method to overcome some aspects of the tracking-under-occlusion problem. We finally compare our results with two existing tracking methods.

Clement Creusot
An Optimization Model for Aesthetic Two-Dimensional Barcodes

Given a message m and a logo image L, we want to generate an image that is visually similar to L and yet carries m in the payload with respect to a 2D barcode reader. This problem is similar to digital image watermarking, except for the requirement of using the specific barcode readers and applications pre-installed on the end-users’ devices. We formulate the generation as an optimization problem that considers operations carried out by the barcode readers, in particular the sampling process and error correction. We propose a two-phase algorithm that solves the optimization problem. We adapt the algorithm to QR codes and make a few observations to further enhance its performance.

Chengfang Fang, Chunwang Zhang, Ee-Chien Chang
Live Key Frame Extraction in User Generated Content Scenarios for Embedded Mobile Platforms

In this work we investigate the suitability of Key Frame Extraction (KFE) solutions on embedded architectures for mobile platforms. In particular, in our scenario the list of key frames is requested right after a shooting session ends. The most interesting outcome is the evaluation of performance in User Generated Content scenarios through an extensive survey based on Opinion Scores, interviews and other minor metrics, testing five different solutions. Results suggest that pursuing sophisticated algorithms does not necessarily enrich the end user experience. We hope that this work will contribute to stimulating the debate on KFE in realistic scenarios and to paying attention to user feedback to drive the investigation.

Alexandro Sentinelli, Luca Celetto
Understanding Affective Content of Music Videos through Learned Representations

In consideration of the ever-growing available multimedia data, annotating multimedia content automatically with feeling(s) expected to arise in users is a challenging problem. In order to solve this problem, the emerging research field of video affective analysis aims at exploiting human emotions. In this field where no dominant feature representation has emerged yet, choosing discriminative features for the effective representation of video segments is a key issue in designing video affective content analysis algorithms. Most existing affective content analysis methods either use low-level audio-visual features or generate hand-crafted higher level representations based on these low-level features. In this work, we propose to use deep learning methods, in particular convolutional neural networks (CNNs), in order to learn mid-level representations from automatically extracted low-level features. We exploit the audio and visual modality of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the RGB space in order to build higher level audio and visual representations. We use the learned representations for the affective classification of music video clips. We choose multi-class support vector machines (SVMs) for classifying video clips into four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results on a subset of the DEAP dataset (on 76 music video clips) show that a significant improvement is obtained when higher level representations are used instead of low-level features, for video affective content analysis.

Esra Acar, Frank Hopfgartner, Sahin Albayrak
Robust Image Restoration via Reweighted Low-Rank Matrix Recovery

In this paper, we propose a robust image restoration method via reweighted low-rank matrix recovery. In the literature, Principal Component Pursuit (PCP) solves the low-rank matrix recovery problem via a convex program mixing the nuclear norm and the ℓ1 norm. Inspired by reweighted ℓ1 minimization for sparsity enhancement, we propose reweighting the singular values to enhance the low rank of a matrix. An efficient iterative reweighting scheme is proposed to enhance low rank and sparsity simultaneously, and the performance of low-rank matrix recovery is improved greatly. We demonstrate the utility of the proposed method on robust image restoration, including single image and hyperspectral image restoration. All of these experiments give appealing results.

Yigang Peng, Jinli Suo, Qionghai Dai, Wenli Xu, Song Lu
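
For context, PCP recovers the low-rank part $L$ and sparse part $S$ of data $M$ by solving

$$\min_{L,S}\ \|L\|_{*} + \lambda\|S\|_{1} \quad \text{subject to} \quad M = L + S.$$

By analogy with reweighted $\ell_1$ minimization, a reweighted variant replaces the nuclear norm with a weighted sum of singular values $\sum_i w_i\,\sigma_i(L)$, updating $w_i = 1/(\sigma_i(L^{(k)}) + \epsilon)$ at each iteration $k$ so that small singular values are penalized more strongly; the paper's exact weighting rule may differ from this standard form.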
Learning to Infer Public Emotions from Large-Scale Networked Voice Data

Emotions are increasingly and controversially central to our public life. Compared to text or image data, voice is the most natural and direct way to express one’s emotions in real time. With the increasing adoption of smart phone voice dialogue applications (e.g., Siri and Sogou Voice Assistant), large-scale networked voice data can help us better quantitatively understand the emotional world we live in. In this paper, we study the problem of inferring public emotions from large-scale networked voice data. In particular, we first investigate the primary emotions and the underlying emotion patterns in human-mobile voice communication. Then we propose a partially-labeled factor graph model (PFG) that incorporates both acoustic features (e.g., energy, f0, MFCC, LFPC) and correlation features (e.g., individual consistency, time associativity, environment similarity) to automatically infer emotions. We evaluate the proposed model on a real dataset from the Sogou Voice Assistant application. The experimental results verify the effectiveness of the proposed model.

Zhu Ren, Jia Jia, Lianhong Cai, Kuo Zhang, Jie Tang
Joint People Recognition across Photo Collections Using Sparse Markov Random Fields

We show how to jointly recognize people across an entire photo collection while considering the specifics of personal photos, which often depict multiple people. We devise and explore a sparse but efficient graph design based on a second-order Markov Random Field that utilizes a distance-based face description method. Experiments on two datasets demonstrate and validate the effectiveness of our probabilistic approach compared to traditional methods.

Markus Brenner, Ebroul Izquierdo

Temporal Analysis

Event Detection by Velocity Pyramid

In this paper, we propose the velocity pyramid for multimedia event detection. Recently, spatial pyramid matching was proposed to introduce coarse geometric information into the Bag of Features framework, and it is effective for static image recognition and detection. In video, not only spatial information but also temporal information, which represents its dynamic nature, is important. In order to fully utilize it, we propose the velocity pyramid, where video frames are divided into motional sub-regions. Our method is effective for detecting events characterized by their temporal patterns. Experiments on the MED (Multimedia Event Detection) dataset show a 10% performance improvement with the velocity pyramid over the baseline without it. Further, when combined with the spatial pyramid, the velocity pyramid provides an extra 3% gain in the detection result.

Zhuolin Liang, Nakamasa Inoue, Koichi Shinoda
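
As an illustrative sketch only (the paper's exact partitioning into motional sub-regions is not reproduced), pooling Bag-of-Features histograms separately per motion bin, by analogy with spatial pyramid pooling over space, might look like this:

```python
import numpy as np

def velocity_pyramid_histogram(word_ids, flow_mags, vocab_size, n_bins=2):
    """Pool one bag-of-features histogram per motion sub-region, here
    approximated by quantile bins over per-feature optical-flow
    magnitude, and concatenate the per-bin histograms."""
    edges = np.quantile(flow_mags, np.linspace(0, 1, n_bins + 1))
    bin_ids = np.clip(np.digitize(flow_mags, edges[1:-1]), 0, n_bins - 1)
    hists = [np.bincount(word_ids[bin_ids == b], minlength=vocab_size)
             for b in range(n_bins)]
    return np.concatenate(hists)
```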
Fusing Appearance and Spatio-temporal Features for Multiple Camera Tracking

Multiple camera tracking is a challenging task for many surveillance systems. The objective of multiple camera tracking is to maintain trajectories of objects in the camera network. Due to ambiguities in the appearance of objects, it is challenging to re-identify objects when they re-appear in other cameras. Most research works associate objects by using appearance features. In this work, we fuse appearance and spatio-temporal features for person re-identification. Our framework consists of two steps: preprocessing to reduce the number of association candidates, and associating objects using the probabilistic relative distance. We set up an experimental environment with 10 cameras and achieve better performance than using appearance features only.

Nam Trung Pham, Karianto Leman, Richard Chang, Jie Zhang, Hee Lin Wang
A Dense SURF and Triangulation Based Spatio-temporal Feature for Action Recognition

In this paper, we propose a novel method of extracting spatio-temporal features from videos. Given a video, we extract its features according to every set of N frames. The value of N is small enough to guarantee the temporal denseness of our features. For each frame set, we first extract dense SURF keypoints from its first frame. We then select points with the most likely dominant and reliable movements, and consider them as interest points. In the next step, we form triangles of interest points using Delaunay triangulation and track the points of each triple through the frame set. We extract one spatio-temporal feature from each triangle based on its shape feature along with the visual features and optical flows of its points. This enables us to extract spatio-temporal features based on groups of related points and their trajectories. Hence the features can be expected to be robust and informative. We apply Fisher Vector encoding to represent videos using the proposed spatio-temporal features. We conduct experiments on several challenging benchmarks, and show the effectiveness of our proposed method.

Do Hang Nga, Keiji Yanai
Resource Constrained Multimedia Event Detection

We present a study comparing the cost and efficiency tradeoffs of multiple features for multimedia event detection. Low-level as well as semantic features are a critical part of contemporary multimedia and computer vision research. Arguably, combinations of multiple feature sets have been a major reason for recent progress in the field, not just as low dimensional representations of multimedia data, but also as a means to semantically summarize images and videos. However, their efficacy for complex event recognition in unconstrained videos on standardized datasets has not been systematically studied. In this paper, we evaluate the accuracy and contribution of more than 10 multi-modality features, including semantic and low-level video representations, using two newly released NIST TRECVID Multimedia Event Detection (MED) open source datasets, i.e. MEDTEST and KINDREDTEST, which contain more than 1000 hours of video. Contrasting multiple performance metrics, such as average precision, probability of missed detection and minimum normalized detection cost, we propose a framework to balance the trade-off between accuracy and computational cost. This study provides an empirical foundation for selecting feature sets that are capable of dealing with large-scale data under limited computational resources and are likely to produce superior multimedia event detection accuracy. The framework also applies to other resource-limited multimedia analyses such as selecting/fusing multiple classifiers and different representations of each feature set.

Zhen-Zhong Lan, Yi Yang, Nicolas Ballas, Shoou-I Yu, Alexander Hauptmann
Random Matrix Ensembles of Time Correlation Matrices to Analyze Visual Lifelogs

Visual lifelogging is the process of automatically recording images and other sensor data for the purpose of aiding memory recall. Such lifelogs are usually created using wearable cameras. Given the vast amount of images that are maintained in a visual lifelog, it is a significant challenge for users to deconstruct a sizeable collection of images into meaningful events. In this paper, random matrix theory (RMT) is applied to a cross-correlation matrix C, constructed using SenseCam lifelog data streams, to identify such events. The analysis reveals a number of eigenvalues that deviate from the spectrum suggested by RMT. The components of the deviating eigenvectors are found to correspond to “distinct significant events” in the visual lifelogs. Finally, the cross-correlation matrix C is cleaned by separating the noisy part from the non-noisy part. Overall, the RMT technique is shown to be useful to detect major events in SenseCam images.

Na Li, Martin Crane, Heather J. Ruskin, Cathal Gurrin
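
The RMT test itself is compact: eigenvalues of the correlation matrix that exceed the Marchenko-Pastur bound for pure noise point at structured, event-like correlations. A minimal sketch under standard assumptions:

```python
import numpy as np

def deviating_components(streams):
    """streams: (T, N) array of T time samples of N normalized sensor
    streams. Returns the eigenvalues/eigenvectors of the correlation
    matrix C that exceed the Marchenko-Pastur upper bound, i.e. the
    components RMT cannot explain as noise."""
    T, N = streams.shape
    C = np.corrcoef(streams, rowvar=False)     # N x N correlation matrix
    evals, evecs = np.linalg.eigh(C)
    mp_upper = (1 + np.sqrt(N / T)) ** 2       # noise-band edge for Q = T/N
    mask = evals > mp_upper
    return evals[mask], evecs[:, mask]
```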

3D and Augmented Reality

Exploring Distance-Aware Weighting Strategies for Accurate Reconstruction of Voxel-Based 3D Synthetic Models

In this paper, we propose and evaluate various distance-aware weighting strategies to improve the reconstruction accuracy of a voxel-based model according to the Truncated Signed Distance Function (TSDF), from data obtained by low-cost depth sensors. We look at two strategy directions: (a) weight definition strategies prioritizing the importance of the sensed data depending on the data accuracy, and (b) model updating strategies defining the level of influence of the new data on the existing 3D model. In particular, we introduce Distance-Aware (DA) and Distance-Aware Slow-Saturation (DASS) updating methods to intelligently integrate the depth data into the synthetic 3D model based on the distance-sensitivity metric of a low-cost depth sensor. By quantitative and qualitative comparison of the resulting synthetic 3D models to the corresponding ground-truth models, we identify the most promising strategies, which reduce the model error by 10-35%.

Hani Javan Hemmat, Egor Bondarev, Peter H. N. de With
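
A minimal sketch of the kind of distance-aware TSDF update under discussion, assuming a quadratic growth of Kinect-like depth noise with range (the paper's actual DA/DASS weight formulas are not reproduced here):

```python
import numpy as np

def tsdf_update(D, W, d_new, z, max_weight=100.0):
    """One weighted-average TSDF update over a batch of voxels.

    D, W  : current truncated signed distances and accumulated weights
    d_new : newly measured truncated signed distances
    z     : measurement distances (m); the weight falls off as 1/z^2
            since low-cost sensor noise grows roughly quadratically
            with range (an assumed distance-aware weight)
    """
    w_new = 1.0 / (z ** 2 + 1e-6)
    D = (W * D + w_new * d_new) / (W + w_new)
    W = np.minimum(W + w_new, max_weight)     # capped accumulation
    return D, W
```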
Exploitation of Gaze Data for Photo Region Labeling in an Immersive Environment

Metadata describing the content of photos is of high importance for applications like image search or as part of training sets for object detection algorithms. In this work, we apply tags to image regions for a more detailed description of the photo semantics. This region labeling is performed without additional effort from the user, just by analyzing eye tracking data recorded while users are playing a gaze-controlled game. In the game EyeGrab, users classify and rate photos falling down the screen. The photos are classified according to a given category under time pressure. The game has been evaluated in a study with 54 subjects. The results show that it is possible to assign the given categories to image regions with a precision of up to 61%. This shows that we can perform an almost equally good region labeling using an immersive environment like EyeGrab compared to a previous classification experiment that was much more controlled.

Tina Walber, Ansgar Scherp, Steffen Staab
MR Simulation for Re-wallpapering a Room in a Free-Hand Movie

This paper proposes a wallpaper replacement method in a free-hand movie. Our method superimposes computer graphics (CG) wallpaper onto a real wall region in the free-hand movie looking around a room. To extract wallpaper planes, ordinarily a special and expensive 3D survey instrument is required. However, there is usually no such instrument in homes or offices where many users want to experience wallpaper replacement simulation. To solve this problem, we extract the wallpaper region by an image segmentation technique with user interaction. By applying our method, wallpaper replacement can be easily achieved using only a handy camera and a PC.

Masashi Ueda, Itaru Kitahara, Yuichi Ohta
Segment and Label Indoor Scene Based on RGB-D for the Visually Impaired

The growing study of RGB-D sensors and 3D point clouds has brought new progress in obstacle avoidance for the visually impaired. However, it remains a challenging problem due to the difficulty of designing a robust and real-time algorithm. In this paper, we focus on scene segmentation and labeling. As man-made indoor scenes contain many planar areas and structures, plane segmentation and classification are important for further scene analysis. This work proposes a multiscale-voxel strategy to reduce the effects of noise and improve plane segmentation. The segmentation result is then combined with depth and color data to apply a graph-based image segmentation algorithm. After that, a cascaded decision tree is trained to classify the segments into different semantic types. The method is tested on part of the NYU Depth Dataset. Experimental results show that the proposed method combines the advantages of depth data and the geometric characteristics of the scene, and improves scene segmentation and obstacle detection.

Zhe Wang, Hong Liu, Xiangdong Wang, Yueliang Qian
A Low-Cost Head and Eye Tracking System for Realistic Eye Movements in Virtual Avatars

A virtual avatar or autonomous agent is a digital representation of a human being that can be controlled by either a human or an artificially intelligent computer system. Increasingly avatars are becoming realistic virtual human characters that exhibit human behavioral traits, body language and eye and head movements. As the interpretation of eye and head movements represents an important part of nonverbal human communication it is extremely important to accurately reproduce these movements in virtual avatars to avoid falling into the well-known “uncanny valley”. In this paper we present a cheap hybrid real-time head and eye tracking system based on existing open source software and commonly available hardware. Our evaluation indicates that the system of head and eye tracking is stable and accurate and can allow a human user to robustly puppet a virtual avatar, potentially allowing us to train an A.I. system to learn realistic human head and eye movements.

Yingbo Li, Haolin Wei, David S. Monaghan, Noel E. O’Connor
Real-Time Skeleton-Tracking-Based Human Action Recognition Using Kinect Data

In this paper, a real-time tracking-based approach to human action recognition is proposed. The method receives as input depth map data streams from a single Kinect sensor. Initially, a skeleton-tracking algorithm is applied. Then, a new action representation is introduced, based on the calculation of spherical angles between selected joints and the respective angular velocities. To incorporate invariance, a pose estimation step is applied and all features are extracted according to a continuously updated torso-centered coordinate system; this differs from the usual practice of using common normalization operators. Additionally, the approach includes a motion energy-based methodology for applying horizontal symmetry. Finally, action recognition is realized using Hidden Markov Models (HMMs). Experimental results using the Huawei/3DLife 3D human reconstruction and action recognition Grand Challenge dataset demonstrate the efficiency of the proposed approach.

Georgios Th. Papadopoulos, Apostolos Axenopoulos, Petros Daras
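
The core feature is easy to picture: each selected joint is expressed by its spherical angles in the torso-centered frame. A small sketch follows, where the joint choices and axis conventions are assumptions, not the paper's exact definitions:

```python
import numpy as np

def spherical_angles(joint, torso_origin):
    """Polar and azimuth angles of a joint relative to the torso origin,
    both expressed in a torso-centered coordinate system; angular
    velocities follow as frame-to-frame differences of these angles."""
    v = joint - torso_origin
    r = np.linalg.norm(v) + 1e-12
    theta = np.arccos(v[2] / r)        # polar angle from the torso z-axis
    phi = np.arctan2(v[1], v[0])       # azimuth in the torso x-y plane
    return theta, phi
```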
Kinect vs. Low-cost Inertial Sensing for Gesture Recognition

In this paper, we investigate efficient recognition of human gestures and movements from multimedia and multimodal data, including the Microsoft Kinect and translational and rotational acceleration and velocity from wearable inertial sensors. We first present a system that automatically classifies a large range of activities (17 different gestures) using a random forest decision tree. Our system can achieve near real-time recognition by appropriately selecting the sensors that contribute most to a particular task. Features extracted from multimodal sensor data were used to train and evaluate a customized classifier. This novel technique is capable of successfully classifying various gestures with up to 91% overall accuracy on a publicly available data set. Second, we investigate a wide range of different motion capture modalities and compare their results in terms of gesture recognition accuracy using our proposed approach. We conclude that gesture recognition can be effectively performed by considering an approach that overcomes many of the limitations associated with the Kinect and potentially paves the way for low-cost gesture recognition in unconstrained environments.

Marc Gowing, Amin Ahmadi, François Destelle, David S. Monaghan, Noel E. O’Connor, Kieran Moran
Yoga Posture Recognition for Self-training

Self-training plays an important role in sports exercise, but improper training postures can cause serious harm to the muscles and ligaments of the body. Hence, more and more researchers are devoted to the development of computer-assisted self-training systems for sports exercise. In this paper, we propose a Yoga posture recognition system, which is capable of recognizing what Yoga posture the practitioner is performing and then retrieving Yoga training information from the Internet to draw his/her attention to the posture. First, a Kinect is used to capture the user body map and extract the body contour. Then, the star skeleton, a fast skeletonization technique that connects the centroid of the target object to contour extremes, is used as a representative descriptor of human posture for Yoga posture recognition. Finally, Yoga training information for the recognized posture can be retrieved from the Internet to remind the practitioner what to pay attention to when practicing the posture.

Hua-Tsung Chen, Yu-Zhen He, Chun-Chieh Hsu, Chien-Li Chou, Suh-Yin Lee, Bao-Shuh P. Lin
Real-Time Gaze Estimation Using a Kinect and a HD Webcam

In human-computer interaction, gaze orientation is an important and promising source of information to demonstrate the attention and focus of users. Gaze detection can also be an extremely useful metric for analysing human mood and affect. Furthermore, gaze can be used as an input method for human-computer interaction. However, real-time and accurate gaze estimation is still an open problem. In this paper, we propose a simple and novel estimation model of the real-time gaze direction of a user on a computer screen. This method utilises cheap capturing devices, an HD webcam and a Microsoft Kinect. We consider that the gaze motion of a user facing forwards is composed of the local gaze motion shifted by eye motion and the global gaze motion driven by face motion. We validate our proposed model of gaze estimation and provide an experimental evaluation of the reliability and the precision of the method.

Yingbo Li, David S. Monaghan, Noel E. O’Connor

Compression, Transcoding and Streaming

A Framework of Video Coding for Compressing Near-Duplicate Videos

With the development of multimedia techniques and social networks, the amount of video has grown rapidly, bringing about an increasingly substantial percentage of Near-Duplicate Videos (NDVs). Retrieving NDVs has been a hot research topic for a number of applications such as copyright detection, Internet video ranking, etc. However, there exists a lot of redundancy among NDVs, and to the best of our knowledge how to efficiently compress NDVs in a joint manner is an untouched research area. In this work, a novel video coding framework is proposed to effectively compress NDVs by making full use of the relevance among them. Experimental results demonstrate that a significant storage saving can be achieved by the proposed NDV coding framework.

Hanli Wang, Ming Ma, Yu-Gang Jiang, Zhihua Wei
An Improved Similarity-Based Fast Coding Unit Depth Decision Algorithm for Inter-frame Coding in HEVC

The emerging High Efficiency Video Coding (HEVC) standard aims to achieve significantly improved compression performance with respect to the state-of-the-art H.264/AVC high profile. This better compression performance comes at a much higher computational complexity, which makes it difficult for real-time video systems. To reduce the encoding complexity, this paper presents an improved similarity-based fast coding unit depth decision algorithm for inter-frame coding. Firstly, a fast and precise depth information acquisition method is proposed to improve the accuracy of depth prediction. Secondly, CTUs of medium similarity degree are further divided into two similarity degree categories, and the complexity of coding the coding tree units of these two categories is reduced by different coding unit depth decision strategies. Experimental results show that the proposed algorithm saves on average 35.72% of the encoding time with negligible loss in rate-distortion performance compared with HM11.0, and consistently outperforms state-of-the-art schemes.

Rui Fan, Yongfei Zhang, Zhe Li, Ning Wang
Low-Complexity Rate-Distortion Optimization Algorithms for HEVC Intra Prediction

HEVC achieves a better coding efficiency relative to prior standards, but also involves dramatically increased complexity. The complexity increase for intra prediction is especially intensive due to a highly flexible quad-tree coding structure and a large number of prediction modes.

The encoder employs rate-distortion optimization (RDO) to select the optimal coding mode, and RDO takes a great portion of the intra encoding complexity. Moreover, HEVC has a stronger dependency on RDO than H.264/AVC. To reduce the computational complexity and to enable a real-time implementation, this paper presents two low-complexity RDO algorithms for HEVC intra prediction. The structure of RDO is simplified by the proposed rate and distortion estimators, and some hardware-unfriendly modules are simplified. Compared with the original RDO procedure, the two proposed algorithms reduce RDO time by 46% and 64% respectively, with acceptable coding efficiency loss.

Zhe Sheng, Dajiang Zhou, Heming Sun, Satoshi Goto
Factor Selection for Reinforcement Learning in HTTP Adaptive Streaming

At present, HTTP Adaptive Streaming (HAS) is developing into a key technology for video delivery over the Internet. In this delivery strategy, the client proactively and adaptively requests a quality version of chunked video segments based on its playback buffer, the perceived network bandwidth and other relevant factors. In this paper, we discuss the use of reinforcement learning (RL) to learn the optimal request strategy at the HAS client by progressively maximizing a pre-defined Quality of Experience (QoE)-related reward function. Within the RL framework, we investigate the most influential factors for the request strategy using a forward variable selection algorithm. The performance of the RL-based HAS client is evaluated in a Video-on-Demand (VOD) simulation system. Results show that, given the QoE-related reward function, the RL-based HAS client is able to optimize the quantitative QoE. Compared with a conventional HAS system, the RL-based HAS client is more robust and flexible under versatile network conditions.

Tingyao Wu, Werner Van Leekwijck
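
A tabular Q-learning sketch conveys the flavor of such an RL-based client; the state discretization, reward shape and all constants below are illustrative assumptions, not the paper's design.

```python
import numpy as np

N_BUF, N_BW, N_QUAL = 10, 10, 5        # discretized buffer/bandwidth bins, quality levels
Q = np.zeros((N_BUF, N_BW, N_QUAL))    # state = (buffer bin, bandwidth bin)
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
rng = np.random.default_rng()

def choose_quality(state):
    """Epsilon-greedy selection of the next segment's quality level."""
    if rng.random() < EPS:
        return int(rng.integers(N_QUAL))
    return int(np.argmax(Q[state]))

def learn(state, action, reward, next_state):
    """One Q-learning update; the reward is assumed to be a QoE proxy
    rewarding quality and penalizing switches and rebuffering."""
    target = reward + GAMMA * np.max(Q[next_state])
    Q[state + (action,)] += ALPHA * (target - Q[state + (action,)])
```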
Stixel on the Bus: An Efficient Lossless Compression Scheme for Depth Information in Traffic Scenarios

The modern automotive industry has to meet the requirement of providing a safer, more comfortable and interactive driving experience. Depth information retrieved from a stereo vision system is one significant resource enabling vehicles to understand their environment. Relying on stixels, a compact representation of depth information using thin planar rectangles, the problem of processing huge amounts of depth data in real time can be solved. In this paper, we present an efficient lossless compression scheme for stixels, which further reduces the data volume by a factor of 3.3863. The predictor of the proposed approach is adapted from the LOCO-I (LOw COmplexity LOssless COmpression for Images) algorithm in the JPEG-LS standard. The compressed stixel data can be sent to the in-vehicle communication bus system for future vehicle applications such as autonomous driving and mixed reality systems.

Qing Rao, Christian Grünler, Markus Hammori, Samarjit Chakraborty
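
The LOCO-I predictor mentioned above is the JPEG-LS median edge detector, which is simple enough to state directly:

```python
def med_predict(a, b, c):
    """Median Edge Detector (MED) from LOCO-I / JPEG-LS: predicts the
    current sample from its left (a), upper (b) and upper-left (c)
    neighbours; the prediction residuals are then entropy coded."""
    if c >= max(a, b):
        return min(a, b)       # horizontal edge above: predict from the flat side
    if c <= min(a, b):
        return max(a, b)       # vertical edge to the left
    return a + b - c           # smooth region: planar prediction
```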
A New Saliency Model Using Intra Coded High Efficiency Video Coding (HEVC) Frames

The computation of visual attention is an exhaustive procedure to locate conspicuous regions within a frame, which contrast with the surrounding background. In this paper we propose a unique algorithm to estimate visual saliency in the compressed domain using intra-coded frames from High Efficiency Video Coding (HEVC) encoded video sequences. By exclusively combining data obtained from the coding unit structure, intra mode block predictions and the residual data, a visual saliency approximation is obtained. The proposed model can accurately detect salient regions without the need to fully decode the HEVC bitstream. Experimental results show the proposed algorithm compares positively against multiple methods in the literature, highlighting accurate saliency detection with minimal time additions to the video coding computation. The new methodology can provide aid to a wide variety of fields such as advertising, watermarking, video editing and spatial-temporal adaptation.

Matthew Oakes, Charith Abhayaratne
Multiple Reference Frame Transcoding from H.264/AVC to HEVC

The emerging video coding standard High Efficiency Video Coding (HEVC) has recently been developed by the ITU-T and JCT-VC groups to replace the current H.264/AVC standard, which has been very successful and widely adopted in recent years. Hence, there will be a need for efficient conversion from H.264/AVC to HEVC. In this paper, we present a fast motion estimation mechanism to speed up the transcoding process from H.264/AVC to HEVC. Because HEVC and H.264/AVC share a similar coding architecture, we exploit the information gathered in the H.264/AVC decoding algorithm to reduce the reference frames checked. Experimental results show that the proposed transcoding mechanism achieves a good tradeoff between coding efficiency and complexity in terms of motion estimation cost.

Antonio Jesus Diaz-Honrubia, Jose Luis Martinez, Pedro Cuenca
Backmatter
Metadata
Title
MultiMedia Modeling
Edited by
Cathal Gurrin
Frank Hopfgartner
Wolfgang Hurst
Håvard Johansen
Hyowon Lee
Noel O’Connor
Copyright year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-04114-8
Print ISBN
978-3-319-04113-1
DOI
https://doi.org/10.1007/978-3-319-04114-8