## About this book

The two-volume set LNCS 10132 and 10133 constitutes the thoroughly refereed proceedings of the 23rd International Conference on Multimedia Modeling, MMM 2017, held in Reykjavik, Iceland, in January 2017.

Of the 149 full papers submitted, 36 were selected for oral presentation and 33 for poster presentation; of the 34 special session papers submitted, 24 were selected for oral presentation and 2 for poster presentation; in addition, 5 demonstrations were accepted from 8 submissions, and all 7 submissions to VBS 2017. All papers presented were carefully reviewed and selected from 198 submissions.

MMM is a leading international conference for researchers and industry practitioners for sharing new ideas, original research results and practical development experiences from all MMM related areas, broadly falling into three categories: multimedia content analysis; multimedia signal processing and communications; and multimedia applications and services.

## Table of Contents

### A Comparative Study for Known Item Visual Search Using Position Color Feature Signatures

According to the results of the Video Browser Showdown competition, position-color feature signatures proved to be an effective model for visual known-item search tasks in BBC video collections. In this paper, we investigate details of the retrieval model based on feature signatures, given a state-of-the-art known-item search tool, the Signature-based Video Browser. We also present a preliminary comparative study of three variants of the utilized distance measures. In the discussion, we analyze logs and provide clues for understanding the performance of our model.

Jakub Lokoč, David Kuboň, Adam Blažek

### A Novel Affective Visualization System for Videos Based on Acoustic and Visual Features

With the fast development of social media in recent years, affective video content analysis has become a hot research topic and the relevant techniques are adopted by quite a few popular applications. In this paper, we first propose a novel set of audiovisual movie features to improve the accuracy of affective video content analysis, including seven audio features, eight visual features and two movie grammar features. Then, we propose an iterative method with low time complexity to select a set of more significant features for analyzing a specific emotion. Next, we adopt the BP (Back Propagation) network and the circumplex model to map the low-level audiovisual features onto high-level emotions. To validate our approach, a novel video player with affective visualization is designed and implemented, which makes emotion visible and accessible to the audience. Finally, we built a video dataset including 2000 video clips with manual affective annotations, and conducted extensive experiments to evaluate our proposed features, algorithms and models. The experimental results reveal that our approach outperforms state-of-the-art methods.

Jianwei Niu, Yiming Su, Shasha Mo, Zeyu Zhu

### A Novel Two-Step Integer-pixel Motion Estimation Algorithm for HEVC Encoding on a GPU

Integer-pixel Motion Estimation (IME) is one of the fundamental and time-consuming modules in encoding. In this paper, a novel two-step IME algorithm is proposed for High Efficiency Video Coding (HEVC) on a Graphics Processing Unit (GPU). First, the whole search region is roughly investigated with a predefined search pattern, which is analyzed in detail to effectively reduce the complexity. Then, the search result is further refined only in the zones around the best candidates of the first step. By dividing IME into two steps, the proposed algorithm combines the advantage of one-step algorithms in synchronization and the advantage of multiple-step algorithms in complexity. According to the experimental results, the proposed algorithm achieves up to 3.64 times speedup compared with previous representative algorithms, while the search accuracy is maintained. Since the IME algorithm is independent of other modules, it is a good choice for different GPU-based encoding applications.

Keji Chen, Jun Sun, Zongming Guo, Dachuan Zhao
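
The coarse-then-refine idea in the abstract above can be illustrated with a small CPU sketch. This is not the authors' GPU algorithm: the regular-grid coarse pattern, the `stride` parameter, the SAD cost, refinement around only the single best candidate, and all function names are assumptions made for illustration.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale frame as rows of pixel values


def sad(cur: Frame, ref: Frame, bx: int, by: int, dx: int, dy: int, size: int) -> int:
    """Sum of absolute differences between a block and its displaced reference block."""
    return sum(
        abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
        for r in range(size)
        for c in range(size)
    )


def in_bounds(ref: Frame, bx: int, by: int, dx: int, dy: int, size: int) -> bool:
    """True if the displaced block lies fully inside the reference frame."""
    height, width = len(ref), len(ref[0])
    return (0 <= by + dy and by + dy + size <= height
            and 0 <= bx + dx and bx + dx + size <= width)


def two_step_search(cur: Frame, ref: Frame, bx: int, by: int,
                    size: int = 4, search: int = 4, stride: int = 2) -> Tuple[int, int]:
    """Return the integer motion vector (dx, dy) found by a coarse pass plus refinement."""
    def cost(mv: Tuple[int, int]) -> int:
        return sad(cur, ref, bx, by, mv[0], mv[1], size)

    # Step 1: sample the whole search region on a sparse grid (hypothetical pattern).
    # Zero motion is always included so the candidate list is never empty.
    coarse = [(0, 0)] + [
        (dx, dy)
        for dy in range(-search, search + 1, stride)
        for dx in range(-search, search + 1, stride)
        if in_bounds(ref, bx, by, dx, dy, size)
    ]
    best = min(coarse, key=cost)

    # Step 2: test every position in the small zone around the coarse winner.
    fine = [
        (best[0] + ddx, best[1] + ddy)
        for ddy in range(-stride + 1, stride)
        for ddx in range(-stride + 1, stride)
        if max(abs(best[0] + ddx), abs(best[1] + ddy)) <= search
        and in_bounds(ref, bx, by, best[0] + ddx, best[1] + ddy, size)
    ]
    return min(fine, key=cost)
```

Every candidate within a step is evaluated independently, which is what makes this kind of search attractive for parallel hardware; a coarse grid can still miss the true motion on highly textured content.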

### A Scalable Video Conferencing System Using Cached Facial Expressions

We propose a scalable video conferencing system that streams High-Definition videos (when bandwidth is sufficient) and ultra-low-bitrate ($${<}0.25$$ kbps) cached facial expressions (when the bandwidth is scarce). Our solution consists of optimized approaches to: (i) choose representative facial expressions from training video frames and (ii) match an incoming Webcam frame against the pre-transmitted facial expressions. To the best of our knowledge, such an approach has never been studied in the literature. We evaluate the implemented video conferencing system using Webcam videos captured from 9 subjects. Compared to the state-of-the-art scalable codec, our solution: (i) reduces the bitrate by about 130 times when the bandwidth is scarce, (ii) achieves the same coding efficiency when the bandwidth is sufficient, (iii) allows exercising the tradeoff between initialization overhead and coding efficiency, (iv) performs better when the resolution is higher, and (v) runs reasonably fast before extensive code optimization.

Fang-Yu Shih, Ching-Ling Fan, Pin-Chun Wang, Cheng-Hsin Hsu

### A Unified Framework for Monocular Video-Based Facial Motion Tracking and Expression Recognition

This paper proposes a unified facial motion tracking and expression recognition framework for monocular video. For retrieving facial motion, an online weight-adaptive statistical appearance method is embedded into the particle filtering strategy, using a deformable facial mesh model as an intermediate to bring input images into correspondence by means of registration and deformation. For recognizing facial expression, facial animation and facial expression are estimated sequentially for fast and efficient applications, in which facial expression is recognized by static anatomical facial expression knowledge. In addition, facial animation and facial expression are simultaneously estimated for robust and precise applications, in which facial expression is recognized by fusing static and dynamic facial expression knowledge. Experiments demonstrate the high tracking robustness and accuracy as well as the high facial expression recognition score of the proposed framework.

Jun Yu

### A Virtual Reality Framework for Multimodal Imagery for Vessels in Polar Regions

Maintaining total awareness when maneuvering an ice-breaking vessel is key to its safe operation. Camera systems are commonly used to augment the capabilities of those piloting the vessel, but rarely are these camera systems used beyond simple video feeds. To aid in visualization for decision making and operation, we present a scheme for combining multiple modalities of imagery into a cohesive Virtual Reality application which provides the user with an immersive, real scale, view of conditions around a research vessel operating in polar waters. The system incorporates imagery from a $$360^{\circ }$$ Long-wave Infrared camera as well as an optical band stereo camera system. The Virtual Reality application allows the operator multiple natural ways of interacting with and observing the data, as well as provides a framework for further inputs and derived observations.

Scott Sorensen, Abhishek Kolagunda, Andrew R. Mahoney, Daniel P. Zitterbart, Chandra Kambhamettu

### Adaptive and Optimal Combination of Local Features for Image Retrieval

Given the large number of local feature detectors and descriptors in the literature on Content-Based Image Retrieval (CBIR), in this work we propose a solution to predict the optimal combination of features for improving image retrieval performance, based on the spatial complementarity of interest point detectors. We review several complementarity criteria of detectors and employ them in a regression-based prediction model, designed to select a suitable detector combination for a dataset. The proposal can improve retrieval performance even more by selecting the optimal combination for each image (and not only globally for the dataset), as well as being profitable in the optimal fitting of some parameters. The proposal is appraised on three state-of-the-art datasets to validate its effectiveness and stability. The experimental results highlight the importance of spatial complementarity of the features to improve retrieval, and demonstrate the advantage of using this model to optimally adapt the detector combination and some parameters.

Neelanjan Bhowmik, Valérie Gouet-Brunet, Lijun Wei, Gabriel Bloch

### An Evaluation of Video Browsing on Tablets with the ThumbBrowser

We present an extension and evaluation of a novel interaction concept for video browsing on tablets. It can be argued that the best user experience for watching video on tablets can be achieved when the device is held in landscape orientation. Most mobile video players ignore this fact and make the interaction unnecessarily hard when the tablet is held with both hands. Naturally, in this hand posture only the thumbs are available for interaction. Our ThumbBrowser-interface takes this into account and combines it in its latest iteration with content analysis information as well as two different interaction methods. The interface was already introduced in a basic form in earlier work. In this paper we report on extensions that we applied and show first evaluation results in comparison to standard video players. We are able to show that our video browser is superior in terms of search accuracy and user satisfaction.

Marco A. Hudelist, Klaus Schoeffmann

### Binaural Sound Source Distance Reproduction Based on Distance Variation Function and Artificial Reverberation

In this paper, a method combining the distance variation function (DVF) and the image source method (ISM) is presented to generate binaural 3D audio with an accurate feeling of distance. The DVF is introduced to indicate the change in intensity and inter-aural difference when the distance between listener and source changes. Then an artificial reverberation simulated by ISM is added. The reverberation introduces the direct-to-reverberant energy ratio, which provides an absolute cue to distance perception. The distance perception test results indicate an improvement in distance perception when sound sources are located within 50 cm. In addition, the variance of perceptual distance was much smaller than that obtained using DVF only. The reduction of variance indicates that the method proposed in this paper can generate 3D audio with a more accurate and steadier feeling of distance.

Jiawang Xu, Xiaochen Wang, Maosheng Zhang, Cheng Yang, Ge Gao

### Color-Introduced Frame-to-Model Registration for 3D Reconstruction

3D reconstruction has become an active research topic with the popularity of consumer-grade RGB-D cameras, and registration for model alignment is one of the most important steps. Most typical systems adopt depth-based geometry matching, while the captured color images are totally discarded. Some recent methods further introduce photometric cue for better results, but only frame-to-frame matching is used. In this paper, a novel registration approach is proposed. According to both geometric and photometric consistency, depth and color information are involved in a unified optimization framework. With the available depth maps and color images, a global model with colored surface vertices is maintained. The incoming RGB-D frames are aligned based on frame-to-model matching for more effective camera pose estimation. Both quantitative and qualitative experimental results demonstrate that better reconstruction performance can be obtained by our proposal.

Fei Li, Yunfan Du, Rujie Liu

### Compressing Visual Descriptors of Image Sequences

In recent years, there has been significant progress in developing more compact visual descriptors, typically by aggregating local descriptors. However, all these methods are descriptors for still images, and are typically applied independently to (key) frames when used in tasks such as instance search in video. Thus, they do not make use of the temporal redundancy of the video, which has negative impacts on the descriptor size and the matching complexity. We propose a compressed descriptor for image sequences, which encodes a segment of video using a single descriptor. The proposed approach is a framework that can be used with different local descriptors, including compact descriptors. We describe the extraction and matching process for the descriptor and provide evaluation results on a large video data set.

Werner Bailer, Stefanie Wechtitsch, Marcus Thaler

### Deep Convolutional Neural Network for Bidirectional Image-Sentence Mapping

With the rapid development of the Internet and the explosion of data volume, it is important to access cross-media big data, including text, image, audio, and video, efficiently and accurately. However, the content heterogeneity and semantic gap make it challenging to retrieve such cross-media archives. The existing approaches try to learn the connection between multiple modalities by direct utilization of hand-crafted low-level features, and the learned correlations are merely constructed with high-level feature representations without considering semantic information. To further exploit the intrinsic structures of multimodal data representations, it is essential to build up an interpretable correlation between these heterogeneous representations. In this paper, a deep model is proposed to first learn the high-level feature representation shared by different modalities, such as texts and images, with a convolutional neural network (CNN). Moreover, the learned CNN features can reflect the salient objects as well as the details in the images and sentences. Experimental results demonstrate that the proposed approach outperforms current state-of-the-art baseline methods on the public Flickr8K dataset.

Tianyuan Yu, Liang Bai, Jinlin Guo, Zheng Yang, Yuxiang Xie

### Discovering Geographic Regions in the City Using Social Multimedia and Open Data

In this paper we investigate the potential of social multimedia and open data for automatically identifying regions within the city. We conjecture that the regions may be characterized by specific patterns related to their visual appearance, the manner in which the social media users describe them, and the human mobility patterns. Therefore, we collect a dataset of Foursquare venues, their associated images and users, which we further enrich with a collection of city-specific Flickr images, annotations and users. Additionally, we collect a large number of neighbourhood statistics related to e.g., demographics, housing and services. We then represent visual content of the images using a large set of semantic concepts output by a convolutional neural network and extract latent Dirichlet topics from their annotations. User, text and visual information as well as the neighbourhood statistics are further aggregated at the level of postal code regions, which we use as the basis for detecting larger regions in the city. To identify those regions, we perform clustering based on individual modalities as well as their ensemble. The experimental analysis shows that the automatically detected regions are meaningful and have a potential for better understanding dynamics and complexity of a city.

Stevan Rudinac, Jan Zahálka, Marcel Worring

### Discovering User Interests from Social Images

The last decades have witnessed the boom of social networks. As a result, discovering user interests from social media has gained increasing attention. While the accumulation of social media presents us with great opportunities for a better understanding of the users, the challenge lies in how to build a uniform model for the heterogeneous contents. In this article, we propose a hybrid mixture model for user interest discovery which exploits both the textual and visual content associated with social images. By modeling the features of each content source independently at the latent variable level and unifying them as latent interests, the proposed model allows the semantic interpretation of user interests from both the visual and textual perspectives. Qualitative and quantitative experiments on a Flickr dataset with 2.54 million images have demonstrated its promise for user interest analysis compared with existing methods.

Jiangchao Yao, Ya Zhang, Ivor Tsang, Jun Sun

### Effect of Junk Images on Inter-concept Distance Measurement: Positive or Negative?

In this paper, we focus on the problem of inter-concept distance measurement (ICDM), which is the task of computing the distance between two concepts. ICDM is generally achieved by constructing a visual model of each concept and calculating the dissimilarity score between two visual models. The process of visual concept modeling often suffers from the problem of junk images, i.e., images whose visual content is not related to the given text-tags. Similarly, it is naively expected that junk images also have a negative effect on the performance of ICDM. On the other hand, junk images might be related to their text-tags in a certain (non-visual) sense because the text-tags are given not by automated systems but by humans. Hence, the following question arises: Is the effect of junk images on the performance of ICDM positive or negative? In this paper, we aim to answer this non-trivial question experimentally using a unified framework for ICDM and junk image detection. Surprisingly, our experimental result indicates that junk images have a positive effect on the performance of ICDM.

Yusuke Nagasawa, Kazuaki Nakamura, Naoko Nitta, Noboru Babaguchi

### Exploiting Multimodality in Video Hyperlinking to Improve Target Diversity

Video hyperlinking is the process of creating links within a collection of videos to help navigation and information seeking. Starting from a given set of video segments, called anchors, a set of related segments, called targets, must be provided. In past years, a number of content-based approaches have been proposed, with good results obtained by searching for target segments that are very similar to the anchor in terms of content and information. Unfortunately, relevance has been obtained at the expense of diversity. In this paper, we study multimodal approaches and their ability to provide a set of diverse yet relevant targets. We compare two recently introduced cross-modal approaches, namely, deep auto-encoders and bimodal LDA, and experimentally show that both provide significantly more diverse targets than a state-of-the-art baseline. Bimodal autoencoders offer the best trade-off between relevance and diversity, with bimodal LDA exhibiting slightly more diverse targets at a lower precision.

Rémi Bois, Vedran Vukotić, Anca-Roxana Simon, Ronan Sicre, Christian Raymond, Pascale Sébillot, Guillaume Gravier

### Exploring Large Movie Collections: Comparing Visual Berrypicking and Traditional Browsing

We compare Visual Berrypicking, an interactive approach allowing users to explore large and highly faceted information spaces using similarity-based two-dimensional maps, with traditional browsing techniques. For large datasets, current projection methods used to generate map-like overviews suffer from increased computational costs and a loss of accuracy, resulting in inconsistent visualizations. We propose to interactively align inexpensive small maps, showing local neighborhoods only, which ideally creates the impression of panning a large map. For evaluation, we designed a web-based prototype for movie exploration and compared it to the web interface of The Movie Database (TMDb) in an online user study. Results suggest that users are able to effectively explore large movie collections by hopping from one neighborhood to the next. Additionally, due to the projection of movie similarities, interesting links between movies can be found more easily, and thus, compared to browsing, serendipitous discoveries are more likely.

Thomas Low, Christian Hentschel, Sebastian Stober, Harald Sack, Andreas Nürnberger

### Facial Expression Recognition by Fusing Gabor and Local Binary Pattern Features

Obtaining effective and discriminative facial appearance descriptors is a challenging task for facial expression recognition (FER). In this paper, a new FER method is proposed which combines two of the most successful facial appearance descriptors, namely Gabor filters and Local Binary Patterns (LBPs), considering that the former can represent facial shape and appearance over a broader range of scales and orientations while the latter can capture subtle appearance details. Firstly, feature vectors of Gabor and LBP representations are generated from the preprocessed face images. Secondly, feature fusion is applied to combine these two vectors and dimensionality reduction is conducted. Finally, a Support Vector Machine (SVM) is adopted to classify prototypical facial expressions using still images. The experimental results on the CK+ database demonstrate that the proposed method improves performance compared with using the Gabor or LBP descriptor alone, and outperforms several other methods.

Yuechuan Sun, Jun Yu
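
As a rough illustration of the LBP half of the pipeline described above, the following sketch computes the basic 3x3 LBP code of each interior pixel and collects the codes into a 256-bin histogram. The neighbor ordering and the function names are arbitrary choices made here, and the Gabor features, feature fusion, dimensionality reduction and SVM stages are not shown.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values

# Neighbor offsets (dy, dx), clockwise from the top-left pixel.
# The bit order is a convention chosen for this sketch.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img: Image, y: int, x: int) -> int:
    """Basic 8-neighbor LBP code: one bit per neighbor that is >= the center."""
    center = img[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if img[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img: Image) -> List[int]:
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[lbp_code(img, y, x)] += 1
    return hist
```

In practice such histograms are usually computed per face region and concatenated, which is one plausible way to obtain the LBP feature vector that is later fused with the Gabor responses.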

### Frame-Independent and Parallel Method for 3D Audio Real-Time Rendering on Mobile Devices

As 3D audio is a fundamental medium of virtual reality (VR), real-time 3D audio rendering is essential for the implementation of VR, especially on mobile devices. Because of the limited computational power of mobile devices, the computational load is too high to implement real-time 3D audio rendering on them. To solve this problem, we propose a frame-independent and parallel method of framing convolution to parallelize the process of 3D audio rendering using the head-related transfer function (HRTF). In order to remove the dependency of overlap-add convolution on adjacent frames, the convolution result data is added to the final results of the two adjacent frames. We found that our method could reduce the calculation time of 3D audio rendering significantly. The calculation times were 0.74 times, 0.5 times and 0.36 times the play duration of si03.wav (length of 27 s), with Snapdragon 801, Kirin 935 and Helio X10 Turbo, respectively.

Yucheng Song, Xiaochen Wang, Cheng Yang, Ge Gao, Wei Chen, Weiping Tu
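
To make the frame-wise convolution above more concrete, here is a minimal sketch of textbook overlap-add convolution in which each frame is convolved on its own and the partial results are summed afterwards. It is not the authors' implementation: the function names are hypothetical, the impulse response `h` stands in for an HRTF filter, and the per-frame work is done sequentially here although it could be dispatched to parallel workers.

```python
from typing import List, Sequence


def convolve(x: Sequence[float], h: Sequence[float]) -> List[float]:
    """Direct linear convolution of two non-empty sequences."""
    out = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            out[i + j] += xi * hj
    return out


def overlap_add(signal: Sequence[float], h: Sequence[float], frame_len: int) -> List[float]:
    """Convolve signal with h frame by frame, then sum the overlapping tails."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    if not signal or not h:
        return []

    # Each frame depends only on its own samples and on h, so these
    # convolutions are independent of one another.
    frames = [signal[s:s + frame_len] for s in range(0, len(signal), frame_len)]
    partial = [convolve(frame, h) for frame in frames]

    # Accumulation: the tail of each partial result (len(h) - 1 samples)
    # overlaps the start of the following frame's output region.
    out = [0.0] * (len(signal) + len(h) - 1)
    for k, block in enumerate(partial):
        start = k * frame_len
        for i, value in enumerate(block):
            out[start + i] += value
    return out
```

The result matches a single direct convolution of the whole signal; only the final accumulation touches samples shared by neighboring frames.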

### Illumination-Preserving Embroidery Simulation for Non-photorealistic Rendering

We present an illumination-preserving embroidery simulation method for Non-photorealistic Rendering (NPR). Our method turns an image into the embroidery style with its illumination preserved by intrinsic decomposition. This illumination-preserving feature distinguishes our method from previous work, eliminating the problem of inconsistent illumination. In our method, a two-dimensional stitch model is developed with some of the most commonly used stitch patterns, and the input image is intrinsically decomposed into a reflectance image and its corresponding shading image. The Chan-Vese active contour is adopted to segment the input image into regions, from which parameters are derived for stitch patterns. Appropriate stitch patterns are applied back onto the base material region by region and rendered with the intrinsic shading of the input image. Experimental results show that our method is capable of performing fine embroidery simulations, preserving the illumination of the input image.

Qiqi Shen, Dele Cui, Yun Sheng, Guixu Zhang

### Improving the Discriminative Power of Bag of Visual Words Model

With the exponential growth of image databases, the Content-Based Image Retrieval research field has started a race to propose ever more effective and efficient tools to manage massive amounts of data. In this paper, we focus on improving the discriminative power of the well-known bag of visual words model. To do so, we present n-BoVW, an approach that combines the effectiveness of the visual phrase model with the efficiency of the visual words model by means of a binary-based compression algorithm. Experimental results on widely used datasets (UKB, INRIA Holidays, Corel1000 and PASCAL 2012) show the effectiveness of the proposed approach.

Achref Ouni, Thierry Urruty, Muriel Visani

### M-SBIR: An Improved Sketch-Based Image Retrieval Method Using Visual Word Mapping

Sketch-based image retrieval (SBIR) systems, which interactively search photo collections using free-hand sketches depicting shapes, have attracted much attention recently. In most existing SBIR techniques, the color images stored in a database are first transformed into corresponding sketches. Then, features of the sketches are extracted to generate the sketch visual words for later retrieval. However, transforming color images to sketches will normally incur loss of information, thus decreasing the final performance of SBIR methods. To address this problem, we propose a new method called M-SBIR. In M-SBIR, besides sketch visual words, we also generate a set of visual words from the original color images. Then, we leverage the mapping between the two sets to identify and remove sketch visual words that cannot describe the original color images well. We demonstrate the performance of M-SBIR on a public data set. We show that depending on the number of different visual words adopted, our method can achieve $$9.8\sim 13.6\%$$ performance improvement compared to the classic SBIR techniques. In addition, we show that for a database containing multiple color images of the same objects, the performance of M-SBIR can be further improved via some simple techniques like co-segmentation.

Jianwei Niu, Jun Ma, Jie Lu, Xuefeng Liu, Zeyu Zhu

### Movie Recommendation via BLSTM

Traditional recommender systems have achieved remarkable success. However, they only consider users’ long-term interests, ignoring situations in which new users do not have any profile or users delete their tracking information. To solve this problem, session-based recommendation based on Recurrent Neural Networks (RNN) was proposed, which makes recommendations by taking into account only the behavior of users over a period of time. The model showed promising improvements over traditional recommendation approaches. In this paper, we apply bidirectional long short-term memory (BLSTM) to movie recommender systems to deal with the above problems. Experiments on the MovieLens dataset demonstrate relative improvements over previously reported results on the Recall@N metrics, and the model generates more reliable and personalized movie recommendations than existing methods.

Song Tang, Zhiyong Wu, Kang Chen
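
The Recall@N metric mentioned above can be sketched as follows. This uses one common definition, the fraction of a user's relevant items that appear in the top N of the ranked list; the abstract does not state which variant the authors use, and the function name is hypothetical.

```python
from typing import Hashable, Iterable, Sequence


def recall_at_n(ranked: Sequence[Hashable], relevant: Iterable[Hashable], n: int) -> float:
    """Fraction of relevant items that appear among the first n ranked items."""
    if n <= 0:
        raise ValueError("n must be positive")
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # convention chosen for this sketch when nothing is relevant
    hits = set(ranked[:n]) & relevant_set
    return len(hits) / len(relevant_set)
```

When each test case has exactly one held-out item, as is typical in session-based evaluation, this reduces to a hit indicator that is then averaged over all test cases.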

### Multimodal Video-to-Video Linking: Turning to the Crowd for Insight and Evaluation

Video-to-video linking systems allow users to explore and exploit the content of a large-scale multimedia collection interactively and without the need to formulate specific queries. We present a short introduction to video-to-video linking (also called ‘video hyperlinking’), and describe the latest edition of the Video Hyperlinking (LNK) task at TRECVid 2016. The emphasis of the LNK task in 2016 is on multimodality as used by videomakers to communicate their intended message. Crowdsourcing makes three critical contributions to the LNK task. First, it allows us to verify the multimodal nature of the anchors (queries) used in the task. Second, it enables us to evaluate the performance of video-to-video linking systems at large scale. Third, it gives us insights into how people understand the relevance relationship between two linked video segments. These insights are valuable since the relationship between video segments can manifest itself at different levels of abstraction.

Maria Eskevich, Martha Larson, Robin Aly, Serwah Sabetghadam, Gareth J. F. Jones, Roeland Ordelman, Benoit Huet

### Online User Modeling for Interactive Streaming Image Classification

In view of the explosive growth of personal images, this paper proposes an online user modeling method for the categorization of streaming images. In the proposed framework, user interaction is brought in after an automatic classification by the learned classifier, and several strategies are used for online user modeling. Firstly, to cover diverse personalized taxonomies, we describe images from multiple views. Secondly, to train the classifier gradually, we use an incremental variant of the nearest class mean classifier and update the class means incrementally. Finally, to learn the diverse interests of different users, we propose an online learning strategy to learn the weights of different feature views. Using the proposed method, users can categorize streaming images flexibly and freely without any pre-labeled images or pre-trained classifiers. As the classification goes on, the efficiency keeps increasing, which can ease the user’s interaction burden significantly. The experimental results and a user study demonstrate the effectiveness of our approach.

Jiagao Hu, Zhengxing Sun, Bo Li, Kewei Yang, Dongyang Li
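
The incremental nearest class mean step described above can be sketched in a few lines. This is a generic single-view version written for illustration: the class name, the Euclidean distance, and the running-mean update are assumptions, and the paper's multi-view features and learned view weights are not modeled.

```python
import math
from typing import Dict, List, Sequence


class IncrementalNCM:
    """Nearest class mean classifier whose class means are updated one sample at a time."""

    def __init__(self) -> None:
        self.means: Dict[str, List[float]] = {}
        self.counts: Dict[str, int] = {}

    def update(self, x: Sequence[float], label: str) -> None:
        """Fold one labeled sample into its class mean; new labels create new classes."""
        if label not in self.means:
            self.means[label] = [float(v) for v in x]
            self.counts[label] = 1
            return
        self.counts[label] += 1
        n = self.counts[label]
        mean = self.means[label]
        for i, xi in enumerate(x):
            # Running mean: m_n = m_{n-1} + (x - m_{n-1}) / n
            mean[i] += (xi - mean[i]) / n

    def predict(self, x: Sequence[float]) -> str:
        """Return the label whose mean is closest to x in Euclidean distance."""
        if not self.means:
            raise ValueError("no classes have been learned yet")
        return min(self.means, key=lambda label: math.dist(x, self.means[label]))
```

In an interactive loop, `predict` would propose a category for each incoming image and `update` would be called with the label the user confirms or corrects, so no pre-labeled training set is needed.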

### Recognizing Emotions Based on Human Actions in Videos

Systems for automatic analysis of videos are in high demand as videos are expanding rapidly on the Internet, and understanding the emotions carried by videos (e.g. “anger”, “happiness”) is becoming a hot topic. While existing affective computing models mainly focus on facial expression recognition, few attempts have been made to explore the relationship between emotion and human action. In this paper, we propose a comprehensive emotion classification framework based on spatio-temporal volumes built with human actions. For each action unit obtained, we use Dense-SIFT as the descriptor and K-means to form histograms. Finally, the histograms are sent to the mRVM to recognize the human emotion. The experimental results show that our method performs well on the FABO dataset.

Guolong Wang, Zheng Qin, Kaiping Xu

### Rocchio-Based Relevance Feedback in Video Event Retrieval

This paper investigates methods for user and pseudo relevance feedback in video event retrieval. Existing feedback methods achieve strong performance but adjust the ranking based on few individual examples. We propose a relevance feedback algorithm (ARF) derived from the Rocchio method, which is a theoretically founded algorithm in textual retrieval. ARF updates the weights in the ranking function based on the centroids of the relevant and non-relevant examples. Additionally, relevance feedback algorithms are often only evaluated by a single feedback mode (user feedback or pseudo feedback). Hence, a minor contribution of this paper is to evaluate feedback algorithms using a larger number of feedback modes. Our experiments use TRECVID Multimedia Event Detection collections. We show that ARF performs significantly better in terms of Mean Average Precision, robustness, subjective user evaluation, and run time compared to the state-of-the-art.

G. L. J. Pingen, M. H. T. de Boer, R. B. N. Aly
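
For readers unfamiliar with the Rocchio method that ARF is derived from, the following sketch shows the classic update applied to a weight vector. It is the textbook formulation rather than the ARF algorithm itself; the function names and the default values of `alpha`, `beta` and `gamma` are illustrative assumptions.

```python
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]], dim: int) -> List[float]:
    """Component-wise mean of the vectors, or a zero vector if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio_update(weights: Sequence[float],
                   relevant: Sequence[Sequence[float]],
                   non_relevant: Sequence[Sequence[float]],
                   alpha: float = 1.0, beta: float = 0.75, gamma: float = 0.15) -> List[float]:
    """Move the weights toward the relevant centroid and away from the non-relevant one."""
    dim = len(weights)
    pos = centroid(relevant, dim)
    neg = centroid(non_relevant, dim)
    return [alpha * w + beta * p - gamma * n for w, p, n in zip(weights, pos, neg)]
```

With user feedback, `relevant` and `non_relevant` would hold the feature vectors of the videos a user marked; with pseudo feedback, one would typically treat the top-ranked results as relevant without asking the user.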

### Scale-Relation Feature for Moving Cast Shadow Detection

Moving cast shadow detection is a problem in visual surveillance applications that has been studied for years. However, finding an efficient model that can handle moving cast shadows in various situations is still challenging. Unlike prior methods, we use a data-driven method without strong parametric assumptions or complex models to address the problem of moving cast shadows. In this paper, we propose a novel feature-extracting framework called Scale-Relation Feature Extracting (SRFE). By leveraging the scale space, SRFE decomposes each image with various properties into various scales and further considers the relationship between adjacent scales of the two shadow properties to extract the scale-relation features. To obtain the criteria for discriminating moving cast shadows, we use the random forest algorithm as the ensemble decision scheme. Experimental results show that the proposed method achieves performance comparable to that of popular methods on a widely used dataset.

Chih-Wei Lin

### Smart Loudspeaker Arrays for Self-Coordination and User Tracking

The Internet of Things paradigm aims at developing new services through the interconnection of sensing and actuating devices. In this work, we demonstrate what can be achieved through the interaction between multiple sound devices arbitrarily deployed in space but connected through a unified network. In particular, we introduce techniques to realize a smart sound array through simultaneous synchronization and layout coordination of multiple sound devices. As a promising application of the smart sound array, we show that acoustic tracking of a user's location is possible by analyzing scattered waves induced by the exchange of acoustic signals between multiple sound objects.

Jungju Jee, Jung-Woo Choi

### Spatial Verification via Compact Words for Mobile Instance Search

Instance search is a retrieval task that searches for video segments or images relevant to a specific instance (object, person, or location). Selecting more representative visual words is a significant challenge for instance search, since spatial relations between features are leveraged in many state-of-the-art methods. However, with the popularity of mobile devices it is now feasible to use multiple similar photos from mobile devices as a query to extract representative visual words. This paper proposes a novel approach for mobile instance search based on spatial analysis with a few representative visual words extracted from multiple photos. We develop a scheme that applies three criteria to rank visual words: BM25 with exponential IDF (EBM25), significance in multiple photos, and separability. Then, a spatial verification method based on position relations is applied to a few visual words to obtain the weight of each selected photo. In consideration of the limited bandwidth and instability of the wireless channel, our approach transmits only a few visual words from the mobile client to the server, and the number of visual words varies with bandwidth. We evaluate our approach on the Oxford Buildings dataset, and the experimental results demonstrate a notable improvement in average precision over several state-of-the-art methods, including spatial coding, query expansion and multiple photos.

Bo Wang, Jie Shao, Chengkun He, Gang Hu, Xing Xu

### Stochastic Decorrelation Constraint Regularized Auto-Encoder for Visual Recognition

Deep neural networks have achieved state-of-the-art performance on many applications such as image classification, object detection and semantic segmentation. However, optimization remains difficult when training networks with a huge number of parameters. In this work, we propose a novel regularizer called stochastic decorrelation constraint (SDC), imposed on the hidden layers of large networks, which can significantly improve the networks’ generalization capacity. SDC reduces the co-adaptations of the hidden neurons in an explicit way, with a clear objective function. Meanwhile, we show that training the network with our regularizer has the effect of training an ensemble of exponentially many networks. We apply the proposed regularizer to the auto-encoder for visual recognition tasks. Compared to the auto-encoder without any regularizers, the SDC-constrained auto-encoder can extract features with less redundancy. Comparative experiments on the MNIST database and the FERET database demonstrate the superiority of our method. When the size of the training data is reduced, the optimization of the network becomes much more challenging, yet our method shows even larger advantages over conventional methods.

Fengling Mao, Wei Xiong, Bo Du, Lefei Zhang

### The Perceptual Lossless Quantization of Spatial Parameter for 3D Audio Signals

With the development of multichannel audio systems, 3D audio systems have already come into our lives. But the increasing number of channels brings challenges to the storage and transmission of large amounts of data. Spatial Audio Coding (SAC), the mainstream of 3D audio coding technologies, is key to reproducing 3D multichannel audio signals with efficient compression. Just Noticeable Difference (JND) characteristics of the human auditory system can be utilized to reduce spatial perceptual redundancy in the spatial parameter quantization process of SAC. However, the current quantization methods of SAC do not fully combine the JND characteristics. In this paper, we propose a Perceptual Lossless Quantization of Spatial Parameter (PLQSP) method, in which the azimuthal and elevational quantization step sizes of spatial parameters are combined with JNDs. Both objective and subjective experiments have been conducted to prove the high efficiency of the PLQSP method. Compared with the reference methods SLQP-L and SLQP-H, the quantization codebook size of PLQSP has decreased by 16.99% and 27.79% respectively, while preserving similar listening quality.

Gang Li, Xiaochen Wang, Li Gao, Ruimin Hu, Dengshi Li

### Unsupervised Multiple Object Cosegmentation via Ensemble MIML Learning

Multiple foreground cosegmentation (MFC) has recently become a new research topic in computer vision. This paper proposes a framework for unsupervised multiple object cosegmentation, which is composed of three components: unsupervised label generation, saliency pseudo-annotation and cosegmentation based on MIML learning. Based on object detection, unsupervised label generation is performed with a two-stage object clustering method to obtain accurate, consistent labels between common objects without any user intervention. Then, the object label is propagated to the object saliency produced by a saliency detection method to complete the saliency pseudo-annotation. This turns the unsupervised MFC problem into a supervised multi-instance multi-label (MIML) learning problem. Finally, an ensemble MIML framework is introduced to achieve image cosegmentation based on random feature selection. The experimental results on the ICoseg and FlickrMFC data sets demonstrate the effectiveness of the proposed approach.

Weichen Yang, Zhengxing Sun, Bo Li, Jiagao Hu, Kewei Yang

### Using Object Detection, NLP, and Knowledge Bases to Understand the Message of Images

With the increasing amount of multimodal content from social media posts and news articles, there has been an intensified effort towards conceptual labeling and multimodal (topic) modeling of images and of their affiliated texts. Nonetheless, the problem of identifying and automatically naming the core abstract message (gist) behind images has received less attention. This problem is especially relevant for the semantic indexing and subsequent retrieval of images. In this paper, we propose a solution that makes use of external knowledge bases such as Wikipedia and DBpedia. Its aim is to leverage complex semantic associations between the image objects and the textual caption in order to uncover the intended gist. The results of our evaluation demonstrate the ability of our proposed approach to detect gist, with a best MAP score of 0.74 when assessed against human annotations. Furthermore, an automatic image tagging and caption generation API is compared to manually set image and caption signals. We show and discuss the difficulty of finding the correct gist, especially for abstract, non-depictable gists, as well as the impact of different types of signals on gist detection quality.

Lydia Weiland, Ioana Hulpus, Simone Paolo Ponzetto, Laura Dietz

### Video Search via Ranking Network with Very Few Query Exemplars

This paper addresses the challenge of video search with only a handful of query exemplars by proposing a triplet ranking network-based method. In the typical scenario for a video search system, a user begins the query process by using the metadata-based text-to-video search module to find an initial set of videos of interest in the video repository. As bridging the semantic gap between text and video is very challenging, usually only a handful of relevant videos appear in the initial retrieved results. The user can then use the video-to-video search module to train a new classifier to search for more relevant videos. However, since we found that statistically fewer than 5 videos are initially relevant, training a complex event classifier with a handful of examples is extremely challenging. Therefore, a video retrieval method is needed that works with only a handful of positive training example videos. The proposed triplet ranking network is mainly designed for this situation and has the following properties: (1) The ranking network can learn an off-line similarity matching projection, which is event independent, from previous video search tasks or datasets, so that even with only one query video we can search for its related videos. The method can then transfer previous knowledge to the specific video retrieval task as more and more related videos are retrieved, further improving the retrieval performance; (2) It casts the video search task as a ranking problem and can exploit partial ordering information in the dataset; (3) Based on the above two merits, the method is suitable for the case where only a handful of positive examples are available. Experimental results show the effectiveness of our proposed method on video retrieval with only a handful of positive exemplars.

De Cheng, Lu Jiang, Yihong Gong, Nanning Zheng, Alexander G. Hauptmann
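
The ranking formulation in the abstract above is commonly trained with a triplet hinge loss, which asks a query to lie closer to a relevant video than to an irrelevant one by some margin. The following is a minimal sketch of that general loss, not the authors' network: the Euclidean distance, the `margin` value, and the toy embeddings are assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet hinge loss (illustrative; margin is an assumption).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, margin
               + euclidean(anchor, positive)
               - euclidean(anchor, negative))


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from nearest to farthest."""
    return sorted(range(len(candidates)),
                  key=lambda i: euclidean(query, candidates[i]))


# Hypothetical 2-D embeddings: a violated triplet and a satisfied one.
violated = triplet_hinge_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
satisfied = triplet_hinge_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
order = rank_by_similarity([0.0, 0.0], [[3.0, 4.0], [0.0, 1.0]])
```

At search time only the distance ranking is needed, so even a single query exemplar can be used once the embedding has been learned elsewhere.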

### A Demo for Image-Based Personality Test

In this demo, we showcase an image-based personality test. Compared with the traditional text-based personality test, the proposed new test is more natural, objective, and language-insensitive. With each question consisting of images describing the same concept, the subjects are asked to choose their favorite image. Based on the choices made in typically 15–25 questions, we can accurately estimate the subjects’ personality traits. The whole process takes less than 5 minutes. The online demo, available at http://www.visualbfi.org/, adapts well to PCs and smartphones.

Huaiwen Zhang, Jiaming Zhang, Jitao Sang, Changsheng Xu

### A Web-Based Service for Disturbing Image Detection

As User Generated Content takes up an increasing share of the total Internet multimedia traffic, it becomes increasingly important to protect users (be they consumers or professionals, such as journalists) from potentially traumatizing content that is accessible on the web. In this demonstration, we present a web service that can identify disturbing or graphic content in images. The service can be used by platforms for filtering or to warn users prior to exposing them to such content. We evaluate the performance of the service and propose solutions towards extending the training dataset and thus further improving the performance of the service, while minimizing emotional distress to human annotators.

Markos Zampoglou, Symeon Papadopoulos, Yiannis Kompatsiaris, Jochen Spangenberg

### An Annotation System for Egocentric Image Media

Manual annotation of ego-centric visual media for lifelogging, activity monitoring, object counting, etc. is challenging due to the repetitive nature of the images, especially for events such as driving, eating, meeting, watching television, etc., where there is no change in scenery. This makes the annotation task boring, and there is a danger of missing things through loss of concentration. This is particularly problematic when labelling infrequently or irregularly occurring objects or short activities. To date, annotation approaches have structured visual lifelogs into events and then annotated at the event or sub-event levels, but this can be limiting when the annotation task involves labelling a wider variety of topics: events, activities, interactions and/or objects. Here we build on our prior experiences of annotating at event level and present a new annotation interface. This demonstration will show a software platform for annotating different levels of labels by different projects, with different aims, for ego-centric visual media.

Aaron Duane, Jiang Zhou, Suzanne Little, Cathal Gurrin, Alan F. Smeaton

### DeepStyleCam: A Real-Time Style Transfer App on iOS

In this demo, we present a very fast CNN-based style transfer system running on normal iPhones. The proposed app can transfer multiple pre-trained styles to the video stream captured from the built-in camera of an iPhone in around 140 ms per frame (7 fps). We extended the network proposed as a real-time neural style transfer network by Johnson et al. [1] so that the network can learn multiple styles at the same time. In addition, we modified the CNN so that the amount of computation is reduced to one tenth of that of the original network. The very fast mobile implementation of the app is based on our paper [2], which describes several new ideas for implementing CNNs on mobile devices efficiently. Figure 1 shows an example usage of DeepStyleCam running on an iPhone SE.

Ryosuke Tanno, Shin Matsuo, Wataru Shimoda, Keiji Yanai

### V-Head: Face Detection and Alignment for Facial Augmented Reality Applications

Efficient and accurate face detection and alignment are key techniques for facial augmented reality (AR) applications. In this paper, we introduce V-Head, a facial AR system which consists of three major components: (1) joint face detection and shape initialization, which can efficiently localize facial regions based on the proposed face probability map and a multipose classifier and at the same time explicitly produce a roughly aligned initial shape, (2) cascade face alignment to locate 2D facial landmarks on the detected face, and (3) 3D head pose estimation based on the perspective-n-point (PnP) algorithm so as to overlay 3D virtual objects on the detected faces. The demonstration can be accessed from https://drive.google.com/open?id=0B-H2fYiPunUtRHBFTDRzRkZvVEE.

Zhiwei Wang, Xin Yang

### Collaborative Feature Maps for Interactive Video Search

This extended demo paper summarizes our interface used for the Video Browser Showdown (VBS) 2017 competition, where visual and textual known-item search (KIS) tasks, as well as ad-hoc video search (AVS) tasks, need to be solved interactively in a 600-hour video archive. To this end, we propose a very flexible distributed video search system that combines many ideas from related work in a novel and collaborative way, such that several users can work together and explore the video archive in a complementary manner. The main interface is a perspective Feature Map, which shows keyframes of shots arranged according to a selected content similarity feature (e.g., color, motion, semantic concepts, etc.). This Feature Map is accompanied by additional views, which allow users to search and filter according to a particular content feature. For collaboration among several users, we provide a cooperative heatmap that shows a synchronized view of the inspection actions of all users. Moreover, we use collaborative re-ranking of shots (in specific views) based on the retrieved results of other users.

Klaus Schoeffmann, Manfred Jürgen Primus, Bernd Muenzer, Stefan Petscharnig, Christof Karisch, Qing Xu, Wolfgang Huerst

### Concept-Based Interactive Search System

Our successful multimedia event detection system at TRECVID 2015 showed its strength in handling complex concepts in a query. The system was based on a large number of pre-trained concept detectors for relating text to visual content. In this paper, we enhance the system by enabling human-in-the-loop interaction. To help a user quickly satisfy an information need, we incorporate concept screening, video reranking by highlighted concepts, relevance feedback and color sketch to refine a coarse retrieval result. The aim is to eventually arrive at a system suitable for both Ad-hoc Video Search and Known-Item Search. In addition, given the increasing awareness of the difficulty of distinguishing shots of very similar scenes, we also explore automatic story annotation along the timeline of a video, so that a user can quickly grasp the story that happened in the context of a target shot and reject shots with incorrect context. With the story annotation, a user can also refine the search result by simply adding a few keywords in a special “context field” of a query.

Yi-Jie Lu, Phuong Anh Nguyen, Hao Zhang, Chong-Wah Ngo

### Enhanced Retrieval and Browsing in the IMOTION System

This paper presents the IMOTION system in its third version. While still focusing on sketch-based retrieval, we improved upon the semantic retrieval capabilities introduced in the previous version by adding more detectors and improving the interface for semantic query specification. Compared with the previous year’s system, we increase the role of features obtained from Deep Neural Networks in three areas: semantic class labels for more entry-level concepts, hidden layer activation vectors for query-by-example, and 2D semantic similarity results display. The new graph-based result navigation interface further enriches the system’s browsing capabilities. The updated database storage system $\textsf{ADAM}_{pro}$, designed from the ground up for large-scale multimedia applications, ensures scalability to steadily growing collections.

Luca Rossetto, Ivan Giangreco, Claudiu Tănase, Heiko Schuldt, Stéphane Dupont, Omar Seddati

### Semantic Extraction and Object Proposal for Video Search

In this paper, we propose two approaches to deal with the problems of video searching: ad-hoc video search and known-item search. First, we propose to combine multiple semantic concepts extracted from multiple networks trained on many data domains. Second, to help users find the exact video shot that has been shown before, we propose a sketch-based search system which detects and indexes many objects proposed by an object proposal algorithm. In this way, we leverage not only the concepts but also the spatial relations between them.

Vinh-Tiep Nguyen, Thanh Duc Ngo, Duy-Dinh Le, Minh-Triet Tran, Duc Anh Duong, Shin’ichi Satoh

### Storyboard-Based Video Browsing Using Color and Concept Indices

We present an interface for interactive video browsing where users visually skim storyboard representations of the files in search of known items (known-item search tasks) and textually described subjects, objects, or events (ad-hoc search tasks). Individual segments of the video are represented as a color-sorted storyboard that can be addressed via a color index. Our storyboard representation is optimized for quick visual inspections considering results from our ongoing research. In addition, a concept-based search is used to filter out parts of the storyboard containing the related concept(s), thus complementing the human-based visual inspection with a semantic, content-based annotation.

Wolfgang Hürst, Algernon Ip Vai Ching, Klaus Schoeffmann, Manfred J. Primus

### VERGE in VBS 2017

This paper presents the VERGE interactive video retrieval engine, which is capable of browsing and searching video content. The system integrates several content-based analysis and retrieval modules, including concept detection, clustering, visual similarity search, object-based search, query analysis, and multimodal and temporal fusion.

Anastasia Moumtzidou, Theodoros Mironidis, Fotini Markatopoulou, Stelios Andreadis, Ilias Gialampoukidis, Damianos Galanopoulos, Anastasia Ioannidou, Stefanos Vrochidis, Vasileios Mezaris, Ioannis Kompatsiaris, Ioannis Patras

### Video Hunter at VBS 2017

After almost three years of development, the Video Hunter tool (formerly the Signature-Based Video Browser) has become a complex tool combining different query modalities, multi-sketches, visualizations and browsing techniques. In this paper, we present additional improvements of the tool focusing on keyword search. More specifically, we present a method relying on an external image search engine and a method relying on ImageNet labels. We also present a keyframe caching method employed by our tool.

Adam Blažek, Jakub Lokoč, David Kuboň

### Backmatter
