
About this Book

The two-volume set LNCS 10132 and 10133 constitutes the thoroughly refereed proceedings of the 23rd International Conference on Multimedia Modeling, MMM 2017, held in Reykjavik, Iceland, in January 2017.

Of the 149 full papers submitted, 36 were selected for oral presentation and 33 for poster presentation; of the 34 special session papers submitted, 24 were selected for oral presentation and 2 for poster presentation; in addition, 5 of 8 submitted demonstrations were accepted, as were all 7 submissions to VBS 2017. In total, the papers presented were carefully reviewed and selected from 198 submissions.

MMM is a leading international conference for researchers and industry practitioners to share new ideas, original research results and practical development experiences from all MMM-related areas, broadly falling into three categories: multimedia content analysis; multimedia signal processing and communications; and multimedia applications and services.

Table of Contents

Frontmatter

Erratum to: ReMagicMirror: Action Learning Using Human Reenactment with the Mirror Metaphor

Fabian Lorenzo Dayrit, Ryosuke Kimura, Yuta Nakashima, Ambrosio Blanco, Hiroshi Kawasaki, Katsushi Ikeuchi, Tomokazu Sato, Naokazu Yokoya

Full Papers Accepted for Oral Presentation

Frontmatter

3D Sound Field Reproduction at Non Central Point for NHK 22.2 System

Reducing the number of channels simplifies the loudspeaker layout of the NHK 22.2 system and makes it practical for home environments. In 2011, Akio Ando proposed a down-mixing method that simplifies the 22.2 multichannel system to 10.2 or 8.2 multichannel systems, but this method can only perfectly reproduce the 3D sound field at a central listening point. In practice, listeners may sit at a non-central point, where Ando's method cannot maintain the physical properties of the sound field well. Conventional methods for non-central-zone sound field reproduction, such as pressure matching and particle velocity matching, have theoretical limitations. This paper proposes a general down-mixing method based on the position of the listening point and the physical properties of the sound field; it can produce a sweet spot at any non-central point in the reconstructed field while reducing the number of channels. In experiments, the proposed method simplifies a 22-channel system to a 10-channel system, and the results demonstrate that it outperforms the traditional method in sound field reconstruction at non-central points.

Song Wang, Ruimin Hu, Shihong Chen, Xiaochen Wang, Yuhong Yang, Weiping Tu, Bo Peng
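At its core, the down-mixing described above amounts to applying a mixing matrix to every audio sample. A minimal sketch in Python; the 3-to-2 matrix and weights below are purely illustrative, not Ando's or the authors' actual 22-to-10 coefficients:

```python
def downmix(frames, matrix):
    """Down-mix multichannel audio frames with a mixing matrix.

    frames: list of per-sample lists, each of length n_in.
    matrix: n_out x n_in list of weights.
    Returns a list of per-sample lists of length n_out.
    """
    out = []
    for sample in frames:
        out.append([sum(w * s for w, s in zip(row, sample)) for row in matrix])
    return out

# Illustrative 3-to-2 down-mix: front left/right pass through,
# the centre channel is split equally between the two outputs.
matrix = [
    [1.0, 0.0, 0.5],   # left  = L + 0.5 * C
    [0.0, 1.0, 0.5],   # right = R + 0.5 * C
]
frames = [[0.2, -0.1, 0.4]]
print(downmix(frames, matrix))  # [[0.4, 0.1]]
```

The paper's contribution lies in how the matrix weights are chosen for a given (non-central) listening position; the mechanics of applying them are as above.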

A Comparison of Approaches for Automated Text Extraction from Scholarly Figures

So far, there has not been a comparative evaluation of different approaches for text extraction from scholarly figures. In order to fill this gap, we have defined a generic pipeline for text extraction that abstracts from the existing approaches as documented in the literature. In this paper, we use this generic pipeline to systematically evaluate and compare 32 configurations for text extraction over four datasets of scholarly figures of different origin and characteristics. In total, our experiments have been run over more than 400 manually labeled figures. The experimental results show that the approach BS-4OS results in the best F-measure of 0.67 for the Text Location Detection and the best average Levenshtein Distance of 4.71 between the recognized text and the gold standard on all four datasets using the Ocropy OCR engine.

Falk Böschen, Ansgar Scherp
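The Levenshtein distance used above as the OCR quality metric is the classic edit distance; a compact sketch of the standard dynamic program:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b
    (insertions, deletions, substitutions, each of cost 1)."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("figure 3a", "Figure 3a"))  # 1
print(levenshtein("kitten", "sitting"))       # 3
```

Averaging this distance between recognized and gold-standard strings over all figures yields scores like the 4.71 reported above.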

A Convolutional Neural Network Approach for Post-Processing in HEVC Intra Coding

Lossy image and video compression algorithms yield visually annoying artifacts, including blocking, blurring, and ringing, especially at low bit-rates. To reduce these artifacts, post-processing techniques have been extensively studied. Recently, inspired by the great success of convolutional neural networks (CNNs) in computer vision, several studies have adopted CNNs for post-processing, mostly for JPEG-compressed images. In this paper, we present a CNN-based post-processing algorithm for High Efficiency Video Coding (HEVC), the state-of-the-art video coding standard. We design a Variable-filter-size Residue-learning CNN (VRCNN) to improve performance and to accelerate network training. Experimental results show that using VRCNN as post-processing leads to on average 4.6% bit-rate reduction compared to the HEVC baseline. VRCNN outperforms previously studied networks, achieving higher bit-rate reduction, lower memory cost, and a multiplied computational speedup.

Yuanying Dai, Dong Liu, Feng Wu

A Framework of Privacy-Preserving Image Recognition for Image-Based Information Services

Nowadays, mobile devices such as smartphones are widely used all over the world, and the performance of image recognition has dramatically increased thanks to deep learning technologies. Against this background, we expect the following information service to become possible in the near future: users take a photo and send it to a server, which recognizes the location in the photo and returns information about that location to the users. However, this kind of client-server image recognition can raise a privacy issue, because recognition results are sometimes privacy sensitive. To tackle this issue, we propose a novel framework for privacy-preserving image recognition in which the server cannot uniquely identify the recognition result but the users can. The framework works as follows: first, users extract a visual feature from their photo and transform it so that the server cannot uniquely identify the recognition result. Then users send the transformed feature to the server, which returns a candidate set of recognition results. Finally, the users compare the candidates with the original visual feature to obtain the final result. Our experimental results demonstrate the effectiveness of the proposed framework.

Kojiro Fujii, Kazuaki Nakamura, Naoko Nitta, Noboru Babaguchi

A Real-Time 3D Visual Singing Synthesis: From Appearance to Internal Articulators

A facial animation system is proposed for visual singing synthesis. With a reconstructed 3D head mesh model, both a finite element method and an anatomical model are used to simulate the articulatory deformation corresponding to each phoneme with its musical note. Based on an articulatory song corpus, articulatory movements, phonemes and musical notes are trained simultaneously to obtain a visual co-articulation model using a context-dependent Hidden Markov Model. Articulatory animations corresponding to all phonemes are then concatenated by the visual co-articulation model to produce song-synchronized articulatory animation. Experimental results demonstrate, both objectively and subjectively, that the system can synthesize realistic song-synchronized articulatory animation, increasing human-computer interaction capability.

Jun Yu

A Structural Coupled-Layer Tracking Method Based on Correlation Filters

A recent trend in visual tracking is to employ correlation filter based formulations for their high efficiency and superior performance. To deal with the partial occlusion issue, part-based methods via correlation filters have been introduced to visual tracking and have achieved promising results. However, these methods ignore the intrinsic relationships among local parts and do not consider the spatial structure inside the target. In this paper, we propose a coupled-layer tracking method based on correlation filters that resolves this problem by incorporating structural constraints between the global bounding box and local parts. In our method, the target state is optimized jointly in a unified objective function taking into account both the appearance information of all parts and the structural constraints between parts. In that way, our method not only retains the advantages of existing correlation filter trackers, such as high efficiency, robustness, and the ability to handle partial occlusion well due to the part-based strategy, but also preserves object structure. Experimental results on a challenging benchmark dataset demonstrate that our proposed method outperforms state-of-the-art trackers.

Sheng Chen, Bin Liu, Chang Wen Chen

Augmented Telemedicine Platform for Real-Time Remote Medical Consultation

Current telemedicine systems for remote medical consultation are based on decades old video-conferencing technology. Their primary role is to deliver video and voice communication between medical providers and to transmit vital signs of the patient. This technology, however, does not provide the expert physician with the same hands-on experience as when examining a patient in person. Virtual and Augmented Reality (VR and AR) on the other hand have the capacity to enhance the experience and communication between healthcare professionals in geographically distributed locations. By transmitting RGB+D video of the patient, the expert physician can interact with this real-time 3D representation in novel ways. Furthermore, the use of AR technology at the patient side has potential to improve communication by providing clear visual instructions to the caregiver. In this paper, we propose a framework for 3D real-time communication that combines interaction via VR and AR. We demonstrate the capabilities of our framework on a prototype system consisting of a depth camera, projector and 3D display. The system is used to analyze the network performance and data transmission quality of the multimodal streaming in a remote scenario.

David Anton, Gregorij Kurillo, Allen Y. Yang, Ruzena Bajcsy

Color Consistency for Photo Collections Without Gamut Problems

In this paper, we present a color consistency technique that makes images in the same collection share the same color style while avoiding gamut problems. Some previous methods define simple global parameter-based models and use optimization algorithms to obtain the unknown parameters, which usually causes gamut problems in bright and dark regions. Our method is based on range-preserving histogram specification and can enforce a shared color style without introducing gamut problems. We divide the input images into two sets of high and low visual quality. The high-quality images are used for color balancing, and the low-quality images are then color-transferred using the corrected high-quality images. Our experiments indicate that this histogram-based color correction method outperforms the compared algorithm.

Qi-Chong Tian, Laurent D. Cohen
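The gamut-safety argument above can be illustrated with a simple rank-based histogram matching on intensity lists; this is a sketch of the general idea, not the authors' exact range-preserving algorithm. Because every output value is drawn from the reference image, the result cannot leave the reference gamut:

```python
def match_histogram(source, reference):
    """Rank-based histogram matching: give `source` the value
    distribution of `reference` (both flat lists of intensities).
    Output values are drawn from `reference`, so they stay inside
    its range -- no out-of-gamut values are produced."""
    order = sorted(range(len(source)), key=lambda i: source[i])
    ref_sorted = sorted(reference)
    out = [0] * len(source)
    n, m = len(source), len(reference)
    for rank, idx in enumerate(order):
        # pick the reference value at the matching quantile
        out[idx] = ref_sorted[rank * m // n]
    return out

src = [0, 50, 100, 150, 200, 250]
ref = [10, 20, 30, 40, 60, 80]
print(match_histogram(src, ref))  # [10, 20, 30, 40, 60, 80]
```

Parameter-based global transfer functions, by contrast, can extrapolate past the valid range in bright and dark regions, which is exactly the gamut problem the paper targets.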

Comparison of Fine-Tuning and Extension Strategies for Deep Convolutional Neural Networks

In this study we compare three different fine-tuning strategies in order to investigate the best way to transfer the parameters of popular deep convolutional neural networks that were trained for a visual annotation task on one dataset, to a new, considerably different dataset. We focus on the concept-based image/video annotation problem and use ImageNet as the source dataset, while the TRECVID SIN 2013 and PASCAL VOC-2012 classification datasets are used as the target datasets. A large set of experiments examines the effectiveness of three fine-tuning strategies on each of three different pre-trained DCNNs and each target dataset. The reported results give rise to guidelines for effectively fine-tuning a DCNN for concept-based visual annotation.

Nikiforos Pittaras, Foteini Markatopoulou, Vasileios Mezaris, Ioannis Patras

Describing Geographical Characteristics with Social Images

Images play important roles in providing comprehensive understanding of our physical world. When thinking of a tourist city, one can immediately imagine pictures of its famous attractions. With the boom of social images, we attempt to explore the possibility of describing geographical characteristics of different regions. We here propose a Geographical Latent Attribute Model (GLAM) to mine regional characteristics from social images, which is expected to provide a comprehensive view of the regions. The model assumes that a geographical region consists of different “attributes” (e.g., infrastructures, attractions, events and activities) and “attributes” are interpreted by different image “clusters”. Both “attributes” and image “clusters” are modeled as latent variables. The experimental analysis on a collection of 2.5M Flickr photos regarding Chinese provinces and cities has shown that the proposed model is promising in describing regional characteristics. Moreover, we demonstrate the usefulness of the proposed model for place recommendation.

Huangjie Zheng, Jiangchao Yao, Ya Zhang

Fine-Grained Image Recognition from Click-Through Logs Using Deep Siamese Network

Image recognition using deep network models has achieved remarkable progress in recent years. However, fine-grained recognition remains a big challenge due to the lack of large-scale well labeled dataset to train the network. In this paper, we study a deep network based method for fine-grained image recognition by utilizing the click-through logs from search engines. We use both click times and probability values to filter out the noise in click-through logs. Furthermore, we propose a deep siamese network model to fine-tune the classifier, emphasizing the subtle difference between different classes and tolerating the variation within the same class. Our method is evaluated by training with the Bing clickture-dog dataset and testing with the well labeled dog breed dataset. The results demonstrate great improvement achieved by our method compared with naive training.

Wu Feng, Dong Liu
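A siamese network of the kind described is commonly trained with a contrastive loss that pulls same-class pairs together and pushes different-class pairs at least a margin apart; a minimal sketch of that loss (the 2-D embeddings and breed labels below are hypothetical, and the paper's exact objective may differ):

```python
def contrastive_loss(d, same, margin=1.0):
    """Contrastive loss on the distance d between two embeddings:
    same-class pairs are penalised by d^2, different-class pairs
    only while they are closer than `margin`."""
    if same:
        return d * d
    return max(0.0, margin - d) ** 2

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Hypothetical 2-D embeddings of three dog images.
husky_a = [0.9, 0.1]
husky_b = [0.8, 0.2]
malamute = [0.1, 0.9]

d_same = euclidean(husky_a, husky_b)
d_diff = euclidean(husky_a, malamute)
print(contrastive_loss(d_same, True))    # small: the similar pair is already close
print(contrastive_loss(d_diff, False))   # zero: the pair is already past the margin
```

Minimizing this over pairs emphasizes subtle between-class differences while tolerating within-class variation, which is the behaviour the abstract describes.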

Fully Convolutional Network with Superpixel Parsing for Fashion Web Image Segmentation

In this paper we introduce a new method for extracting deformable clothing items from still images by extending the output of a Fully Convolutional Neural Network (FCN) to infer context from local units (superpixels). To achieve this we optimize an energy function, that combines the large scale structure of the image with the local low-level visual descriptions of superpixels, over the space of all possible pixel labellings. To assess our method we compare it to the unmodified FCN network used as a baseline, as well as to the well-known Paper Doll and Co-parsing methods for fashion images.

Lixuan Yang, Helena Rodriguez, Michel Crucianu, Marin Ferecatu

Graph-Based Multimodal Music Mood Classification in Discriminative Latent Space

Automatic music mood classification is an important and challenging problem in music information retrieval (MIR) and has attracted growing attention from various research areas. In this paper, we propose a novel multimodal method for music mood classification that exploits the complementarity of the lyrics and audio of music to enhance classification accuracy. We first extract descriptive sentence-level lyrics and audio features from the music. Then, we project the paired low-level features of the two modalities into a learned common discriminative latent space, which not only eliminates between-modality heterogeneity but also increases the discriminability of the resulting descriptions. On the basis of this latent representation, we employ a graph-learning-based multimodal classification model for music mood, which takes the cross-modal similarity between local audio and lyrics descriptions into account to effectively exploit the correlations between modalities. The predicted mood categories for the individual sentences of a piece are then aggregated by a simple voting scheme. The effectiveness of the proposed method has been demonstrated in experiments on a real dataset composed of more than 3,000 minutes of music and the corresponding lyrics.

Feng Su, Hao Xue
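The final aggregation step above, a simple vote over per-sentence predictions, can be sketched in a few lines (the mood labels are illustrative):

```python
from collections import Counter

def aggregate_mood(sentence_predictions):
    """Majority vote over per-sentence mood predictions for one song."""
    votes = Counter(sentence_predictions)
    return votes.most_common(1)[0][0]

preds = ["happy", "happy", "calm", "happy", "sad"]
print(aggregate_mood(preds))  # happy
```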

Joint Face Detection and Initialization for Face Alignment

This paper presents a joint face detection and initialization method for cascaded face alignment. Unlike existing methods, which consider face detection and initialization as separate steps, we concurrently obtain a bounding box and initial facial landmarks (i.e. shape) in one step, yielding better accuracy and efficiency. Specifically, each image region is represented using shape-indexed features [6] derived from different head poses. A multipose face detector is trained: regions whose shapes are roughly aligned with faces have a good feature representation and are used as positive samples; the rest are treated as negative samples. During the face detection phase, initial landmarks can be explicitly placed on the detected faces according to the corresponding shape-indexed features. To accelerate our method, an ultrafast face proposal method based on a face probability map (FPM) and boosted classifiers is developed. Experimental results on public datasets demonstrate superior efficiency and robustness compared to existing initialization schemes, as well as a great accuracy improvement for cascaded face alignment.

Zhiwei Wang, Xin Yang

Large-Scale Product Classification via Spatial Attention Based CNN Learning and Multi-class Regression

Large-scale product classification is an essential technique for better product understanding and can support online retailers in a number of ways. This paper discusses CNN-based product classification in the presence of a class hierarchy. A SaCNN-MCR method is developed to address this task. It decomposes the classification into two stages. First, a spatial attention based CNN model that directly classifies a product into leaf classes is proposed. Compared with traditional CNNs, the proposed model focuses more on the product region than on the whole image. Second, the output CNN scores together with class hierarchy clues are jointly optimized by a multi-class regression (MCR) based refinement, which provides another kind of data fitting that further benefits the classification. Experiments on nearly one million real-world product images show that, thanks to these two innovations, SaCNN-MCR steadily improves classification performance over CNN models without these modules. Moreover, it is demonstrated that CNN features characterize product images much better than traditional features, outperforming them in classification by a large margin.

Shanshan Ai, Caiyan Jia, Zhineng Chen

Learning Features Robust to Image Variations with Siamese Networks for Facial Expression Recognition

This paper proposes a computationally efficient method for learning features robust to image variations for facial expression recognition (FER). The proposed method minimizes the feature difference between an image under a variable image variation and a corresponding target image with the best image conditions for FER (i.e. frontal face image with uniform illumination). This is achieved by regulating the objective function during the learning process where a Siamese network is employed. At the test stage, the learned network parameters are transferred to a convolutional neural network (CNN) with which the features robust to image variations can be obtained. Experiments have been conducted on the Multi-PIE dataset to evaluate the proposed method under a large number of variations including pose and illumination. The results show that the proposed method improves the FER performance under different variations without requiring extra computational complexity.

Wissam J. Baddar, Dae Hoe Kim, Yong Man Ro

M3LH: Multi-modal Multi-label Hashing for Large Scale Data Search

Recently, hashing-based techniques have attracted much attention in the media search community. In many applications, data have multiple modalities and multiple labels. Many hashing methods have been proposed for multi-modal data; however, they seldom consider the scenario of multiple labels, or they only use such information to build a simple similarity matrix, e.g., setting the corresponding value to 1 when two samples share at least one label. Apparently, such methods cannot make full use of the information contained in multiple labels. A model can thus be expected to perform well if it makes full use of the information in multi-modal, multi-label data. Motivated by this, we propose a new method, multi-modal multi-label hashing (M3LH), which not only works on multi-modal data but also makes full use of the information contained in multiple labels. Specifically, in M3LH we assume every label is associated with a binary code in Hamming space, and the binary code of a sample can be generated by combining the binary codes of its labels. While minimizing the Hamming distance between similar pairs and maximizing it between dissimilar pairs, we also learn a projection matrix that can be used to generate binary codes for out-of-sample data. Experimental results on three widely used datasets show that M3LH outperforms or is comparable to several state-of-the-art hashing methods.

Guan-Qun Yang, Xin-Shun Xu, Shanqing Guo, Xiao-Lin Wang
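The central idea, forming a sample's binary code from the codes of its labels and comparing codes by Hamming distance, can be sketched as follows. The 6-bit label codes are hypothetical, and the combination rule here is a simple bitwise majority vote, whereas M3LH learns the codes and the combination jointly:

```python
def combine_label_codes(codes):
    """Form a sample's binary code by bitwise majority vote over
    the codes of its labels (ties resolved to 1)."""
    n = len(codes)
    return [1 if sum(bits) * 2 >= n else 0 for bits in zip(*codes)]

def hamming(a, b):
    """Number of differing bit positions between two codes."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 6-bit codes for three labels.
label_codes = {
    "beach":  [1, 0, 1, 1, 0, 0],
    "sunset": [1, 1, 1, 0, 0, 0],
    "people": [0, 1, 0, 0, 1, 1],
}
photo = combine_label_codes([label_codes["beach"], label_codes["sunset"]])
print(photo)                                   # [1, 1, 1, 1, 0, 0]
print(hamming(photo, label_codes["people"]))   # 5
```

Samples sharing labels end up with small Hamming distances, which is the property the learned similarity-preserving objective enforces.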

Model-Based 3D Scene Reconstruction Using a Moving RGB-D Camera

This paper presents a scalable model-based approach for 3D scene reconstruction using a moving RGB-D camera. The proposed approach enhances the accuracy of pose estimation by exploiting the rich information in the multi-channel RGB-D image data. Our approach has clear advantages in the reconstruction quality of the 3D scene compared with conventional approaches that use sparse features for pose estimation. The pre-learned image-based 3D model provides multiple templates for sampled views of the model, which are used to estimate the poses of the frames in the input RGB-D video without requiring a priori internal and external camera parameters. Through template-to-frame registration, the reconstructed 3D scene can be loaded into an augmented reality (AR) environment to facilitate displaying, interaction, and rendering in an image-based AR application. Finally, we verify the ability of the established reconstruction system on publicly available benchmark datasets and compare it with state-of-the-art pose estimation algorithms. The results indicate that our approach outperforms the compared methods in the accuracy of pose estimation.

Shyi-Chyi Cheng, Jui-Yuan Su, Jing-Min Chen, Jun-Wei Hsieh

Modeling User Performance for Moving Target Selection with a Delayed Mouse

The growth in networking and cloud services provides opportunities to host multimedia on remote servers, but also brings challenges to developers who must deal with added delays that degrade interactivity. A fundamental action in many computer-based multimedia applications is selecting a moving target with the mouse. While previous research has modeled both moving target selection and target selection with delay, there have been no models of moving target selection with delay. Our work presents a user study that measures the effects of delay and target speed on the time to select a moving target with a mouse, with analysis of trends and derivation of a model. The analysis shows that delay and speed impact target selection time exponentially, and that selection time is well represented by a model with three terms: two exponential terms for delay and speed, and one interaction term.

Mark Claypool, Ragnhild Eg, Kjetil Raaen
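A model of the shape described, exponential terms for delay and target speed plus an interaction term, can be sketched as below. The coefficients are invented for illustration and are not the fitted values from the study:

```python
import math

def selection_time(delay_ms, speed_pps,
                   a=0.9, b=0.05, c=0.004, d=0.1, e=0.003, f=0.00001):
    """Illustrative moving-target selection-time model (seconds):
    a base time, an exponential delay term, an exponential speed
    term, and a delay-speed interaction term. All coefficients
    are hypothetical placeholders."""
    return (a
            + b * math.exp(c * delay_ms)       # delay term
            + d * math.exp(e * speed_pps)      # speed term
            + f * delay_ms * speed_pps)        # interaction term

# Predicted time grows with delay at a fixed target speed.
for delay in (0, 100, 200):
    print(round(selection_time(delay, 300), 3))
```

Fitting such a form to measured selection times yields the exponential trends the study reports.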

Multi-attribute Based Fire Detection in Diverse Surveillance Videos

Fire detection, an immediate response to fire accidents that can avert great disasters, has attracted many researchers' attention. However, existing methods cannot effectively exploit the comprehensive attributes of fire to deliver satisfying accuracy. In this paper, we design a multi-attribute fire detection system that combines the fire's color, geometric, and motion attributes to accurately detect fire in complicated surveillance videos. For the geometric attribute, we propose a descriptor of shape variation that combines contour moments and line detection. Furthermore, to exploit fire's instantaneous motion characteristics, we design a dense-optical-flow-based descriptor as the motion attribute. Finally, we build a fire detection video dataset as a benchmark, containing 305 fire and non-fire videos, with 135 very challenging negative samples. Experimental results on this benchmark demonstrate that the proposed approach greatly outperforms the state-of-the-art method, with 92.30% accuracy and only 8.33% false positives.

Shuangqun Li, Wu Liu, Huadong Ma, Huiyuan Fu

Near-Duplicate Video Retrieval by Aggregating Intermediate CNN Layers

The problem of Near-Duplicate Video Retrieval (NDVR) has attracted increasing interest due to the huge growth of video content on the Web, which is characterized by high degree of near duplicity. This calls for efficient NDVR approaches. Motivated by the outstanding performance of Convolutional Neural Networks (CNNs) over a wide variety of computer vision problems, we leverage intermediate CNN features in a novel global video representation by means of a layer-based feature aggregation scheme. We perform extensive experiments on the widely used CC_WEB_VIDEO dataset, evaluating three popular deep architectures (AlexNet, VGGNet, GoogLeNet) and demonstrating that the proposed approach exhibits superior performance over the state-of-the-art, achieving a mean Average Precision (mAP) score of 0.976.

Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Patras, Yiannis Kompatsiaris
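A layer-based aggregation along these lines can be sketched by max-pooling each channel of every intermediate layer to a scalar, concatenating across layers, and L2-normalizing. This is a simplified stand-in for the paper's scheme, with toy activation lists in place of real CNN feature maps:

```python
def aggregate_layers(layer_maps):
    """Collapse each channel of every intermediate layer to its
    maximum activation, concatenate across layers, and
    L2-normalise into one global frame descriptor."""
    vec = [max(channel) for maps in layer_maps for channel in maps]
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

# Two hypothetical layers: 2 and 3 channels of flattened activations.
conv3 = [[0.1, 0.9, 0.4], [0.0, 0.2, 0.1]]
conv4 = [[0.5, 0.5], [0.3, 0.7], [0.2, 0.6]]
desc = aggregate_layers([conv3, conv4])
print(len(desc))  # 5: one value per channel across both layers
```

Near-duplicate videos then reduce to frames whose descriptors lie close together under cosine similarity.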

No-Reference Image Quality Assessment Based on Internal Generative Mechanism

No-reference (NR) image quality assessment (IQA) aims to measure the visual quality of a distorted image without access to its non-distorted reference image. Recent neuroscience research indicates that the human visual system (HVS) perceives and understands perceptual signals through an internal generative mechanism (IGM). Based on the IGM, we propose a novel and effective no-reference IQA framework. First, we decompose an image into an orderly part and a disorderly part using a computational prediction model. Then we extract the joint statistics of two local contrast features from the orderly part and local binary pattern (LBP) based structural distributions from the other part. Finally, the two groups of features extracted from the complementary parts are combined to train a regression model for image quality estimation. Extensive experiments on standard databases validate that the proposed IQA method is highly competitive with state-of-the-art NR-IQA methods. Moreover, the proposed metric also demonstrates its effectiveness on multiply-distorted images.

Xinchun Qian, Wengang Zhou, Houqiang Li
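The LBP-based structural distribution mentioned above builds on the basic local binary pattern code, which thresholds each pixel's eight neighbours against the centre value; a minimal sketch on a toy image:

```python
def lbp_code(img, y, x):
    """8-neighbour local binary pattern code of pixel (y, x):
    each neighbour >= centre contributes one bit."""
    c = img[y][x]
    neigh = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
             (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(neigh):
        if img[y + dy][x + dx] >= c:
            code |= 1 << bit
    return code

img = [
    [10, 200, 10],
    [10,  50, 80],
    [10,  10, 90],
]
print(lbp_code(img, 1, 1))  # 26: bits set for the three brighter neighbours
```

Histogramming such codes over a region yields the structural distributions the framework feeds to its quality regressor.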

On the Exploration of Convolutional Fusion Networks for Visual Recognition

Despite recent advances in multi-scale deep representations, their limitations are attributed to expensive parameters and weak fusion modules. Hence, we propose an efficient approach to fuse multi-scale deep representations, called convolutional fusion networks (CFN). Owing to the use of 1×1 convolution and global average pooling, CFN can efficiently generate the side branches while adding few parameters. In addition, we present a locally-connected fusion module, which can learn adaptive weights for the side branches and form a discriminatively fused feature. CFN models trained on the CIFAR and ImageNet datasets demonstrate remarkable improvements over plain CNNs. Furthermore, we generalize CFN to three new tasks, including scene recognition, fine-grained recognition and image retrieval. Our experiments show that it obtains consistent improvements on these transfer tasks.

Yu Liu, Yanming Guo, Michael S. Lew

Phase Fourier Reconstruction for Anomaly Detection on Metal Surface Using Salient Irregularity

In this paper, we propose a Phase Fourier Reconstruction (PFR) approach for anomaly detection on metal surfaces using salient irregularities. To obtain salient irregularities from images captured by an automatic visual inspection (AVI) system under different lighting settings, we first train a classifier for image selection, as only dark images are used for anomaly detection. In these images, surface details, part design, and foreground/background boundaries become indistinct, but anomaly regions are highlighted because of the diffuse reflection caused by rough surfaces. PFR is then applied so that regular patterns and homogeneous regions are further de-emphasized while anomaly areas become distinct and can be located. Unlike existing phase-based methods, which require substantial texture information, our PFR works on both textured and non-textured images. Unlike existing template matching methods, which require prior knowledge of defect-free patterns, our PFR is an unsupervised approach that detects anomalies from a single image. Experimental results clearly demonstrate the effectiveness of the proposed method, which outperforms several well-designed methods [8, 12, 15, 16, 18, 19] with a running time of less than 0.01 seconds per image.

Tzu-Yi Hung, Sriram Vaikundam, Vidhya Natarajan, Liang-Tien Chia
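Phase-only Fourier reconstruction, the general principle behind PFR, can be demonstrated in one dimension: discarding the magnitude spectrum flattens periodic structure while an irregular sample keeps a strong response. This sketch uses a naive O(n²) DFT on a toy signal, not the authors' 2-D pipeline:

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def phase_only(signal):
    """Keep only the Fourier phase (unit magnitude in every bin)
    and transform back: regular, repetitive structure flattens out
    while irregularities retain a strong response."""
    spectrum = dft(signal)
    phase = [z / abs(z) if abs(z) > 1e-12 else 0j for z in spectrum]
    return [abs(v) for v in idft(phase)]

# A perfectly periodic row with one defective sample.
row = [1.0, 0.0] * 8
row[5] = 0.9                      # the "anomaly"
response = phase_only(row)
peak = max(range(len(response)), key=lambda i: response[i])
print(peak)  # 5: the response peaks exactly at the irregular sample
```

The same effect in 2-D is what lets PFR suppress regular surface patterns and leave defect regions standing out.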

ReMagicMirror: Action Learning Using Human Reenactment with the Mirror Metaphor

We propose ReMagicMirror, a system to help people learn actions (e.g., martial arts, dances). We first capture the motions of a teacher performing the action to learn, using two RGB-D cameras. Next, we fit a parametric human body model to the depth data and texture it using the color data, reconstructing the teacher’s motion and appearance. The learner is then shown the ReMagicMirror system, which acts as a mirror. We overlay the teacher’s reconstructed body on top of this mirror in an augmented reality fashion. The learner is able to intuitively manipulate the reconstruction’s viewpoint by simply rotating her body, allowing for easy comparisons between the learner and the teacher. We perform a user study to evaluate our system’s ease of use, effectiveness, quality, and appeal.

Fabian Lorenzo Dayrit, Ryosuke Kimura, Yuta Nakashima, Ambrosio Blanco, Hiroshi Kawasaki, Katsushi Ikeuchi, Tomokazu Sato, Naokazu Yokoya

Robust Image Classification via Low-Rank Double Dictionary Learning

In recent years, dictionary learning has been widely used in various image classification applications. However, how to construct an effective dictionary for robust image classification, in which both the training and the testing image samples are corrupted, is still an open problem. To address this, we propose a novel low-rank double dictionary learning (LRD2L) method. Unlike traditional dictionary learning methods, LRD2L simultaneously learns three components from training data: (1) a low-rank class-specific sub-dictionary for each class to capture the most discriminative features of that class, (2) a low-rank class-shared dictionary that models the common patterns shared by different classes, and (3) a sparse error container to fit the noise in the data. As a result, the class-specific information, the class-shared information and the noise contained in the data are separated from each other. Therefore, the dictionaries learned by LRD2L are noiseless, and the class-specific sub-dictionary of each class can be more discriminative. Also, since the common features across classes, which are essential to the reconstruction of image samples, are preserved in the class-shared dictionary, LRD2L has a powerful reconstructive capability for newly arriving testing samples. Experimental results on three publicly available datasets reveal the effectiveness and superiority of our approach compared to state-of-the-art dictionary learning methods.

Yi Rong, Shengwu Xiong, Yongsheng Gao

Robust Scene Text Detection for Multi-script Languages Using Deep Learning

Text detection in natural images is in high demand for many real-life applications such as image retrieval and self-navigation. This work addresses the problem of robust text detection, especially for multi-script text in natural scene images. Unlike existing works that treat multi-script characters as groups of text fragments, we treat them as non-connected components. Specifically, we first propose a novel representation named Linked Extremal Regions (LER) to extract full characters instead of fragments of scene characters. Second, we propose a two-stage convolutional neural network for discriminating multi-script text in cluttered background images, yielding more robust text detection. Experimental results on three well-known datasets, namely ICDAR 2011, ICDAR 2013 and MSRA-TD500, demonstrate that the proposed method outperforms the state-of-the-art methods and is also language independent.

Ruo-Ze Liu, Xin Sun, Hailiang Xu, Palaiahnakote Shivakumara, Feng Su, Tong Lu, Ruoyu Yang

Robust Visual Tracking Based on Multi-channel Compressive Features

Tracking-by-detection approaches show good performance in visual tracking; they typically train discriminative classifiers to separate the tracking target from its surrounding background. An effective and efficient image feature plays an important role in building an outstanding tracker: a good feature separates the tracked object from the background more easily, and should also be robust to nuisance factors such as illumination changes, appearance changes, shape variations, and partial or full occlusions. To this end, in this paper we present a novel multi-channel compressive feature, which combines rich information from multiple channels and projects it into a low-dimensional compressive feature space. We then design a new visual tracker based on the multi-channel compressive features. Finally, extensive comparative experiments on a series of challenging sequences demonstrate that our tracker outperforms most state-of-the-art tracking approaches, which also shows that our multi-channel compressive feature is effective and efficient.

Jianqiang Xu, Yao Lu
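The core of a compressive feature, as sketched in the abstract above, is a random projection from a high-dimensional concatenation of channel features down to a compact vector. The following NumPy sketch is only illustrative (the channel contents, dimensions, and the dense Gaussian projection are generic stand-ins, not the paper's exact construction):

```python
import numpy as np

rng = np.random.default_rng(0)

def compress(features, proj):
    """Project a concatenated multi-channel feature to a low dimension."""
    return proj @ features

# three hypothetical channels, e.g. color, gradient and texture responses
channels = [rng.normal(size=256) for _ in range(3)]
x = np.concatenate(channels)                        # 768-d multi-channel feature
proj = rng.normal(size=(50, x.size)) / np.sqrt(50)  # random projection matrix
z = compress(x, proj)
print(z.shape)  # (50,)
```

By the Johnson-Lindenstrauss argument, such random projections approximately preserve distances, which is what lets a classifier operate in the compressed space.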

Single Image Super-Resolution with a Parameter Economic Residual-Like Convolutional Neural Network

Recent years have witnessed the great success of convolutional neural networks (CNNs) for various problems in both low- and high-level vision. Especially noteworthy is the residual network, which was originally proposed for high-level vision problems and enjoys several merits. This paper aims to extend the merits of the residual network, such as the fast training induced by skip connections, to a typical low-level vision problem, namely single image super-resolution. In general, the two main challenges of existing deep CNNs for super-resolution are the gradient exploding/vanishing problem and the large number of parameters and computational cost as the CNN goes deeper. Correspondingly, skip connections (identity mapping shortcuts) are utilized to avoid the gradient exploding/vanishing problem. To tackle the second problem, we propose a parameter-economic CNN architecture with carefully designed width, depth and skip connections. Experimental results demonstrate that the proposed CNN model not only achieves state-of-the-art PSNR and SSIM results for single image super-resolution but also produces visually pleasant results.

Ze Yang, Kai Zhang, Yudong Liang, Jinjun Wang
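The identity-shortcut idea the abstract above builds on can be shown in a few lines. This is a minimal NumPy sketch of a generic residual block (matrix multiplications stand in for convolutions; it is not the paper's architecture):

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = x + F(x): the identity shortcut lets gradients bypass F,
    easing the vanishing/exploding problem mentioned in the abstract."""
    h = np.maximum(w1 @ x, 0.0)  # stand-in for conv + ReLU
    return x + w2 @ h

x = np.arange(4.0)
w1 = np.eye(4)
y = residual_block(x, w1, np.zeros((4, 4)))
print(np.allclose(y, x))  # True: with F(x) = 0 the block is the identity
```

Because the block reduces to the identity when F contributes nothing, stacking many such blocks does not degrade the signal path, which is why skip connections speed up training of deep models.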

Spatio-Temporal VLAD Encoding for Human Action Recognition in Videos

Encoding is one of the key factors in building an effective video representation. In recent work, super-vector-based encoding approaches have been highlighted as among the most powerful representation generators. The Vector of Locally Aggregated Descriptors (VLAD) is one of the most widely used super-vector methods. However, one limitation of VLAD encoding is its lack of spatial information captured from the data, which is critical especially when dealing with video. In this work, we propose Spatio-temporal VLAD (ST-VLAD), an extended encoding method which incorporates spatio-temporal information within the encoding process. This is carried out by dividing each video and aggregating specific information over the feature group of each video split. Experimental validation is performed using both hand-crafted and deep features. Our pipeline for action recognition with the proposed encoding method obtains state-of-the-art performance on three challenging datasets: HMDB51 (67.6%), UCF50 (97.8%) and UCF101 (91.5%).

Ionut C. Duta, Bogdan Ionescu, Kiyoharu Aizawa, Nicu Sebe
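For reference, the standard VLAD encoding that ST-VLAD extends aggregates, for each codebook center, the residuals of the descriptors assigned to it. A minimal NumPy sketch with a synthetic codebook and descriptors (power and L2 normalization included, as is conventional; this is not the authors' implementation):

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """Aggregate residuals of descriptors to their nearest centers (standard VLAD)."""
    k, d = centers.shape
    # assign each descriptor to its nearest codebook center
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(1)
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assign == i]
        if len(members):
            v[i] = (members - centers[i]).sum(0)  # residual aggregation
    v = np.sign(v) * np.sqrt(np.abs(v))           # power normalization
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)        # L2 normalization

rng = np.random.default_rng(0)
desc = rng.normal(size=(100, 8))     # 100 local descriptors, 8-d
centers = rng.normal(size=(4, 8))    # codebook of 4 centers
code = vlad_encode(desc, centers)
print(code.shape)  # (32,)
```

ST-VLAD's extension, per the abstract, amounts to applying such aggregation per spatio-temporal video split rather than over the whole video at once.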

Structure-Aware Image Resizing for Chinese Characters

This paper presents a structure-aware resizing method for Chinese character images. Compared to other image resizing approaches, the proposed method is able to preserve important features such as the width, orientation and trajectory of each stroke for a given Chinese character. The key idea of our method is to first automatically decompose the character image into strokes, and then separately resize those strokes naturally using a modified linear blend skinning approach and as-rigid-as-possible shape interpolation under the guidance of structure information. Experimental results not only verify the superiority of our method compared to the state of the art but also demonstrate its effectiveness in several real applications.

Chengdong Liu, Zhouhui Lian, Yingmin Tang, Jianguo Xiao

Supervised Class Graph Preserving Hashing for Image Retrieval and Classification

With the explosive growth of data, hashing-based techniques have attracted significant attention due to their efficient retrieval and storage reduction abilities. However, most hashing methods cannot predict labels directly. In this paper, we propose a novel supervised hashing approach, namely Class Graph Preserving Hashing (CGPH), which incorporates label information into hashing codes and can classify samples directly with binary codes. Specifically, CGPH learns hashing functions by ensuring label consistency and preserving class-graph similarity among hashing codes simultaneously. It then learns effective binary codes through an orthogonal transformation that minimizes the quantization error between the hashing function outputs and the binary codes. In addition, an iterative method is proposed for the optimization problem in CGPH. Extensive experiments on two large-scale real-world image datasets show that CGPH outperforms or is comparable to state-of-the-art hashing methods in both image retrieval and classification tasks.

Lu Feng, Xin-Shun Xu, Shanqing Guo, Xiao-Lin Wang
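The retrieval step shared by hashing methods like CGPH is ranking database items by Hamming distance between binary codes, which is cheap bitwise arithmetic. A generic NumPy sketch with toy 4-bit codes (illustrative only, not the CGPH pipeline):

```python
import numpy as np

def hamming_rank(query, db):
    """Rank database items by Hamming distance to the query's binary code."""
    dists = (db != query).sum(axis=1)          # per-item bit disagreements
    return np.argsort(dists, kind="stable"), dists

db = np.array([[0, 1, 1, 0],
               [1, 1, 1, 0],
               [0, 0, 0, 0]])
order, dists = hamming_rank(np.array([0, 1, 1, 0]), db)
print(order.tolist())  # [0, 1, 2]
```

In production systems the same distance is computed with XOR and popcount over packed bit words, which is what makes hashing-based retrieval fast at scale.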

Visual Robotic Object Grasping Through Combining RGB-D Data and 3D Meshes

In this paper, we present a novel framework for automatic robotic grasping that matches camera-captured RGB-D data with 3D meshes, on which prior grasping knowledge is pre-defined for each object type. The proposed framework consists of two modules: pre-defining grasping knowledge for each type of object shape on 3D meshes, and automatic robotic grasping by matching RGB-D data with the pre-defined 3D meshes. In the first module, we scan 3D meshes for typical object shapes and pre-define grasping regions on each 3D shape surface, which serve as the prior knowledge guiding automatic robotic grasping. In the second module, for each RGB-D image captured by a depth camera, we recognize the 2D shape of the object with an SVM classifier and then segment it from the background using depth data. Next, we propose a new algorithm to match the segmented RGB-D shape with the pre-defined 3D meshes to guide robotic self-location and grasping automatically. Our experimental results show that the proposed framework is particularly useful for guiding camera-based robotic grasping.

Yiyang Zhou, Wenhai Wang, Wenjie Guan, Yirui Wu, Heng Lai, Tong Lu, Min Cai

What Convnets Make for Image Captioning?

Nowadays, a general pipeline for the image captioning task takes advantage of image representations based on convolutional neural networks (CNNs) and sequence modeling based on recurrent neural networks (RNNs). Since captioning performance closely depends on the discriminative capacity of the CNN, our work investigates the effects of different Convnets (CNN models) on image captioning. We train three Convnets on different classification tasks: single-label, multi-label and multi-attribute, and then feed the visual representations from these Convnets into a Long Short-Term Memory (LSTM) network to model the sequence of words. Since the three Convnets focus on different visual content in an image, we propose aggregating them to generate a richer visual representation. Furthermore, during testing, we use an efficient multi-scale augmentation approach based on fully convolutional networks (FCNs). Extensive experiments on the MS COCO dataset provide significant insights into the effects of Convnets. Finally, we achieve results comparable to the state of the art for both caption generation and image-sentence retrieval tasks.

Yu Liu, Yanming Guo, Michael S. Lew

What are Good Design Gestures?

–Towards User- and Machine-friendly Interface–

This paper discusses gesture design for man–machine interfaces. Traditionally, gesture-interface studies have focused on improving performance in terms of increasing speed and accuracy, in particular reducing false positives. Many studies neglect the gestures' intrinsic machine friendliness, which can improve recognition accuracy, and user friendliness, which makes a gesture easier to use and remember. In this paper, we investigate machine- and user-friendly gestures and analyze the results of an Internet-based questionnaire in which 351 individuals were asked to assign gestures to eight operations.

Ryo Kawahata, Atsushi Shimada, Rin-ichiro Taniguchi

SS1: Social Media Retrieval and Recommendation

Frontmatter

Collaborative Dictionary Learning and Soft Assignment for Sparse Coding of Image Features

In computer vision, the bag-of-words (BoW) model has been widely applied to image-related tasks such as large-scale image retrieval, image classification, and object categorization. Using sparse coding (SC) for feature coding guarantees both sparsity of the coding vector and a lower reconstruction error in the BoW model, and thus achieves better performance than the traditional vector quantization method. However, it suffers from a side effect of the non-smooth sparsity regularizer: quite different words may be selected for similar patches in order to favor sparsity, resulting in the loss of correlation between the corresponding coding vectors. To address this problem, we propose a novel soft assignment method based on the index combination of the top-2 largest sparse codes of local descriptors, which makes SC-based BoW tolerant to different word selections for similar patches. To further ensure that similar patches select the same words and generate similar coding vectors, we propose a collaborative dictionary learning method that imposes a sparse-code similarity regularization factor, together with row-sparsity regularization across data instances, on top of group sparse coding. Experiments on the well-known public Oxford dataset demonstrate the effectiveness of our proposed methods.

Jie Liu, Sheng Tang, Yu Li
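The top-2 idea above can be made concrete: key each patch by the indices of its two largest-magnitude sparse coefficients, so that two similar patches whose codes differ slightly can still collide on the same index pair. This is a hypothetical sketch of that keying step only; the paper's full combination and weighting rule is richer than what is shown:

```python
import numpy as np

def top2_index(code):
    """Key a patch by the indices of its two largest-magnitude
    sparse coefficients (illustrative; not the paper's exact rule)."""
    i, j = np.argsort(np.abs(code))[-2:][::-1]
    return (int(i), int(j))

# two similar patches whose sparse codes differ slightly still share a key
print(top2_index(np.array([0.0, 0.9, 0.0, 0.4, 0.05])))  # (1, 3)
print(top2_index(np.array([0.0, 0.8, 0.1, 0.5, 0.0])))   # (1, 3)
```

The shared key is what restores correlation between the coding vectors of similar patches that plain sparse coding would have assigned to different words.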

LingoSent — A Platform for Linguistic Aware Sentiment Analysis for Social Media Messages

Sentiment analysis is an important natural language processing (NLP) task applied in a wide range of scenarios. Social media messages such as tweets often differ from formal writing, exhibiting unorthodox capitalization, expressive lengthening, Internet slang, etc. While such characteristics are inherently beneficial for sentiment analysis, they also pose new challenges for existing NLP platforms. In this article, we present a new approach that improves lexicon-based sentiment analysis by extracting and utilizing linguistic features in a comprehensive manner. In contrast to existing solutions, we design our approach as a framework in which data preprocessing, linguistic feature extraction and sentiment calculation are separate components. This allows each component to be easily modified and extended. More importantly, we can easily configure the sentiment calculation with respect to the extracted features to optimize sentiment analysis for different application contexts. In a comprehensive evaluation, we show that our system outperforms existing state-of-the-art lexicon-based sentiment analysis solutions.

Yuting Su, Huijing Wang

Multi-Task Multi-modal Semantic Hashing for Web Image Retrieval with Limited Supervision

As an important element of social media, social images are becoming ever more important in our daily life. Recently, smart hashing schemes have emerged as a promising approach to support fast social image search. Leveraging semantic labels has shown effectiveness for hashing; however, semantic labels tend to be limited in both quantity and quality. In this paper, we propose Multi-Task Multi-modal Semantic Hashing (MTMSH) to index large-scale social image collections with limited supervision. MTMSH improves search accuracy by exploiting more semantic information in two ways. First, the latent multi-modal structure among labeled and unlabeled data is explored by Multiple Anchor Graph Learning (MAGL) to increase the quantity of semantic information. Second, multi-task based Share Hash Space Learning (SHSL) is proposed to improve semantic quality. MAGL and SHSL are then integrated into a joint framework, in which the semantic function and the hash functions mutually reinforce each other. An alternating optimization algorithm, whose time complexity is linear in the size of the training data, is also proposed. Experimental results on two large-scale real-world image datasets demonstrate the effectiveness and efficiency of MTMSH.

Liang Xie, Lei Zhu, Zhiyong Cheng

Object-Based Aggregation of Deep Features for Image Retrieval

In content-based visual image retrieval, image representation is one of the fundamental issues in improving retrieval performance. Recently, Convolutional Neural Network (CNN) features have shown great success as a universal representation. However, deep CNN features lack invariance to geometric transformations and object compositions, which limits their robustness for scene image retrieval. Since a scene image is always composed of multiple objects, which are crucial components for understanding and describing the scene, in this paper we propose an object-based aggregation method over CNN features to obtain an invariant and compact image representation for retrieval. The proposed method represents an image through VLAD pooling of CNN features describing the underlying objects, which makes the representation robust to the spatial layout of objects in the scene and invariant to general geometric transformations. We evaluate the proposed method on three public ground-truth datasets by comparing with state-of-the-art approaches, and promising improvements are achieved.

Yu Bao, Haojie Li

Uyghur Language Text Detection in Complex Background Images Using Enhanced MSERs

Text detection in complex background images is an important prerequisite for many image content analysis tasks. Nearly all widely used text detection methods focus on English and Chinese, while minority languages such as Uyghur receive less attention from researchers. In this paper, we propose a system that detects Uyghur text in complex background images. First, component candidates are detected by the channel-enhanced Maximally Stable Extremal Regions (MSERs) algorithm. Then, most non-text regions are removed by a two-layer filtering mechanism. Next, the remaining component regions are connected into short chains, and the short chains are expanded by an expansion algorithm to recover missed MSERs. Finally, the chains are verified by a Random Forest classifier. Experimental comparisons on the proposed dataset show that our algorithm is effective for detecting Uyghur text in complex background images, achieving an F-measure of 84.8%, much better than the state-of-the-art performance of 75.5%.

Shun Liu, Hongtao Xie, Chuan Zhou, Zhendong Mao

SS2: Modeling Multimedia Behaviors

Frontmatter

CELoF: WiFi Dwell Time Estimation in Free Environment

WiFi wireless access has become a basic need for smartphone users in the era of mobile multimedia, and a large number of WiFi hotspots have developed into an important infrastructure for multimedia access in the smart city. Learning the dynamic features of free-environment WiFi connections is of great help both to the customization of WiFi connection services and to mobile multimedia strategies. While mobility prediction attracts much interest in human behavior research, it is mostly focused on fixed environments such as universities, homes and offices; this paper investigates more challenging public regions such as shopping malls. We propose a WiFi dwell time estimation method from a crowdsourcing perspective to tackle the lack of contextual information for a single individual in such free environments. This is achieved by a context-embedded longitudinal factorization (CELoF) method based on multi-way tensor factorization; experiments on a real dataset demonstrate the efficacy of the proposed solution.

Chen Yan, Peng Wang, Haitian Pang, Lifeng Sun, Shiqiang Yang

Demographic Attribute Inference from Social Multimedia Behaviors: A Cross-OSN Approach

This study focuses on exploiting dynamic social multimedia behaviors to infer stable demographic attributes. Existing demographic attribute inference studies are devoted to developing advanced features/models or to exploiting external information and knowledge; the conflict between the dynamicity of behaviors and the steadiness of demographic attributes is largely ignored. To address this issue, we introduce a cross-OSN approach to discover the shared stable patterns in users' social multimedia behaviors on multiple Online Social Networks (OSNs). The basic assumption of the proposed approach is that the same user's cross-OSN behaviors are reflections of his/her demographic attributes in different scenarios. Based on this, a coupled projection matrix extraction method is proposed, in which the cross-OSN behaviors are collectively projected onto the same space for demographic attribute inference. Experimental evaluation is conducted on a self-collected Google+ and Twitter dataset covering four demographic attributes: gender, age, relationship and occupation. The experimental results demonstrate the effectiveness of cross-OSN based demographic attribute inference.

Liancheng Xiang, Jitao Sang, Changsheng Xu

Understanding Performance of Edge Prefetching

When using online services, the time users wait for requested content to be downloaded from online servers to local devices can significantly influence user experience. To reduce waiting time, content that is likely to be requested in the future can be pre-downloaded to a local cache on edge proxies (i.e., edge prefetching). This paper addresses the performance issues of prefetching at edge proxies (e.g., Wi-Fi Access Points (APs) and cellular base stations). We introduce an AP-based prefetching framework and study the impact of several factors on its benefits and costs through trace-driven simulation experiments. Our results yield useful insights for the design of prediction algorithms and edge prefetching systems. First, increasing the prediction window size of the prediction algorithms used by mobile applications can significantly reduce user waiting time. Second, cache size matters for reducing user waiting time only up to a certain threshold. Third, the ratio of correct predictions to all actual requests (i.e., recall) reduces user waiting time linearly, while the ratio of correct predictions to all predictions (i.e., precision) determines the traffic cost, so a trade-off must be made when designing a prediction algorithm.

Zhengyuan Pang, Lifeng Sun, Zhi Wang, Yuan Xie, Shiqiang Yang
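The precision/recall trade-off stated in the abstract above is easy to make concrete: recall measures how many actual requests were served from the prefetch cache (waiting-time savings), while precision measures how much of the prefetched traffic was useful (wasted-bandwidth bound). A small illustrative sketch with made-up request identifiers:

```python
def prefetch_metrics(predicted, requested):
    """Recall drives waiting-time savings; precision bounds wasted traffic."""
    predicted, requested = set(predicted), set(requested)
    hits = predicted & requested
    recall = len(hits) / len(requested) if requested else 0.0
    precision = len(hits) / len(predicted) if predicted else 0.0
    return precision, recall

# hypothetical trace: 4 items prefetched, 3 actually requested, 2 overlap
p, r = prefetch_metrics({"a", "b", "c", "d"}, {"a", "b", "e"})
print(p, r)  # 0.5 0.6666666666666666
```

Prefetching more aggressively raises recall (and the waiting-time savings) but, past a point, only lowers precision and inflates traffic cost, which is the trade-off the abstract describes.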

User Identification by Observing Interactions with GUIs

Given our increasing reliance on computing devices, their security becomes ever more important. In this work, we are interested in exploiting user behaviour as a means of reducing the potential for masquerade attacks, which occur when an intruder manages to breach the system and act as an authorised user. This is possible using stolen passwords or by taking advantage of unlocked, unattended devices. Once the attacker has passed the authentication step, they may have full access to the machine, including any private data and software. Continuous identification, in which the identity of the user is checked throughout the session, can be an effective way to prevent such attacks. Beyond security, a reliable dynamic identification system is also of interest for user profiling and recommendation. In this paper, we present a method for user identification that models a user's behaviour when interacting with the graphical user interface of a computing device. A publicly available logging tool has been developed specifically to passively capture human-computer interactions. Two experiments were conducted to evaluate the model, and the results show the effectiveness and reliability of the method for dynamic user identification.

Zaher Hinbarji, Rami Albatal, Cathal Gurrin

Utilizing Locality-Sensitive Hash Learning for Cross-Media Retrieval

Cross-media retrieval is an imperative approach to handling the explosive growth of multimodal data on the web. However, existing approaches to cross-media retrieval are computationally expensive due to the curse of dimensionality. To retrieve efficiently from multimodal data, it is essential to reduce the proportion of irrelevant documents considered. In this paper, we propose a cross-media retrieval approach (FCMR) based on locality-sensitive hashing (LSH) and neural networks. Multimodal information is projected by the LSH algorithm so that similar objects are clustered into the same hash bucket and dissimilar objects into different ones, using hash functions learned through neural networks. Given a textual or visual query, it can be efficiently mapped to a hash bucket whose stored objects are likely near neighbors of the query. Experimental results show that, within the set of near neighbors obtained by the proposed method, the proportion of relevant documents is much higher, indicating that retrieval based on near neighbors can be conducted effectively. Further evaluations on two public datasets demonstrate the effectiveness of the proposed retrieval method compared to the baselines.

Jia Yuhua, Bai Liang, Wang Peng, Guo Jinlin, Xie Yuxiang, Yu Tianyuan
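The bucketing mechanism the FCMR abstract relies on can be illustrated with the classic random-hyperplane variant of LSH (the paper learns its hash functions with neural networks; random planes are used here only as a simple stand-in):

```python
import numpy as np

def lsh_bucket(x, planes):
    """Each hyperplane contributes one sign bit; the bit tuple is the
    hash-bucket key, so nearby vectors tend to share a bucket."""
    return tuple(int(b) for b in (planes @ x > 0))

rng = np.random.default_rng(1)
planes = rng.normal(size=(8, 16))  # 8 hash bits over 16-d features
x = rng.normal(size=16)
key = lsh_bucket(x, planes)
print(len(key))  # 8
```

At query time, only the items stored under the query's bucket key need to be scored, which is what cuts the proportion of irrelevant documents examined.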

SS3: Multimedia Computing for Intelligent Life

Frontmatter

A Sensor-Based Official Basketball Referee Signals Recognition System Using Deep Belief Networks

In a basketball game, the referees, who are responsible for enforcing the rules and maintaining the order of the game, have only a brief moment to determine whether an infraction has occurred; they then communicate with the scoring table using hand signals. In this paper, we propose a novel system that can both recognize the referees' signals and communicate with the scoring table in real time. A deep belief network and time-domain features are utilized to analyze two heterogeneous signals, surface electromyography (sEMG) and three-axis accelerometer (ACC) data, to recognize dynamic gestures. Our recognition method is evaluated on a dataset of 9 official hand signals performed by 11 subjects. Our recognition model achieves accuracy rates of 97.9% and 90.5% in 5-fold Cross Validation (5-fold CV) and Leave-One-Participant-Out Cross Validation (LOPOCV) experiments, respectively. The LOPOCV accuracy can be further improved to 94.3% by applying user calibration.

Chung-Wei Yeh, Tse-Yu Pan, Min-Chun Hu

Compact CNN Based Video Representation for Efficient Video Copy Detection

Many content-based video copy detection (CCD) systems have been proposed to identify copies of a copyrighted video. Due to storage cost and retrieval response requirements, most CCD systems represent video content using sparsely sampled features, which tends to lose information to some extent and thus results in unsatisfactory performance. In this paper, we propose a compact video representation based on convolutional neural networks (CNNs) and sparse coding (SC) for video copy detection. We first extract CNN features from densely sampled video frames and then encode them into a fixed-length vector via the SC method. The proposed representation has two advantages: first, it is compact and independent of the sampling frame rate; second, it is discriminative for video copy detection because it encodes the CNN features of densely sampled frames. We evaluate the proposed representation on video copy detection over a real, complex video dataset, and a marginal performance improvement is achieved compared to state-of-the-art CCD systems.

Ling Wang, Yu Bao, Haojie Li, Xin Fan, Zhongxuan Luo

Cross-Modal Recipe Retrieval: How to Cook this Dish?

In social media, users like to share food pictures. One intelligent feature, potentially attractive to amateur chefs, is the recommendation of a recipe along with the food picture. Providing this feature, unfortunately, is still technically challenging. First, current food recognition technology only scales to a few hundred categories, which is not yet practical for recognizing tens of thousands of food categories. Second, even one food category can have variant recipes that differ in ingredient composition. Finding the best-matching recipe requires knowledge of the ingredients, which is a fine-grained recognition problem. In this paper, we consider the problem from the viewpoint of cross-modality analysis. Given a large number of image-recipe pairs acquired from the Internet, a joint space is learnt that locally captures the ingredient correspondence between images and recipes. As learning happens at the region level for images and at the ingredient level for recipes, the model is able to generalize recognition to unseen food categories. Furthermore, the embedded multi-modal ingredient feature sheds light on the retrieval of best-matching recipes. On an in-house dataset, our model doubles the retrieval performance of DeViSE, a popular cross-modality model that does not consider region information during learning.

Jingjing Chen, Lei Pang, Chong-Wah Ngo

Deep Learning Based Intelligent Basketball Arena with Energy Image

With the development of computer vision and artificial intelligence technologies, the "Intelligent Arena" is becoming one of the newly emerging applications and research topics. Different from conventional sports video highlight detection, an intelligent playground can provide real-time automatic sports video broadcasting, highlight video generation, and sports technique analysis. In this paper, we propose a deep learning based intelligent basketball arena system to automatically broadcast basketball matches. First, with multiple cameras around the playground, the proposed system automatically selects the best camera to provide a real-time high-quality broadcast. Furthermore, with the basketball energy image (BEI) and a deep convolutional neural network, we accurately capture scoring clips as highlight video clips for action replay and online sharing. Finally, evaluations on a real-world basketball match dataset demonstrate that the proposed system obtains 94.59% accuracy with only 45 ms processing time per frame (10 ms for live camera selection, 30 ms for hotspot area detection, and 5 ms for BEI+CNN). Owing to this outstanding performance, the proposed system has already been integrated into commercial intelligent basketball arena applications.

Wu Liu, Jiangyu Liu, Xiaoyan Gu, Kun Liu, Xiaowei Dai, Huadong Ma

Efficient Multi-scale Plane Extraction Based RGBD Video Segmentation

To improve the robustness and efficiency of RGBD video segmentation, we propose a novel segmentation method combining multi-scale plane extraction with hierarchical graph-based video segmentation. First, to reduce depth noise, we extract plane structures from 3D RGBD point clouds at three levels (voxel, pixel and neighborhood) using geometry and color features. To handle the uneven distribution of depth data and the object occlusion problem, we further propose a multi-scale voxel-based plane fusion algorithm and use an amodal completion strategy to improve plane extraction. Hierarchical graph-based RGBD video segmentation is then used to segment the remaining non-plane pixels. Finally, we fuse the plane extraction and video segmentation results to obtain the final RGBD scene segmentation. Qualitative and quantitative results for both plane extraction and RGBD scene video segmentation show the effectiveness of the proposed methods.

Hong Liu, Jun Wang, Xiangdong Wang, Yueliang Qian

Human Pose Tracking Using Online Latent Structured Support Vector Machine

Tracking human poses in a video is a challenging problem with numerous applications. The task is particularly difficult in realistic scenes because of several intrinsic and extrinsic factors, including complicated and fast movements, occlusions and lighting changes. We propose an online learning approach for tracking human poses using a latent structured Support Vector Machine (SVM). The first frame of a video is used for training: body parts are initialized by the user and tracking models are learned using the latent structured SVM. The models are updated for each subsequent frame of the sequence. To handle occlusions, we formulate a Prize-Collecting Steiner Tree (PCST) problem and use a branch-and-cut algorithm to refine the detection of body parts. Experiments on several challenging videos demonstrate that the proposed method outperforms two state-of-the-art methods.

Kai-Lung Hua, Irawati Nurmala Sari, Mei-Chen Yeh

Micro-Expression Recognition by Aggregating Local Spatio-Temporal Patterns

A micro-expression is an extremely quick facial expression that reveals people's hidden emotions; it has become one of the most important clues for lie detection as well as many other applications. Current methods mostly focus on micro-expression recognition in simplified environments. This paper aims to develop a discriminative feature descriptor that is less sensitive to variations in pose, illumination, etc., and thus better supports the recognition task. Our novelty lies in the use of local statistical features from interest regions in which AUs (Action Units) indicate micro-expressions, and in the combination of these features for recognition. To this end, we first use a face alignment algorithm to locate the facial landmarks in each video frame. The aligned face is then divided into several specific regions (facial cubes) based on the locations of the landmarks. The movement tendency and intensity in each region are extracted using an optical flow orientation histogram and the Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) feature, respectively. The two kinds of features are concatenated region by region to generate the proposed local statistical descriptor. We evaluate the descriptor with state-of-the-art classifiers. We observe that the proposed local statistical descriptor, which is localized by the facial spatial distribution, captures more detailed and representative information than global features, and that the fusion of different local features reveals more characteristics of micro-expressions than a single feature, leading to better experimental results.

Shiyu Zhang, Bailan Feng, Zhineng Chen, Xiangsheng Huang
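
The region-by-region concatenation step can be sketched in a few lines. The toy below assumes the per-region optical-flow orientation histogram and LBP-TOP histogram have already been computed (both the function name and the L1 normalization are our illustrative choices, not details taken from the paper):

```python
def l1_normalize(hist):
    """Scale a histogram so its entries sum to 1 (all-zero stays zero)."""
    s = float(sum(hist)) or 1.0
    return [x / s for x in hist]

def local_statistical_descriptor(regions):
    """regions: ordered list of (flow_hist, lbp_hist) pairs, one pair per
    facial cube. Concatenates both histograms region by region into one
    flat descriptor vector."""
    desc = []
    for flow_hist, lbp_hist in regions:
        desc.extend(l1_normalize(flow_hist))
        desc.extend(l1_normalize(lbp_hist))
    return desc
```

With, say, a 4-bin flow histogram and an 8-bin LBP-TOP histogram per region, two regions yield a 24-dimensional descriptor that any off-the-shelf classifier can consume.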

egoPortray: Visual Exploration of Mobile Communication Signature from Egocentric Network Perspective

The coming big data era calls for new methodologies to process and analyze huge volumes of data. Visual analytics is becoming increasingly crucial in data analysis, presentation, and exploration. Communication data is significant for studying human interactions and social relationships. In this paper, we propose a visual analytics system named egoPortray to interactively analyze communication data based on a directed weighted ego network model. An ego network (EN) is composed of a centered individual (the ego), its direct contacts (alters), and the interactions among them. Based on the EN model, egoPortray presents an overall statistical view for grasping the distributions and correlations of EN features, and a glyph-based group view that illustrates the key EN features for comparing different egos. The proposed system and the idea of the ego network can be generalized and applied in other fields where a network structure exists.

Qing Wang, Jiansu Pu, Yuanfang Guo, Zheng Hu, Hui Tian
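
The directed weighted ego network model can be made concrete with a small sketch. Given call records as (caller, callee, weight) triples, the ego network keeps the ego, everyone it exchanged calls with (the alters), and every edge among those members; names and weights below are invented for illustration:

```python
def ego_network(ego, calls):
    """Build the directed, weighted ego network of `ego` from call records.
    calls: iterable of (caller, callee, weight) triples.
    Returns (alters, edges): alters are the ego's direct contacts; edges
    keep only pairs whose endpoints both lie in {ego} | alters."""
    alters = ({b for a, b, _ in calls if a == ego}
              | {a for a, b, _ in calls if b == ego})
    members = alters | {ego}
    edges = {(a, b): w for a, b, w in calls if a in members and b in members}
    return alters, edges

# Hypothetical call log: "C" and "D" never talk to the ego,
# so they and their calls fall outside the ego network.
calls = [("ego", "A", 3), ("B", "ego", 2), ("A", "B", 1),
         ("A", "C", 4), ("C", "D", 5)]
```

Per-ego features (in-/out-degree, edge density among alters, weight distribution) computed on such networks are what a glyph-based view would then compare across egos.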

i-Stylist: Finding the Right Dress Through Your Social Networks

Searching the Web has become an everyday task for most people. However, the presence of too much information can cause information overload: when shopping online, for example, a user can easily be overwhelmed by too many choices. To this end, we propose a personalized clothing recommendation system, named i-Stylist, based on the analysis of personal images in social networks. From the available personal images of a user, the i-Stylist system extracts a number of characteristics from each clothing item, such as CNN feature vectors and metadata describing the color, material, and pattern of the fabric. These clothing items are then organized as a fully connected graph, which is later used to infer the personalized probability distribution of how much the user will like each clothing item on a shopping website. The user can modify the graph structure, e.g. by adding and deleting vertices, through feedback on the retrieved clothing items. The i-Stylist system is compared against two baselines and shown to perform better.

Jordi Sanchez-Riera, Jun-Ming Lin, Kai-Lung Hua, Wen-Huang Cheng, Arvin Wen Tsui
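
As a loose sketch of turning item similarities into a personalized distribution (this is our simplification, not the paper's graph-based inference), the toy below scores each shop item by its mean cosine similarity to the user's wardrobe vectors and normalizes the scores with a softmax; all item names and vectors are invented:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def personalized_distribution(user_items, shop_items):
    """Score each shop item by mean similarity to the user's items,
    then softmax the scores into a probability distribution."""
    scores = {name: sum(cosine(v, u) for u in user_items) / len(user_items)
              for name, v in shop_items.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {name: math.exp(s) / z for name, s in scores.items()}

# Hypothetical 2-D stand-ins for CNN feature vectors.
user_items = [[1.0, 0.0], [0.9, 0.1]]
shop_items = {"dress": [1.0, 0.0], "coat": [0.0, 1.0]}
```

User feedback (adding or deleting graph vertices) would simply change `user_items` before the distribution is recomputed.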

SS4: Multimedia and Multimodal Interaction for Health and Basic Care Applications

Frontmatter

Boredom Recognition Based on Users’ Spontaneous Behaviors in Multiparty Human-Robot Interactions

Recognizing boredom in users interacting with machines is valuable for improving user experience in long-term human-machine interactions, especially for intelligent tutoring systems, health-care systems, and social assistants. This paper proposes a two-stage framework and feature design for boredom recognition in multiparty human-robot interactions. In the first stage, the framework detects boredom-indicating user behaviors based on skeletal data obtained by motion capture; it then recognizes boredom by combining the detection results with two types of multiparty information, i.e., gaze direction toward other participants and the coming and going of participants. We experimentally confirmed the effectiveness of both the proposed framework and the multiparty information. Compared with a simple baseline method, the proposed framework gained 35 percentage points in F1 score.

Yasuhiro Shibasaki, Kotaro Funakoshi, Koichi Shinoda

Classification of sMRI for AD Diagnosis with Convolutional Neuronal Networks: A Pilot 2-D+$$\epsilon $$ Study on ADNI

In interactive health care systems, Convolutional Neural Networks (CNNs) are starting to find applications, e.g. the classification of structural Magnetic Resonance Imaging (sMRI) scans for Alzheimer’s disease Computer-Aided Diagnosis (CAD). In this paper we focus on the morphology of the hippocampus, which is known to be affected as the illness progresses. We use a subset of the ADNI (Alzheimer’s Disease Neuroimaging Initiative) database to classify images belonging to Alzheimer’s disease (AD), mild cognitive impairment (MCI), and normal control (NC) subjects. As the number of images in such studies is rather limited relative to the needs of CNNs, we propose a data augmentation strategy adapted to the specificity of sMRI scans. We also propose a 2-D+$$\epsilon $$ approach, in which only a very limited number of consecutive slices is used for training and classification. Tests conducted on only one projection (sagittal) show that this approach provides good classification accuracies (AD/NC 82.8%, MCI/NC 66%, AD/MCI 62.5%), which are promising for the integration of this 2-D+$$\epsilon $$ strategy into more complex multi-projection and multi-modal schemes.

Karim Aderghal, Manuel Boissenin, Jenny Benois-Pineau, Gwenaëlle Catheline, Karim Afdel
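
The 2-D+$$\epsilon $$ idea — training on a thin stack of 2*$$\epsilon $$+1 consecutive slices rather than the full volume — can be sketched as below. The slice indexing and the augmentation operations (flips and one-pixel shifts) are illustrative stand-ins; the paper's actual augmentation is tailored to sMRI specifics and may well differ:

```python
def two_d_plus_eps(volume, center, eps=1):
    """Pick 2*eps+1 consecutive sagittal slices around `center` from a
    volume indexed as volume[sagittal][row][col]."""
    return volume[center - eps: center + eps + 1]

def augment(slices, max_shift=1):
    """Toy augmentation: horizontal flips combined with small circular
    column shifts, applied consistently to every slice of the stack."""
    out = []
    for dx in range(-max_shift, max_shift + 1):
        for flip in (False, True):
            sample = []
            for sl in slices:
                rows = [row[::-1] if flip else list(row) for row in sl]
                sample.append([r[-dx:] + r[:-dx] if dx else r for r in rows])
            out.append(sample)
    return out

# Hypothetical 5-slice volume of 3x4 "voxels"; slice 2 stands in for the
# sagittal plane through the hippocampus ROI.
volume = [[[s * 100 + y * 10 + x for x in range(4)]
           for y in range(3)] for s in range(5)]
```

Each original stack thus yields several training samples, which is the point when only a few dozen subjects per class are available.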

Deep Learning for Shot Classification in Gynecologic Surgery Videos

In the last decade, advances in endoscopic surgery have produced vast amounts of video data, which are used for documentation, analysis, and education. In order to find video scenes relevant for these purposes, physicians manually search and annotate hours of endoscopic surgery videos. This process is tedious and time-consuming, motivating the (semi-)automatic annotation of such surgery videos. In this work, we investigate whether a single-frame model for semantic surgery shot classification is feasible and useful in practice. We approach this problem by further training AlexNet, an already pre-trained CNN architecture, and are thus able to transfer knowledge gathered from the ImageNet database to the medical use case of shot classification in endoscopic surgery videos. We annotate hours of endoscopic surgery videos to obtain training and testing data. Our results imply that the CNN-based single-frame classification approach can provide useful suggestions to medical experts while they annotate video scenes, thereby improving the annotation process. Future work shall consider more sophisticated classification methods that incorporate the temporal video dimension, which is expected to improve on the baseline evaluation done in this work.

Stefan Petscharnig, Klaus Schöffmann
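
A single-frame model emits one prediction per frame, so some pooling step must turn those into a shot-level suggestion for the annotator. The sketch below omits the CNN entirely and shows only one plausible pooling rule (majority vote with an agreement threshold); the labels and threshold are invented, not taken from the paper:

```python
from collections import Counter

def classify_shot(frame_labels, min_agreement=0.5):
    """Aggregate per-frame class predictions into one shot-level
    suggestion: return the majority class if it covers at least
    `min_agreement` of the frames, else None (no confident suggestion,
    so the annotator decides unaided)."""
    if not frame_labels:
        return None
    label, count = Counter(frame_labels).most_common(1)[0]
    return label if count / len(frame_labels) >= min_agreement else None
```

Thresholding the agreement is one simple way to trade suggestion coverage against suggestion precision; a temporal model, as the abstract notes, would replace this pooling altogether.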

Description Logics and Rules for Multimodal Situational Awareness in Healthcare

We present a framework for semantic situation understanding and interpretation of multimodal data using Description Logics (DL) and rules. More precisely, we use DL models to formally describe contextualised dependencies among verbal and non-verbal descriptors in multimodal natural language interfaces, while context aggregation, fusion, and interpretation are supported by SPARQL rules. Both background knowledge and multimodal data, e.g. language analysis results and facial expressions and gestures recognized from multimedia streams, are captured as OWL 2 ontology axioms, the de facto standard formalism for DL models on the Web, fostering reusability, adaptability, and interoperability of the framework. The framework has been applied in the field of healthcare, providing the models for the semantic enrichment and fusion of verbal and non-verbal descriptors in dialogue-based systems.

Georgios Meditskos, Stefanos Vrochidis, Ioannis Kompatsiaris

Speech Synchronized Tongue Animation by Combining Physiology Modeling and X-ray Image Fitting

This paper proposes a system that produces speech-synchronized tongue animation from text or speech. First, an anatomically accurate physiological tongue model is built, which produces a large number of tongue deformation samples from randomly generated muscle activation samples. Second, these input and output samples are used to train a neural network that establishes the relationship between muscle activation and tongue contour deformation. Third, the neural network estimates the non-rigid tongue movement parameters, namely the tongue muscle activations, from a collected X-ray image database of tongue movements for Mandarin Chinese phonemes, after the rigid tongue movement has been removed; the estimates are then used to construct a tongue physeme database (sequences of tongue muscle activations and rigid movements) corresponding to the Mandarin Chinese phoneme database. Finally, the physemes corresponding to the phonemes extracted from the input text or speech are blended, according to the phoneme durations, to drive the physiological tongue model and produce the speech-synchronized tongue animation. Simulation results demonstrate that the synthesized tongue animations are visually realistic and approximate tongue medical data well.

Jun Yu

Backmatter
