## About this book

The two-volume set LNCS 11961 and 11962 constitutes the thoroughly refereed proceedings of the 26th International Conference on MultiMedia Modeling, MMM 2020, held in Daejeon, South Korea, in January 2020.

Of the 171 submitted full research papers, 40 papers were selected for oral presentation and 46 for poster presentation; 28 special session papers were selected for oral presentation and 8 for poster presentation; in addition, 9 demonstration papers and 6 papers for the Video Browser Showdown 2020 were accepted. The papers of LNCS 11961 are organized in the following topical sections: audio and signal processing; coding and HVS; color processing and art; detection and classification; face; image processing; learning and knowledge representation; video processing; poster papers; the papers of LNCS 11962 are organized in the following topical sections: poster papers; AI-powered 3D vision; multimedia analytics: perspectives, tools and applications; multimedia datasets for repeatable experimentation; multi-modal affective computing of large-scale multimedia data; multimedia and multimodal analytics in the medical domain and pervasive environments; intelligent multimedia security; demo papers; and VBS papers.

## Table of contents

### Correction to: MultiMedia Modeling

The original version of this book was revised. Due to a technical error, the first volume editor did not appear in the volumes of the MMM 2020 proceedings. A funding number was missing in the acknowledgement section of the chapter titled “AttenNet: Deep Attention Based Retinal Disease Classification in OCT Images.” Both were corrected.

Yong Man Ro, Wen-Huang Cheng, Junmo Kim, Wei-Ta Chu, Peng Cui, Jung-Woo Choi, Min-Chun Hu, Wesley De Neve

### Multi-scale Comparison Network for Few-Shot Learning

Few-shot learning, which learns from a small number of samples, is an emerging field in multimedia. By systematically exploring the influence of scale information, including multi-scale feature extraction, multi-scale comparison, and the increased parameters brought by multiple scales, we present a novel end-to-end model called Multi-scale Comparison Network (MSCN) for few-shot learning. The proposed MSCN uses convolutions at different scales for comparison to address the problem of excessive gaps between target sizes in the images during few-shot learning. It first uses a 4-layer encoder to encode support and testing samples to obtain their feature maps. After deep splicing these feature maps, the proposed MSCN further uses a comparator comprising two layers of multi-scale comparative modules and two fully connected layers to derive the similarity between support and testing samples. Experimental results on two benchmark datasets, Omniglot and *mini*Imagenet, show the effectiveness of the proposed MSCN, which achieves an average improvement of 2% on *mini*Imagenet across all experimental results compared with the recent Relation Network.

Pengfei Chen, Minglei Yuan, Tong Lu

### Semantic and Morphological Information Guided Chinese Text Classification

Recently proposed models such as BERT perform well in many text processing tasks. Through deeper layers and a large amount of text, they obtain context-sensitive features, which provide good semantics for word sense disambiguation. However, for Chinese text classification, the majority of datasets are crawled from social networking sites, and these datasets are semantically complex and variable. How much data is needed to pre-train these models so that they grasp semantic features and understand context remains an open question. In this paper, we propose a novel shallow-layer language model, which uses sememe information to guide the model to grasp semantic information without a large amount of pre-training data. Then, we use the Chinese character representations generated from this model to do text classification. Furthermore, in order to make Chinese as easy to initialize as English, we employ convolutional neural networks over Chinese strokes to get Chinese character structure initialization for our model. This model is pre-trained on a part of the Chinese Wikipedia dataset, and we use the representations generated by this pre-trained model to do text classification. Experiments on text classification datasets show that our model outperforms other state-of-the-art models by a large margin. Also, our model is superior in terms of interpretability due to the introduction of semantic and morphological information.

Jiayu Song, Qinghua Xu, Wei Liu, Yueran Zu, Mengdong Chen

### A Delay-Aware Adaptation Framework for Cloud Gaming Under the Computation Constraint of User Devices

Cloud gaming has emerged as a new trend in the gaming industry, bringing a lot of benefits to both players and service providers. In cloud gaming, it is essential to ensure low end-to-end delay for a good user experience. Hence, sufficient computational resources must be available at the client in order to process video in a timely manner. However, thin clients such as mobile devices generally have limited computation capabilities. Thus, the available computational resources may be insufficient to support the client, for example when the battery is low. In this paper, we propose a new adaptation framework for resource-constrained cloud gaming clients. The proposed framework combines frame skipping at the server and frame discarding at the client according to the available computational resources of the client. Experimental results show that the proposed framework can significantly improve video quality under a given delay constraint compared to conventional methods.

Duc V. Nguyen, Huyen T. T. Tran, Truong Cong Thang

### Efficient Edge Caching for High-Quality 360-Degree Video Delivery

360-degree video streaming, which offers an immersive viewing experience, is getting more and more popular. However, delivering 360-degree video streams at scale is challenging due to the extremely large video size and the frequent viewport variations. In order to ease the burden on the network and improve the delivered video quality, we present a caching scheme that places the frequently requested contents at the edge. Our scheme achieves this purpose by presenting: (i) an online learning approach to predict video popularity, and (ii) some strategies for allocating cache for videos and tiles. A thorough evaluation shows that the proposed scheme yields significant traffic reduction over other strategies and does improve the quality of the delivered videos.

Dongbiao He, Jinlei Jiang, Cédric Westphal, Guangwen Yang
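
As a rough illustration of the two ingredients named in the abstract above (online popularity prediction and cache allocation for videos and tiles), the sketch below pairs an exponentially weighted popularity estimate with a greedy cache fill. Both are stand-ins chosen for brevity and are not the authors' algorithms; `PopularityTracker`, `fill_cache`, the value of `alpha`, and the tile names and sizes are all hypothetical.

```python
from collections import defaultdict


class PopularityTracker:
    """Online popularity estimate via an exponentially weighted moving average."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha  # weight of the newest time slot (assumed value)
        self.score = defaultdict(float)

    def update(self, requests_in_slot):
        """Blend per-item request counts from one time slot into the scores."""
        seen = set(requests_in_slot) | set(self.score)
        for item in seen:
            count = requests_in_slot.get(item, 0)
            self.score[item] = (1 - self.alpha) * self.score[item] + self.alpha * count


def fill_cache(scores, sizes, capacity):
    """Greedily cache the items with the highest score per unit of size."""
    ranked = sorted(scores, key=lambda item: scores[item] / sizes[item], reverse=True)
    cached, used = [], 0
    for item in ranked:
        if used + sizes[item] <= capacity:
            cached.append(item)
            used += sizes[item]
    return cached


# Hypothetical request counts for two time slots and hypothetical tile sizes.
tracker = PopularityTracker(alpha=0.5)
tracker.update({"v1/tile0": 8, "v1/tile1": 2})
tracker.update({"v1/tile0": 4, "v2/tile0": 6})
sizes = {"v1/tile0": 2, "v1/tile1": 2, "v2/tile0": 3}
cache = fill_cache(tracker.score, sizes, capacity=4)
```

With these toy numbers the most popular tile is cached first, the larger `v2/tile0` no longer fits, and the remaining space goes to `v1/tile1`.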

### Inferring Emphasis for Real Voice Data: An Attentive Multimodal Neural Network Approach

To understand speakers’ attitudes and intentions in real Voice Dialogue Applications (VDAs), effective emphasis inference from users’ queries may play an important role. However, VDAs serve a tremendous number of uncertain speakers with a great diversity of dialects and expression preferences, which challenges traditional emphasis detection methods. In this paper, to better infer emphasis for real voice data, we propose an attentive multimodal neural network. Specifically, first, besides the acoustic features, extensive textual features are applied in modelling. Then, considering the independence of the features, we model the multi-modal features utilizing a Multi-path convolutional neural network (MCNN). Furthermore, combining high-level multi-modal features, we train an emphasis classifier by attending to the textual features with an attention-based bidirectional long short-term memory network (ABLSTM), to comprehensively learn discriminative features from diverse users. Our experimental study based on a real-world dataset collected from Sogou Voice Assistant ( https://yy.sogou.com/ ) shows that our method outperforms alternative baselines by 1.0–15.5% in terms of F1 measure.

Suping Zhou, Jia Jia, Long Zhang, Yanfeng Wang, Wei Chen, Fanbo Meng, Fei Yu, Jialie Shen

### PRIME: Block-Wise Missingness Handling for Multi-modalities in Intelligent Tutoring Systems

Block-wise missingness in multimodal data poses a challenging barrier for the analysis over it, which is quite common in practical scenarios such as the multimedia intelligent tutoring systems (ITSs). In this work, we collected data from 194 undergraduates via a biology ITS which involves three modalities: student-system logfiles, facial expressions, and eye tracking. However, only 32 out of the 194 students had all three modalities and 83% of them were missing the facial expression data, eye tracking data, or both. To handle such a block-wise missing problem, we propose a Progressively Refined Imputation for Multi-modalities by auto-Encoder (PRIME), which trains the model based on single, pairwise, and entire modalities for imputation in a progressive manner, and therefore enables us to maximally utilize all the available data. We have evaluated PRIME against single-modality log-only (without missingness handling) and five state-of-the-art missing data handling methods on one important yet challenging student modeling task: to predict students’ learning gains. Our results show that using multimodal data as a result of missing data handling yields better prediction performance than using logfiles only, and PRIME outperforms other baseline methods for both learning gain prediction and data reconstruction tasks.

Xi Yang, Yeo-Jin Kim, Michelle Taub, Roger Azevedo, Min Chi
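
The 83% figure in the abstract above is consistent with the stated counts if it is read as the share of all 194 participants who lack at least one modality, which is the interpretation assumed here:

$$\frac{194 - 32}{194} = \frac{162}{194} \approx 0.835$$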

### A New Local Transformation Module for Few-Shot Segmentation

Few-shot segmentation segments object regions of new classes with only a few manual annotations. Its key step is to establish the transformation module between support images (annotated images) and query images (unlabeled images), so that the segmentation cues of support images can guide the segmentation of query images. Existing methods build the transformation model on global cues, which ignores the local cues that this paper verifies to be very important for the transformation. This paper proposes a new transformation module based on local cues, where the relationship of the local features is used for transformation. To enhance the generalization performance of the network, the relationship matrix is calculated in a high-dimensional metric embedding space based on cosine distance. In addition, to handle the challenging mapping problem from the low-level local relationships to high-level semantic cues, we propose to apply the generalized inverse of the annotation matrix of support images to transform the relationship matrix linearly, which is non-parametric and class-agnostic. The result of the matrix transformation can be regarded as an attention map with high-level semantic cues, based on which a transformation module can be built simply. The proposed transformation module is a general module that can replace the transformation module in existing few-shot segmentation frameworks. We verify the effectiveness of the proposed method on the Pascal VOC 2012 dataset. The mIoU reaches 57.0% in the 1-shot setting and 60.6% in the 5-shot setting, which outperforms the state-of-the-art method by 1.6% and 3.5%, respectively.

Yuwei Yang, Fanman Meng, Hongliang Li, Qingbo Wu, Xiaolong Xu, Shuai Chen

### Background Segmentation for Vehicle Re-identification

Vehicle re-identification (Re-ID) is very important in intelligent transportation and video surveillance. Prior works focus on extracting discriminative features from the visual appearance of vehicles or on using visual-spatio-temporal information. However, background interference in vehicle re-identification has not been explored. In actual large-scale spatio-temporal scenes, the same vehicle usually appears in different backgrounds while different vehicles might appear in the same background, which seriously affects re-identification performance. To the best of our knowledge, this paper is the first to consider the background interference problem in vehicle re-identification. We construct a vehicle segmentation dataset and develop a vehicle Re-ID framework with a background interference removal (BIR) mechanism to improve vehicle Re-ID performance as well as robustness against complex backgrounds in large-scale spatio-temporal scenes. Extensive experiments demonstrate the effectiveness of our proposed framework, with an average 9% gain in mAP over state-of-the-art vehicle Re-ID algorithms.

Mingjie Wu, Yongfei Zhang, Tianyu Zhang, Wenqi Zhang

### Face Tells Detailed Expression: Generating Comprehensive Facial Expression Sentence Through Facial Action Units

Human facial expression plays a key role in the understanding of social behavior. Many deep learning approaches present facial emotion recognition and automatic image captioning considering human sentiments. However, most current deep learning models for facial expression analysis do not contain comprehensive, detailed information about a single face. In this paper, we introduce a new text-based facial expression description using several essential components describing comprehensive facial expression: gender, facial action units, and corresponding intensities. Then, we propose a comprehensive facial expression sentence generation model along with a facial expression recognition model for a single facial image to verify the effectiveness of our text-based dataset. Experimental results show that the two proposed models support each other and improve each other’s performance: the text-based facial expression description provides comprehensive semantic information to the facial emotion recognition model, and the visual information from the emotion recognition model guides the facial expression sentence generation to produce a proper sentence giving a comprehensive description. The text-based dataset is available at https://github.com/joannahong/Text-based-dataset-with-comprehensive-facial-expression-sentence .

Joanna Hong, Hong Joo Lee, Yelin Kim, Yong Man Ro

### A Deep Convolutional Deblurring and Detection Neural Network for Localizing Text in Videos

Scene text in video is usually vulnerable to various blurs, such as those caused by camera or text motion, which makes it more difficult to reliably extract the text for content-based video applications. In this paper, we propose a novel fully convolutional deep neural network for deblurring and detecting text in video. Specifically, to cope with the blur of video text, we propose an effective deblurring subnetwork that is composed of multi-level convolutional blocks with both cross-block (long) and within-block (short) skip connections for progressively learning residual deblurred image details, as well as a spatial attention mechanism to pay more attention to blurred regions, and that generates a sharper image for the current frame by fusing multiple surrounding adjacent frames. To further localize text in the frames, we enhance the EAST text detection model by introducing deformable convolution layers and deconvolution layers, which better capture the widely varied appearances of video text. Experiments on the public scene text video dataset demonstrate the state-of-the-art performance of the proposed video text deblurring and detection model.

Yang Wang, Ye Qian, Jiahao Shi, Feng Su

### Generate Images with Obfuscated Attributes for Private Image Classification

Image classification is widely used in various applications, and some companies collect a large amount of data from users to train classification models for commercial profitability. To prevent disclosure of private information caused by direct data collection, Google proposed federated learning, which shares model parameters rather than data. However, although this framework addresses the problem of direct data leakage, it cannot defend against inference attacks: malicious participants can still extract attribute information from the model parameters. In this paper, we propose a novel method based on StarGAN to generate images with obfuscated attributes. The images generated by our method retain the non-private attributes of the original image but protect its specific private attributes by mixing the original image and the artificial image with obfuscated attributes. Experimental results show that the model trained on the artificial image dataset can effectively defend against property inference attacks with negligible accuracy loss on the classification task in a federated learning environment.

Wei Hou, Dakui Wang, Xiaojun Chen

### Context-Aware Residual Network with Promotion Gates for Single Image Super-Resolution

Deep learning models have achieved significant success in a large number of vision-based applications. However, directly applying deep structures to perform single image super-resolution (SISR) results in poor visual effects such as blurry patches and loss of detail, which are caused by the fact that low-frequency information is treated equally and ambiguously across different patches and channels. To ease this problem, we propose a novel context-aware deep residual network with promotion gates, named the G-CASR network, for SISR. In the proposed G-CASR network, a sequence of G-CASR modules is cascaded to transform low-resolution features into highly informative features. In each G-CASR module, we also design a dual-attention residual block (DRB) to capture abundant and varied context information by dually connecting spatial and channel attention schemes. To improve the informative ability of the extracted context information, a promotion gate (PG) is further applied to analyze inherent characteristics of the input data at each module, thus offering insight into how to enhance contributive information and suppress useless information. Experiments on five public datasets, namely Set5, Set14, B100, Urban100 and Manga109, show that the proposed G-CASR achieves an average improvement of 1.112/0.0255 in PSNR/SSIM compared with recent methods including SRCNN, VDSR, lapSRN and EDSR. At the same time, the proposed G-CASR requires only about 25% of the memory cost of EDSR.

Xiaozhong Ji, Yirui Wu, Tong Lu

### A Compact Deep Neural Network for Single Image Super-Resolution

Convolutional neural networks (CNNs) have recently been applied to the single image super-resolution (SISR) task. However, the applied CNN models are increasingly cumbersome, which causes heavy memory and computational burdens when they are deployed in realistic applications. Besides, existing CNNs for SISR have trouble handling information at different scales with the same kernel size. In this paper, we propose a compact deep neural network (CDNN) to (1) reduce the number of model parameters, (2) decrease computational operations, and (3) process information at different scales. We devise two kinds of channel-wise scoring units (CSUs), the adaptive channel-wise scoring unit (ACSU) and the constant channel-wise scoring unit (CCSU), which act as judges that score different channels. With further sparsity regularization imposed on the CSUs and subsequent pruning of low-score channels, we can achieve considerable storage savings and computation simplification. In addition, the CDNN contains a dense inception structure whose convolutional kernels have different sizes. This enables the CDNN to cope with information at different scales in one natural image. We demonstrate the effectiveness of the CSUs and the dense inception structure on benchmarks, and the proposed CDNN shows superior performance over other methods.

Xiaoyu Xu, Jian Qian, Li Yu, Shengju Yu, Hao Tao, Ran Zhu

### An Efficient Algorithm of Facial Expression Recognition by TSG-RNN Network

Facial expression recognition remains a challenging problem, and small datasets further exacerbate the task. Most previous works perform facial expression recognition by fine-tuning a network pre-trained on a related domain, which inevitably has limitations. In this paper, we propose an optimal CNN model that uses transfer learning and fuses three characteristics: spatial, temporal and geometric information. Also, the proposed CNN module is composed of two-fold structures and can be trained quickly. Evaluation experiments show that the proposed method is comparable to or better than most of the state-of-the-art approaches in both recognition accuracy and training speed.

Kai Huang, Jianjun Li, Shichao Cheng, Jie Yu, Wanyong Tian, Lulu Zhao, Junfeng Hu, Chin-Chen Chang

### Structured Neural Motifs: Scene Graph Parsing via Enhanced Context

A scene graph is a kind of structured representation of the visual content in an image. It is helpful for complex visual understanding tasks such as image captioning, visual question answering and semantic image retrieval. Since real-world images always have multiple object instances and complex relationships, context information is extremely important for scene graph generation. It has been noted that the context dependencies among different nodes in the scene graph are asymmetric, which means it is highly possible to directly predict relationship labels based on object labels but not vice versa. Based on this finding, the existing motifs network has successfully exploited the context patterns among object nodes and the dependencies between the object nodes and the relation nodes. However, the spatial information and the context dependencies among relation nodes are neglected. In this work, we propose the Structured Motif Network (StrcMN), which predicts object labels and pairwise relationships by mining more complete global context features. The experiments show that our model significantly outperforms previous methods on the VRD and Visual Genome datasets.

Yiming Li, Xiaoshan Yang, Changsheng Xu

### Perceptual Localization of Virtual Sound Source Based on Loudspeaker Triplet

When using a loudspeaker triplet for virtual sound localization, the traditional conversion method results in inaccurate localization. In this paper, we constructed a perceptual localization distortion model based on the basic principle of binaural perception sound source localization and relying on the known PKU HRTFs database. On this basis, the perceptual localization errors of virtual sources were calculated by using PKU HRTFs. After analyzing the perceptual localization errors of virtual sources reproduced by loudspeaker triplets, it was found that the main influence factor, i.e., the convergence angle of the loudspeaker triplet, could constrain the perceptual localization distortion. Simulation and subjective evaluation experiments indicate that the proposed selection method outperforms the traditional method, and that the proposed method can be successfully applied to perceptual localization of a moving virtual source.

Duanzheng Guan, Dengshi Li, Xuebei Cai, Xiaochen Wang, Ruimin Hu

### TK-Text: Multi-shaped Scene Text Detection via Instance Segmentation

Benefiting from the development of deep neural networks, scene text detectors have progressed rapidly over the past few years and achieved outstanding performance on several standard benchmarks. However, most existing methods adopt quadrilateral bounding boxes to represent texts, which are usually inadequate for multi-shaped texts such as curved ones. To keep consistent detection performance on both quadrilateral and curved texts, we present a novel representation, i.e., text kernel, for multi-shaped texts. On the basis of the text kernel, we propose a simple yet effective scene text detection method named TK-Text. The proposed method consists of three steps, namely text-context-aware network, segmentation map generation and text kernel based post-clustering. In the text-context-aware network step, we construct a segmentation-based network to extract feature maps from natural scene images, which are further enhanced with text context information extracted by an attention scheme, TKAB. In segmentation map generation, text kernels and rough boundaries of text instances are segmented based on the enhanced feature map. Finally, rough text instances are gradually refined to generate accurate text instances by performing clustering based on the text kernel. Experiments on public benchmarks including SCUT-CTW1500, ICDAR 2015 and ICDAR 2017 MLT demonstrate that the proposed method achieves competitive detection performance compared with existing methods.

Xiaoge Song, Yirui Wu, Wenhai Wang, Tong Lu

### More-Natural Mimetic Words Generation for Fine-Grained Gait Description

A mimetic word is used to verbally express the manner of a phenomenon intuitively. The Japanese language is known to have a greater number of mimetic words in its vocabulary than most other languages. In particular, since human gaits are among the behaviors most commonly represented by mimetic words in the language, we consider that such words should be suitable as labels for fine-grained gait recognition. In addition, Japanese mimetic words have a more decomposable structure than those in other languages such as English. It is therefore said that they have sound-symbolism and that their phonemes are strongly related to the impressions of various phenomena. Thanks to this, native Japanese speakers can express their impressions of these phenomena briefly and intuitively using various mimetic words. Our previous work proposed a framework to convert body-part movements to an arbitrary mimetic word with a regression model. The framework introduced a “phonetic space” based on sound-symbolism, and it enabled fine-grained gait description using generated mimetic words consisting of an arbitrary combination of phonemes. However, this method did not consider the “naturalness” of the description. Thus, in this paper, we propose an improved mimetic word generation module that considers naturalness, and we update the description framework. Here, we define the co-occurrence frequency of the phonemes composing a mimetic word as its naturalness. To investigate the co-occurrence frequency, we collected many mimetic words through a subjective experiment. Through evaluation experiments, we confirmed that the proposed module could describe gaits with more natural mimetic words while maintaining the description accuracy.

Hirotaka Kato, Takatsugu Hirayama, Ichiro Ide, Keisuke Doman, Yasutomo Kawanishi, Daisuke Deguchi, Hiroshi Murase
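
The abstract above defines naturalness as the co-occurrence frequency of the phonemes composing a mimetic word. One simple reading of that definition, sketched below, counts adjacent phoneme pairs in a collection of words and scores a candidate by the average relative frequency of its pairs. Restricting co-occurrence to adjacent pairs, the averaging, and the toy romanized corpus are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def phoneme_pairs(phonemes):
    """Return the adjacent phoneme pairs of one word."""
    return list(zip(phonemes, phonemes[1:]))


def build_cooccurrence(corpus):
    """Count adjacent phoneme pairs over a corpus of phoneme sequences."""
    counts = Counter()
    for phonemes in corpus:
        counts.update(phoneme_pairs(phonemes))
    return counts


def naturalness(phonemes, counts):
    """Average relative frequency of the word's adjacent phoneme pairs."""
    pairs = phoneme_pairs(phonemes)
    total = sum(counts.values())
    if not pairs or total == 0:
        return 0.0
    return sum(counts[p] for p in pairs) / (len(pairs) * total)


# Toy corpus of pre-segmented, romanized words (illustrative only).
corpus = [
    ["s", "u", "t", "a"],
    ["s", "u", "t", "a"],
    ["t", "e", "k", "u"],
]
counts = build_cooccurrence(corpus)
common = naturalness(["s", "u", "t", "a"], counts)
rare = naturalness(["k", "e", "s", "o"], counts)
```

A candidate built from frequently seen pairs (`common`) scores higher than one whose pairs never occur in the corpus (`rare`), which is the property a generation module could use to prefer more natural outputs.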

### Lite Hourglass Network for Multi-person Pose Estimation

Recent multi-person pose estimation networks rely on sequential downsampling and upsampling procedures to capture multi-scale features and on stacking basic modules to reassess local and global contexts. However, the network parameters become huge and difficult to train under limited computational resources. Motivated by this observation, we design a lite version of the Hourglass module that uses hybrid convolution blocks to reduce the number of parameters while maintaining performance. The hybrid convolution block builds multi-context paths with dilated convolutions of different rates, which not only reduces the number of parameters but also enlarges the receptive field. Moreover, due to the limitation of heatmap representation, the networks need extra, non-differentiable post-processing to convert heatmaps to keypoint coordinates. Therefore, we propose a simple and efficient operation based on integral loss to fill this gap specifically for bottom-up pose estimation methods. We demonstrate that the proposed approach achieves better performance than the baseline methods on the challenging MSCOCO benchmark dataset for multi-person pose estimation.

Ying Zhao, Zhiwei Luo, Changqin Quan, Dianchao Liu, Gang Wang
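
The abstract above notes that converting heatmaps to keypoint coordinates normally requires non-differentiable post-processing. A common differentiable alternative, often called soft-argmax or integral regression, takes the expected coordinate under a softmax over the heatmap. The sketch below shows that generic operation in plain Python; it is assumed to be close in spirit to the integral-loss operation mentioned in the abstract, whose exact form is not given there. The function name and the toy heatmap are hypothetical.

```python
import math


def soft_argmax_2d(heatmap):
    """Expected (x, y) position under a softmax over all heatmap cells."""
    peak = max(max(row) for row in heatmap)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp(v - peak) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y


# Toy 3x3 heatmap with a strong response at column 2, row 1.
heatmap = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 9.0],
    [0.0, 0.0, 0.0],
]
x, y = soft_argmax_2d(heatmap)
```

Because the result is a weighted average rather than a hard maximum, it lands very close to the peak at (2, 1) here while remaining smooth in the heatmap values; a flat heatmap yields the center of the grid.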

### Single View Depth Estimation via Dense Convolution Network with Self-supervision

Depth estimation from a single image by deep learning is a hot research topic nowadays. Existing methods mainly focus on learning neural networks supervised by ground truth. This paper proposes a method for single view depth estimation based on a convolutional neural network with self-supervision. Firstly, a modified dense encoder-decoder architecture is employed to predict the disparity maps of an image, which can then be converted into depth, and only a single image is fed to predict depth at test time. Secondly, stereo pairs without ground truth are used as samples to generate supervision signals by synthesizing the predicted results during network training, which is referred to as network training in a self-supervised manner. Finally, a novel loss function is defined which considers not only the similarity between the stereo and synthesized images, but also the inconsistency between the predicted disparities, which can decrease the influence of image illumination. Experimental comparisons against state-of-the-art supervised and unsupervised methods on two public datasets show that the proposed method performs very well for single view depth estimation.

Yunhan Sun, Jinlong Shi, Suqin Bai, Qiang Qian, Zhengxing Sun
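
The abstract above mentions that predicted disparity maps can be converted into depth. For a rectified stereo pair this uses the standard pinhole relation depth = focal length × baseline / disparity. The sketch below applies it per pixel; the function name, the clamp on very small disparities, and the rig parameters are hypothetical and not taken from the paper.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disparity=1e-6):
    """Convert a disparity map (pixels) to depth (meters): depth = f * B / d."""
    return [
        # Clamp tiny disparities to avoid division by zero (assumed safeguard).
        [focal_px * baseline_m / max(d, min_disparity) for d in row]
        for row in disparity_px
    ]


# Hypothetical rig: 720 px focal length, 0.54 m baseline.
depth = disparity_to_depth([[36.0, 72.0]], focal_px=720.0, baseline_m=0.54)
```

Doubling the disparity halves the depth, so nearby objects (large disparity) get small depth values and distant ones get large values.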

### Multi-data UAV Images for Large Scale Reconstruction of Buildings

In this paper, a new energy function is proposed that can aggregate the mesh model generated by the point cloud extracted from the UAV and supplement it with contextual semantics to accurately segment the building, which maximizes the consistency of the extracted buildings to restore detail. The semantic information is also used to improve the consistency of the labels between the semantic segments of the extracted input model to ensure the validity of the separation results. A new method of reconstructing polygon and arc models from unstructured models is proposed to improve large scale reconstruction. It can robustly discover the set of adjacency relations and appropriately repair non-watertight models caused by point cloud loss. The experimental results show that the proposed large scale reconstruction algorithm is suitable for the modeling of complex urban buildings.

Menghan Zhang, Yunbo Rao, Jiansu Pu, Xun Luo, Qifei Wang

### Deformed Phase Prediction Using SVM for Structured Light Depth Generation

In phase-based structured light, absolute phase unwrapping, which is a cumbersome step, is often considered necessary before calculating depth. In this paper, we notice that depth is only related to the deformed phase but not the absolute unwrapped phase. Furthermore, the deformed phase is highly related to the changes of the wrapped reference and captured phases. Based on these findings, we propose a classification-based scheme that can directly report the deformed phase. To be specific, we cast the problem of inferring the fringe order difference as a multi-class classification task, where phase samples within half a period are fed to the classifier and the fringe order difference is the class. Besides, we use a radial basis function support vector machine as the classifier. In such a manner, for every pixel, the deformed phase can be obtained directly without knowing the absolute unwrapped phase. Moreover, the proposed method only needs phase from a single frequency and is pixel-independent, so it is free from troubles such as poor real-time performance in temporal unwrapping or error accumulation in spatial unwrapping. Experiments on 3dsmax data and real-captured data show that the proposed method can produce high quality depth maps.

Sen Xiang, Qiong Liu, Huiping Deng, Jin Wu, Li Yu

### Extraction of Multi-class Multi-instance Geometric Primitives from Point Clouds Using Energy Minimization

Point clouds play a vital role in self-driving vehicles, interactive media and other applications. However, how to efficiently and robustly extract multiple geometric primitives from point clouds is still a challenge. In this paper, a novel algorithm for extracting multiple instances of multiple classes of geometric primitives is proposed. First, a new sampling strategy is applied to generate model hypotheses. Next, an energy function is formulated from the view of point labelling. Then, an improved optimization technique is used to minimize the energy. After that, the hypotheses and parameters are refined. This process is iterated until the energy no longer decreases. Finally, multiple instances of multiple classes of geometric primitives are correctly and robustly extracted. Different from existing methods, the type and number of models can be automatically determined. Experimental results validate the proposed algorithm.

Liang Wang, Biying Yan, Fuqing Duan, Ke Lu

### Similarity Graph Convolutional Construction Network for Interactive Action Recognition

Interactive action recognition is a challenging problem in computer vision. Skeleton-based action recognition has shown strong performance in recent years, but the non-Euclidean structure of the skeleton poses a major challenge to the design of deep neural networks. For interactive action recognition, previous studies are based on a fixed skeleton graph, which captures only information about local body movements in a single action and does not deal with the relationship between two or more people. In this article, we present a similarity graph convolutional network that incorporates two-person interaction information. The model can represent the relationship between two people and, at the same time, handle the relationships between different body parts (such as head and hand). The model has two construction modes, a skeleton graph and a similarity graph, and the features from the two modes are fused by a hypergraph. The similarity graph is obtained by a two-step construction. First, an encoder is designed to map the different characteristics of one joint to the same vector space. Second, we calculate the similarity between different joints to construct the similarity graph. Constructed in this way, the similarity graph can indicate the relationship between two people in detail. We perform experiments on the NTU RGB+D dataset and verify the effectiveness of our model. The results show that our approach outperforms the state-of-the-art methods and that the similarity graph can solve the relationship modeling problem in interactive action recognition.

Xiangyu Sun, Qiong Liu, You Yang
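The second construction step in the abstract above (computing similarities between joint embeddings to build a graph) might look roughly like the following. This is a minimal sketch under assumptions the abstract does not confirm: cosine similarity is used as the similarity measure, edges below a hypothetical `threshold` are dropped, and the joint embeddings are taken as given rather than produced by the paper's encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_graph(embeddings, threshold=0.5):
    """Build a weighted adjacency matrix over joints.

    embeddings: one vector per joint (joints of both people concatenated).
    An edge i-j carries the cosine similarity if it reaches `threshold`
    (an assumed hyperparameter); the diagonal stays zero.
    """
    n = len(embeddings)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            score = cosine(embeddings[i], embeddings[j])
            if score >= threshold:
                adjacency[i][j] = score
    return adjacency
```

Because the edges come from feature similarity rather than from the physical skeleton, joints belonging to different people can be connected, which is what lets such a graph carry two-person interaction information.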

### Content-Aware Cubemap Projection for Panoramic Image via Deep Q-Learning

Cubemap projection (CMP) has become a promising panoramic data format because of its efficiency. However, the default CMP coordinate system with a fixed viewpoint may cause distortion, especially around the boundaries of each projection plane. To improve the quality of panoramic images in CMP, we propose a content-aware CMP optimization method via deep Q-learning. The key to this method is predicting an angle for rotating the image in equirectangular projection (ERP), which aims to keep foreground objects away from the edges of each projection plane after the image is re-projected with CMP. First, the panoramic image in ERP is preprocessed to obtain a foreground pixel map. Second, we feed the foreground map into the proposed deep convolutional network (ConvNet) to obtain the predicted rotation angle. The model parameters are trained through the deep Q-learning scheme. Experimental results show that our method keeps more foreground pixels in the center of each projection plane than the baseline.

Zihao Chen, Xu Wang, Yu Zhou, Longhao Zou, Jianmin Jiang
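To make the objective in the abstract above concrete, the sketch below scores a yaw rotation by counting foreground pixels that land near cube-face seams. It deliberately swaps the paper's deep Q-learning predictor for an exhaustive search over column shifts, and it simplifies the geometry: only the four vertical seams between side faces are considered (they lie on meridians at 45, 135, 225 and 315 degrees when column 0 is longitude 0), and the top and bottom faces are ignored. All function names and the `margin` parameter are hypothetical.

```python
def seam_foreground(fg_map, shift, margin=1):
    """Count foreground pixels within `margin` columns of a side-face seam.

    fg_map: binary ERP foreground map as a list of rows.
    shift:  yaw rotation expressed as a circular column shift.
    """
    width = len(fg_map[0])
    seams = [(width * k) // 8 for k in (1, 3, 5, 7)]
    count = 0
    for row in fg_map:
        for col, is_foreground in enumerate(row):
            if not is_foreground:
                continue
            new_col = (col + shift) % width
            for seam in seams:
                distance = abs(new_col - seam)
                if min(distance, width - distance) <= margin:
                    count += 1
                    break
    return count


def best_yaw_shift(fg_map, margin=1):
    """Brute-force stand-in for the learned predictor: smallest best shift."""
    width = len(fg_map[0])
    return min(range(width), key=lambda s: seam_foreground(fg_map, s, margin))
```

A shift of `s` columns corresponds to a yaw of `360 * s / width` degrees. Exhaustive search is only practical here because a single angle is searched; the learned policy in the paper addresses the general rotation.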

### Robust RGB-D Data Registration Based on Correntropy and Bi-directional Distance

The iterative closest point (ICP) algorithm is the most widely used method for rigid registration of point sets. In this paper, a robust ICP method is proposed to register RGB-D data. First, color information is introduced to build more precise correspondences between two point sets. Second, to enhance the robustness of the algorithm to noise and outliers, the maximum correntropy criterion (MCC) is introduced into the registration framework. Third, to reduce the possibility of the algorithm falling into a local minimum and to deal with the ill-posed issue, a bidirectional distance measurement is added to the proposed algorithm. Finally, the experimental results of point set registration and scene reconstruction demonstrate that the proposed algorithm obtains more precise and robust results than other ICP algorithms.

Teng Wan, Shaoyi Du, Wenting Cui, Qixing Xie, Yuying Liu, Zuoyong Li
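The effect of the maximum correntropy criterion mentioned in the abstract above can be seen in a toy setting. The sketch below is not the paper's algorithm: it estimates only a 2-D translation, assumes correspondences are already known, and omits color and the bidirectional distance. It uses the common half-quadratic scheme in which each pair is reweighted by a Gaussian kernel of its residual; `sigma` and `iters` are assumed parameters.

```python
import math


def mcc_translation(src, dst, sigma=1.0, iters=20):
    """Estimate a 2-D translation from paired points under correntropy.

    Each iteration weights a pair by exp(-r^2 / (2 * sigma^2)), where r is
    its current residual, then takes the weighted mean displacement.
    Pairs with large residuals (outliers) get weights close to zero.
    """
    tx, ty = 0.0, 0.0
    for _ in range(iters):
        sum_x = sum_y = sum_w = 0.0
        for (px, py), (qx, qy) in zip(src, dst):
            rx = qx - (px + tx)
            ry = qy - (py + ty)
            weight = math.exp(-(rx * rx + ry * ry) / (2.0 * sigma * sigma))
            sum_x += weight * (qx - px)
            sum_y += weight * (qy - py)
            sum_w += weight
        if sum_w == 0.0:
            break  # every pair looks like an outlier at this kernel width
        tx, ty = sum_x / sum_w, sum_y / sum_w
    return tx, ty
```

With four inlier pairs displaced by (1, 0.5) and one pair displaced by (48, 48), a plain least-squares mean would return roughly (10.4, 10.0), whereas the reweighted estimate stays at the inlier displacement. The kernel width matters: if the true motion is large relative to `sigma`, all weights can vanish at the start.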

### InSphereNet: A Concise Representation and Classification Method for 3D Object

In this paper, we present InSphereNet, a method for 3D object classification. Unlike previous methods that use points, voxels, or multi-view images as inputs to a deep neural network (DNN), the proposed method constructs a class of more representative features, named infilling spheres, from a signed distance field (SDF). Because of the strong spatial representation ability of infilling spheres, we can not only use far fewer spheres to accomplish the classification task, but also design a lightweight InSphereNet with fewer layers and parameters than previous methods. Experiments on ModelNet40 show that the proposed method achieves higher accuracy than PointNet. In particular, with only a few dozen input spheres or about 100,000 DNN parameters, the accuracy of our method remains very high.

Hui Cao, Haikuan Du, Siyu Zhang, Shen Cai

### 3-D Oral Shape Retrieval Using Registration Algorithm

In this paper, we present a novel 3-D oral shape retrieval method using a correntropy-based registration algorithm. Although traditional registration methods achieve fast matching, their registration accuracy is degraded by noise and outliers. Since 3-D oral models contain a large amount of noise and outliers, registration accuracy may decrease, which in turn lowers retrieval accuracy. Therefore, we introduce correntropy into the rigid registration algorithm to solve this problem. Although noise and outliers are suppressed by the correntropy-based algorithm, they still participate in the registration. For better retrieval, we select the matched point cloud data and use the mean squared error to judge the individual differences between shapes. In this way, accurate retrieval of the oral shape is realized. Experimental results demonstrate that our 3-D shape retrieval algorithm searches successfully across different models, which can help forensic practitioners use the characteristics of biological individuals to search and identify accurately and improve recognition efficiency.

Wenting Cui, Shaoyi Du, Teng Wan, Yan Liu, Yuying Liu, Yang Yang, Qingnan Mou, Mengqi Han, Yu-cheng Guo

### Face Super-Resolution by Learning Multi-view Texture Compensation

Single face image super-resolution (SR) methods using deep neural networks yield decent performance. Because of the varying poses of face images, the multi-view face super-resolution task is more challenging than the single-input task. Multi-view face images contain complementary information from different views. However, it is hard to integrate texture information from multi-view low-resolution (LR) face images. In this paper, we propose a novel face SR method that uses multi-view texture compensation to combine multiple face images into a high-resolution (HR) output image. We use a texture attention mechanism to transfer highly accurate texture compensation information to a fixed view for better visual performance. Experimental results confirm that the proposed neural network outperforms other state-of-the-art face SR algorithms.

Yu Wang, Tao Lu, Ruobo Xu, Yanduo Zhang

### Light Field Salient Object Detection via Hybrid Priors

In this paper, we propose a salient object detection model for light fields via hybrid priors. The proposed model extracts four feature maps: region contrast, background prior, depth prior and surface orientation prior maps. A prior fusion stage is then applied to obtain and optimize the final salient object map. To verify the validity of the proposed model, comprehensive performance evaluation and comparative analysis are conducted on the public datasets LFSD and HFUT-Lytro. Experimental results show that the proposed method is superior to existing light field salient object detection models on both public datasets.

Junlin Zhang, Xu Wang

### Multimedia Analytics Challenges and Opportunities for Creating Interactive Radio Content

Werner Bailer, Maarten Wijnants, Hendrik Lievens, Sandy Claes

### Interactive Search and Exploration in Discussion Forums Using Multimodal Embeddings

In this paper we present a novel interactive multimodal learning system, which facilitates search and exploration in large networks of social multimedia users. It allows the analyst to identify and select users of interest, and to find similar users in an interactive learning setting. Our approach is based on novel multimodal representations of users, words and concepts, which we simultaneously learn by deploying a general-purpose neural embedding model. The usefulness of the approach is evaluated using artificial actors, which simulate user behavior in a relevance feedback scenario. Multiple experiments were conducted in order to evaluate the quality of our multimodal representations and compare different embedding strategies. We demonstrate the capabilities of the proposed approach on a multimedia collection originating from the violent online extremism forum Stormfront, which is particularly interesting due to the high semantic level of the discussions it features.

Iva Gornishka, Stevan Rudinac, Marcel Worring

### An Inverse Mapping with Manifold Alignment for Zero-Shot Learning

Zero-shot learning aims to recognize objects from unseen classes, for which no samples are available at the training stage, by transferring knowledge from seen classes, for which labeled samples are provided. It bridges seen and unseen classes via a shared semantic space such as a class attribute space or a class prototype space. While previous approaches have tried to learn a mapping function from the visual space to the semantic space with different objective functions, we take a different approach and map from the semantic space to the visual space. The inverse mapping predicts the visual feature prototype of each unseen class from its semantic vector for image classification. We also propose a heuristic algorithm to select a high-density set from the data of each seen class. The visual feature prototypes from the high-density sets are more discriminative, which benefits classification. Our approach is evaluated for zero-shot recognition on four benchmark data sets and significantly outperforms the state-of-the-art methods on AWA, SUN, and APY.

Xixun Wu, Binheng Song, Zhixiang Wang, Chun Yuan

### Baseline Analysis of a Conventional and Virtual Reality Lifelog Retrieval System

Continuous media capture via wearable devices is currently one of the most popular methods to establish a comprehensive record of the entirety of an individual’s life experience, referred to in the research community as a lifelog. These vast multimodal corpora include visual and other sensor data and are enriched by content analysis to generate as extensive a record of an individual’s life experience as possible. However, interfacing with such datasets remains an active area of research, and despite the advent of new technology and a plethora of competing mediums for processing digital information, there has been little focus on newly emerging platforms such as virtual reality. In this work, we suggest that the increase in immersion and spatial dimensions provided by virtual reality could provide significant benefits to users when compared to more conventional access methodologies. Hence, we motivate virtual reality as a viable method of exploring multimedia archives (specifically lifelogs) by performing a baseline comparative analysis using a novel application prototype built for the HTC Vive and a conventional prototype built for a standard personal computer.

Aaron Duane, Cathal Gurrin

### An Extensible Framework for Interactive Real-Time Visualizations of Large-Scale Heterogeneous Multimedia Information from Online Sources

This work presents the user-centered design and development of a generic and extensible visualization framework that can be re-used in various scenarios in order to communicate large-scale heterogeneous multimedia information obtained from social media and Web sources through user-friendly interactive visualizations in real time. Using this framework as a basis, two Web-based dashboards demonstrating the visual analytics components of our framework have been developed. Additionally, three indicative use case scenarios where these dashboards can be employed are described. Finally, preliminary user feedback and improvements are discussed, and directions for further development are proposed on the basis of the findings.

Aikaterini Katmada, George Kalpakis, Theodora Tsikrika, Stelios Andreadis, Stefanos Vrochidis, Ioannis Kompatsiaris

### GLENDA: Gynecologic Laparoscopy Endometriosis Dataset

Gynecologic laparoscopy, as a type of minimally invasive surgery (MIS), is performed via a live feed of a patient’s abdomen that surveys the insertion and handling of various instruments for conducting treatment. This kind of surgical intervention not only facilitates a great variety of treatments; the possibility of recording the video streams is also essential for numerous post-surgical activities, such as treatment planning, case documentation and education. Nonetheless, manually analyzing surgical recordings, as is done in current practice, usually proves tediously time-consuming. To improve upon this situation, more sophisticated computer vision and machine learning approaches are being actively developed. Since most such approaches rely heavily on sample data, which is only sparsely available, especially in the medical field, with this work we publish the Gynecologic Laparoscopy ENdometriosis DAtaset (GLENDA), an image dataset containing region-based annotations of a common medical condition named endometriosis, i.e. the dislocation of uterine-like tissue. The dataset is the first of its kind and has been created in collaboration with leading medical experts in the field.

Andreas Leibetseder, Sabrina Kletz, Klaus Schoeffmann, Simon Keckstein, Jörg Keckstein

### Kvasir-SEG: A Segmented Polyp Dataset

Pixel-wise image segmentation is a highly demanding task in medical image analysis. In practice, it is difficult to find annotated medical images with corresponding segmentation masks. In this paper, we present Kvasir-SEG: an open-access dataset of gastrointestinal polyp images and corresponding segmentation masks, manually annotated by a medical doctor and then verified by an experienced gastroenterologist. Moreover, we generated the bounding boxes of the polyp regions with the help of the segmentation masks. We demonstrate the use of our dataset with a traditional segmentation approach and a modern deep-learning-based Convolutional Neural Network (CNN) approach. The dataset will be of value for researchers to reproduce results and compare methods. By adding segmentation masks to the Kvasir dataset, which only provides frame-wise annotations, we enable multimedia and computer vision researchers to contribute to the field of polyp segmentation and automatic analysis of colonoscopy images.

Debesh Jha, Pia H. Smedsrud, Michael A. Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, Håvard D. Johansen

### Rethinking the Test Collection Methodology for Personal Self-tracking Data

While vast volumes of personal data are being gathered daily by individuals, the MMM community has not really been tackling the challenge of developing novel retrieval algorithms for this data, due to the challenges of getting access to the data in the first place. While initial efforts have taken place on a small scale, it is our conjecture that a new evaluation paradigm is required in order to make progress in analysing, modeling and retrieving from personal data archives. In this position paper, we propose a new model of Evaluation-as-a-Service that re-imagines the test collection methodology for personal multimedia data in order to address the many challenges of releasing test collections of personal multimedia data.

Frank Hopfgartner, Cathal Gurrin, Hideo Joho

### Experiences and Insights from the Collection of a Novel Multimedia EEG Dataset

There is a growing interest in utilising novel signal sources such as EEG (Electroencephalography) in multimedia research. When using such signals, subtle limitations are often not readily apparent without significant domain expertise. Multimedia research outputs incorporating EEG signals can fail to be replicated when only minor modifications have been made to an experiment or seemingly unimportant (or unstated) details are changed. This can lead to overoptimistic or overpessimistic viewpoints on the potential real-world utility of these signals in multimedia research activities. This paper describes an EEG/MM dataset and presents a summary of distilled experiences and knowledge gained during the preparation (and utilisation) of the dataset that supported a collaborative neural-image labelling benchmarking task. The goal of this task was to collaboratively identify machine learning approaches that would support the use of EEG signals in areas such as image labelling and multimedia modeling or retrieval. The contributions of this paper are as follows: a template experimental paradigm is proposed (along with datasets and a baseline system) upon which researchers can explore multimedia image labelling using a brain-computer interface; learnings regarding commonly encountered issues (and useful signals) when conducting research that utilises EEG in multimedia contexts are provided; and finally, insights are shared on how an EEG dataset was used to support a collaborative neural-image labelling benchmarking task and on the valuable experiences gained.

Graham Healy, Zhengwei Wang, Tomas Ward, Alan Smeaton, Cathal Gurrin

### Relation Modeling with Graph Convolutional Networks for Facial Action Unit Detection

Most existing AU detection works that consider AU relationships rely on probabilistic graphical models with manually extracted features. This paper proposes an end-to-end deep learning framework for facial AU detection that uses a graph convolutional network (GCN) for AU relation modeling, which has not been explored before. In particular, AU-related regions are extracted first, and latent representations rich in AU information are learned through an auto-encoder. Each latent representation vector is then fed into the GCN as a node, and the connection mode of the GCN is determined by the relationships between AUs. Finally, the assembled features updated through the GCN are concatenated for AU detection. Extensive experiments on the BP4D and DISFA benchmarks demonstrate that our framework significantly outperforms the state-of-the-art methods for facial AU detection. The proposed framework is also validated through a series of ablation studies.

Zhilei Liu, Jiahui Dong, Cuicui Zhang, Longbiao Wang, Jianwu Dang

### Enhanced Gaze Following via Object Detection and Human Pose Estimation

The aim of gaze following is to estimate the gaze direction, which is useful for understanding human behaviour in various applications. However, it is still an open problem that has not been fully studied. In this paper, we present a novel framework for the gaze following problem, in which both the front/side face case and the back face case are taken into account. For the front/side face case, head pose estimation is applied to estimate the gaze, and object detection is then used to refine the gaze direction by selecting the object that intersects with the gaze within a certain range. For the back face case, a deep neural network using human pose information is proposed for gaze estimation. Experiments are carried out to demonstrate the superiority of the proposed method over the state-of-the-art method.

Jian Guan, Liming Yin, Jianguo Sun, Shuhan Qi, Xuan Wang, Qing Liao
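The refinement step for the front/side face case in the abstract above (picking the detected object that the estimated gaze intersects "within a certain range") can be sketched in 2-D image coordinates. The abstract does not specify how the range is defined, so the angular tolerance `max_angle_deg`, the use of object centers, and the function name are all assumptions made for illustration.

```python
import math


def refine_gaze(head, gaze_dir, object_centers, max_angle_deg=15.0):
    """Return the index of the object best aligned with the gaze, or None.

    head:            (x, y) head position.
    gaze_dir:        (dx, dy) initial gaze direction from head pose.
    object_centers:  list of (x, y) centers of detected objects.
    An object qualifies if the angle between the gaze and the head-to-object
    direction is at most max_angle_deg; the smallest angle wins.
    """
    gx, gy = gaze_dir
    gaze_norm = math.hypot(gx, gy)
    if gaze_norm == 0.0:
        return None
    best_index, best_angle = None, max_angle_deg
    for index, (ox, oy) in enumerate(object_centers):
        dx, dy = ox - head[0], oy - head[1]
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            continue
        cos_angle = (dx * gx + dy * gy) / (dist * gaze_norm)
        cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
        angle = math.degrees(math.acos(cos_angle))
        if angle <= best_angle:
            best_index, best_angle = index, angle
    return best_index
```

When an index is returned, the refined gaze direction is simply the vector from the head to that object's center; when `None` is returned, the head-pose estimate is kept unchanged.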

### Region Based Adversarial Synthesis of Facial Action Units

Facial expression synthesis or editing has recently received increasing attention in the field of affective computing and facial expression modeling. However, most existing facial expression synthesis works are limited by their need for paired training data, low resolution, damage to identity information, and so on. To address these limitations, this paper introduces a novel Action Unit (AU) level facial expression synthesis method called Local Attentive Conditional Generative Adversarial Network (LAC-GAN), based on facial action unit annotations. Given desired AU labels, LAC-GAN uses local AU regional rules to control the status of each AU and an attentive mechanism to combine several of them into whole photo-realistic facial expressions or arbitrary facial expressions. In addition, unpaired training data is used in our proposed method to train the manipulation module with the corresponding AU labels, which learns a mapping between a facial expression manifold. Extensive qualitative and quantitative evaluations are conducted on the commonly used BP4D dataset to verify the effectiveness of our proposed AU synthesis method.

Zhilei Liu, Diyi Liu, Yunpeng Wu

### Facial Expression Restoration Based on Improved Graph Convolutional Networks

Facial expression analysis in the wild is challenging when the facial image has low resolution or is partially occluded. Considering the correlations among different local facial regions under different facial expressions, this paper proposes a novel facial expression restoration method based on a generative adversarial network that integrates an improved graph convolutional network (IGCN) and a region relation modeling block (RRMB). Unlike conventional graph convolutional networks, which take vectors as input features, IGCN can use tensors of face patches as inputs, which better retains the structural information of the face patches. The proposed RRMB is designed to address facial generative tasks, including inpainting and super-resolution, together with facial action unit detection, with the aim of restoring the facial expression of the ground truth. Extensive experiments conducted on the BP4D and DISFA benchmarks demonstrate the effectiveness of our proposed method through quantitative and qualitative evaluations.

Zhilei Liu, Le Li, Yunpeng Wu, Cuicui Zhang

### Global Affective Video Content Regression Based on Complementary Audio-Visual Features

In this paper, we propose a new framework for global affective video content regression with five complementary audio-visual features. For the audio modality, we select the global audio feature eGeMAPS and two deep features, SoundNet and VGGish. For the visual modality, the key frames of the original images and those of the optical flow images are both used to extract VGG-19 features with fine-tuned models, in order to represent the original visual cues in conjunction with motion information. In the experiments, we evaluate the selected audio and visual features on the dataset of the Emotional Impact of Movies Task 2016 (EIMT16), and compare our results with those of competitive teams in EIMT16 and the state-of-the-art method. The experimental results show that the fusion of the five features achieves better regression results in both the arousal and valence dimensions, indicating that the selected five features are complementary to each other across the audio-visual modalities. Furthermore, the proposed approach achieves better regression results than the state-of-the-art method on both evaluation metrics, MSE and PCC, in the arousal dimension, and comparable MSE results in the valence dimension. Although our approach obtains a slightly lower PCC result than the state-of-the-art method in the valence dimension, the fused feature vectors used in our framework have a much lower dimensionality, 1752 in total, only five thousandths of the feature dimensions of the state-of-the-art method, which greatly reduces the memory requirements and computational burden.

Xiaona Guo, Wei Zhong, Long Ye, Li Fang, Yan Heng, Qin Zhang
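The two evaluation metrics named in the abstract above, mean squared error (MSE) and Pearson correlation coefficient (PCC), have standard definitions that can be written out in a few lines. The sketch below follows those textbook definitions; it is not taken from the paper's evaluation code, and the function names are illustrative.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between ground-truth and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pcc(y_true, y_pred):
    """Pearson correlation coefficient.

    Raises ZeroDivisionError if either sequence is constant, because the
    correlation is undefined in that case.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)
```

The two metrics capture different things, which is why a method can win on one and lose on the other: predictions `[2, 4, 6]` for targets `[1, 2, 3]` are perfectly correlated (PCC of 1) yet have a nonzero MSE of 14/3.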

### Studying Public Medical Images from the Open Access Literature and Social Networks for Model Training and Knowledge Extraction

Henning Müller, Vincent Andrearczyk, Oscar Jimenez del Toro, Anjani Dhrangadhariya, Roger Schaer, Manfredo Atzori

### AttenNet: Deep Attention Based Retinal Disease Classification in OCT Images

Optical coherence tomography (OCT) is becoming the standard imaging modality for diagnosing retinal diseases and assessing their progression. However, manual evaluation of the volumetric scan is time-consuming and expensive, and signs of early disease are easy to miss. In this paper, we present an attention-based deep learning method for retinal disease classification in OCT images, which can assist large-scale screening or provide diagnosis recommendations for an ophthalmologist. First, according to the unique characteristics of retinal OCT images, we design a customized pre-processing method to improve image quality. Second, in order to guide the network optimization more effectively, a specially designed attention model, which pays more attention to critical regions containing pathological anomalies, is integrated into a typical deep learning network. We evaluate our proposed method on two data sets, and the results consistently show that it outperforms the state-of-the-art methods. We report an overall four-class accuracy of 97.4%, a two-class sensitivity of 100.0%, and a two-class specificity of 100.0% on a public data set shared by Zhang et al. with 1,000 testing B-scans in four disease classes. Compared to their work, our method improves these numbers by 0.8%, 2.2%, and 2.6%, respectively.

Jun Wu, Yao Zhang, Jie Wang, Jianchun Zhao, Dayong Ding, Ningjiang Chen, Lingling Wang, Xuan Chen, Chunhui Jiang, Xuan Zou, Xing Liu, Hui Xiao, Yuan Tian, Zongjiang Shang, Kaiwei Wang, Xirong Li, Gang Yang, Jianping Fan
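For readers unfamiliar with the figures quoted in the abstract above, the following sketch shows how accuracy, sensitivity and specificity are computed from binary labels using their standard definitions. How the paper groups its four disease classes into two for the sensitivity and specificity figures is not stated in the abstract, so that grouping is left to the caller; the function name is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity for a two-class problem.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    Raises ZeroDivisionError if a class is absent from y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

A sensitivity of 100% means no diseased scan was labeled healthy, and a specificity of 100% means no healthy scan was labeled diseased; the four-class accuracy can still be below 100% because confusions between disease classes do not affect either two-class figure.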

### NOVA: A Tool for Explanatory Multimodal Behavior Analysis and Its Application to Psychotherapy

In this paper, we explore the benefits of our next-generation annotation and analysis tool NOVA in the domain of psychotherapy. The NOVA tool has been developed, tested and applied in behaviour studies for several years, and psychotherapy sessions offer a great way to expand its areas of application into a challenging yet promising field. In such scenarios, interactions with patients are often rated by questionnaires and the therapist’s subjective rating, yet a qualitative analysis of the patient’s non-verbal behaviours can only be performed in a limited way, as this is very expensive and time-consuming. A main aspect of NOVA is the possibility of applying semi-supervised active learning, in which Machine Learning techniques are already used during the annotation process to pre-label data automatically. Furthermore, NOVA provides therapists with a confidence value for the automatically predicted annotations. This way, non-ML experts can also judge whether they can trust their ML models for the problem at hand.

Tobias Baur, Sina Clausen, Alexander Heimerl, Florian Lingenfelser, Wolfgang Lutz, Elisabeth André

### Instrument Recognition in Laparoscopy for Technical Skill Assessment

Laparoscopic skill training and evaluation, as well as identifying technical errors in surgical procedures, have become important aspects of Surgical Quality Assessment (SQA). Typically performed in a manual, time-consuming and effortful post-surgical process, evaluating technical skills largely involves assessing proper instrument handling, as improper handling is the main cause of these types of errors. Therefore, when attempting to improve upon this situation using computer vision approaches, the automatic identification of instruments in laparoscopy videos is the very first step toward a semi-automatic assessment procedure. In this work we summarize existing methodologies for instrument recognition and propose a state-of-the-art instance segmentation approach. As a first experiment in the domain of gynecology, our approach is able to segment instruments well, but a much higher precision will be required, since this early step is critical before attempting any kind of skill recognition.

Sabrina Kletz, Klaus Schoeffmann, Andreas Leibetseder, Jenny Benois-Pineau, Heinrich Husslein

### Real-Time Recognition of Daily Actions Based on 3D Joint Movements and Fisher Encoding

Recognition of daily actions is an essential part of Ambient Assisted Living (AAL) applications and is still not fully solved. In this work, we propose a novel framework for the recognition of actions of daily living from depth videos. The framework is based on low-level human pose movement descriptors extracted from 3D joint trajectories, as well as differential values that encode speed and acceleration information. The joints are detected using a depth sensor. The low-level descriptors are then aggregated into discriminative high-level action representations by modeling prototype pose movements with Gaussian mixtures and then using a Fisher encoding schema. The resulting Fisher vectors are suitable for training linear SVM classifiers to recognize actions in pre-segmented video clips, alleviating the need for additional parameter search with non-linear kernels or neural network tuning. Experimental evaluation on two well-known RGB-D action datasets reveals that the proposed framework achieves close to state-of-the-art performance whilst maintaining high processing speeds.

Panagiotis Giannakeris, Georgios Meditskos, Konstantinos Avgerinakis, Stefanos Vrochidis, Ioannis Kompatsiaris
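The "differential values that encode speed and acceleration" mentioned in the abstract above can be approximated with finite differences over a joint's 3D trajectory. The sketch below is one plausible reading, not the paper's exact descriptor: it assumes a constant frame rate, uses first-order forward differences, and leaves out the Gaussian mixture and Fisher encoding stages. The function names are hypothetical.

```python
def differences(sequence):
    """Frame-to-frame differences of a sequence of equal-length tuples."""
    return [
        tuple(b - a for a, b in zip(prev, curr))
        for prev, curr in zip(sequence, sequence[1:])
    ]


def motion_descriptors(trajectory):
    """Return (velocity, acceleration) for one joint.

    trajectory: list of (x, y, z) positions, one per frame.
    Velocity has len(trajectory) - 1 entries and acceleration has
    len(trajectory) - 2, both in per-frame units.
    """
    velocity = differences(trajectory)
    acceleration = differences(velocity)
    return velocity, acceleration
```

For a joint moving along x through positions 0, 1, 4, 9, the velocities are 1, 3, 5 and the accelerations are a constant 2 per frame squared. Dividing by the frame interval (once for velocity, twice for acceleration) would convert these into physical units.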

### Model-Based and Class-Based Fusion of Multisensor Data

In recent years, the advancement of technology, the constantly aging population and developments in medicine have resulted in the creation of numerous ambient assisted living systems. Most of these systems consist of a variety of sensors that provide information about the health condition of patients and their activities, and that also create alerts in case of harmful events. Successfully combining and utilizing all of this multimodal information is an important research topic. This paper compares model-based and class-based fusion for recognizing activities by combining data from multiple sensors or from sensors at different body placements. More specifically, we tested the performance of three fusion methods: weighted accuracy, averaging and a recently introduced detection-rate-based fusion method. Weighted accuracy and the detection-rate-based fusion achieved the best performance in most of the experiments.

Athina Tsanousa, Angelos Chatzimichail, Georgios Meditskos, Stefanos Vrochidis, Ioannis Kompatsiaris
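Two of the fusion methods named in the abstract above, averaging and weighted accuracy, can be illustrated as late fusion of per-sensor class probabilities. The sketch below assumes that "weighted accuracy" means weighting each sensor's classifier by its overall accuracy on held-out data; the abstract does not spell this out, and the detection-rate-based method is not covered. Function names are hypothetical.

```python
def fuse(prob_vectors, weights=None):
    """Fuse per-sensor class probability vectors.

    prob_vectors: one probability vector per sensor/classifier.
    weights:      optional per-sensor weights (e.g. validation accuracies,
                  an assumed choice); None gives plain averaging.
    """
    if weights is None:
        weights = [1.0] * len(prob_vectors)
    total = sum(weights)
    num_classes = len(prob_vectors[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_vectors)) / total
        for c in range(num_classes)
    ]


def predict(fused):
    """Index of the class with the highest fused score."""
    return max(range(len(fused)), key=lambda c: fused[c])
```

The choice of weights can change the decision. With sensor outputs `[0.6, 0.4]` and `[0.3, 0.7]`, plain averaging picks class 1, while weighting the first sensor at 0.9 and the second at 0.3 picks class 0, because the more reliable sensor dominates.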

### Evaluating the Generalization Performance of Instrument Classification in Cataract Surgery Videos

In the field of ophthalmic surgery, many clinicians nowadays record their microscopic procedures with a video camera and use the recorded footage for later purposes, such as forensics, teaching, or training. However, in order to use the video material efficiently after surgery, the video content needs to be analyzed automatically. Important semantic content to be analyzed and indexed in these short videos are the operation instruments, since they provide an indication of the corresponding operation phase and surgical action. Related work has already shown that it is possible to accurately detect instruments in cataract surgery videos. However, the underlying dataset (from the CATARACTS challenge) has very good visual quality, which does not reflect the typical quality of videos acquired in general hospitals. In this paper, we therefore analyze the generalization performance of deep learning models for instrument recognition under a change of dataset. More precisely, we trained models such as ResNet-50, Inception v3 and NASNet Mobile on a dataset of high visual quality (CATARACTS) and tested them on another dataset with low visual quality (Cataract-101), and vice versa. Our results show that the generalizability is surprisingly low in general, but slightly worse for the model trained on the high-quality dataset.

Natalia Sokolova, Klaus Schoeffmann, Mario Taschwer, Doris Putzgruber-Adamitsch, Yosuf El-Shabrawi

### Compact Position-Aware Attention Network for Image Semantic Segmentation

In intelligent multimedia security, automatic image semantic segmentation is a fundamental research task, which facilitates accurately recognizing important targets in multimedia data and performing subsequent security analysis. Most existing semantic segmentation methods have made remarkable progress by modeling interactions between image pixels based on fully convolutional networks (FCN). However, they neglect the fact that semantic features extracted by FCN have a poor ability to represent original image details, which makes it hard for interaction-modeling methods to attend to truly relevant information within spatially adjacent regions. To tackle this problem, we take position information into account and adaptively model position relevance between pixels to enhance local consistency in segmentation results. We propose a novel compact position-aware attention network (CPANet), containing a spatial augmented attention module and a channel augmented attention module, to simultaneously learn semantic relevance and position relevance between image pixels in a mutually reinforced way. In the spatial augmented module, we introduce relative height and width distance to model position relevance based on the self-attention mechanism. In the channel augmented module, we exploit bilinear pooling to model compact correlation between pixels at any position across any channels. Our proposed CPANet can jointly learn accurate position and semantic information of image pixels in a compact manner to improve semantic segmentation performance. Experimental results demonstrate that our approach achieves state-of-the-art performance on the Cityscapes dataset.

Yajun Xu, Zhendong Mao, Peng Zhang, Bin Wang

### Law Is Order: Protecting Multimedia Network Transmission by Game Theory and Mechanism Design

Nowadays, the computer network serves as the most important medium for transmitting multimedia. Correspondingly, the orderliness of the network protects multimedia transmission. However, due to the Packet Switching design, the network can only provide best-effort service, in which multimedia applications compete for network resources. In this lawless competition, the applications are driven by self-interest to be greedy and deceptive. As a result, the transmission of multimedia applications becomes disorderly and inefficient, and the protection of multimedia network transmission is lost. In this paper, we first investigate the behaviors of multimedia applications with Game Theory, and characterize the disorderly transmission as a Prisoner’s Dilemma. The lack of law leads to disorderly transmission in the multimedia network. Then, we investigate the relationship between an application’s media-attribute and its required transmission service, and resolve this Prisoner’s Dilemma with Mechanism Design. Specifically, a novel Media-attribute Switching (MAS) is proposed, where the network allocates the transmission resources according to the application’s claimed media-attribute. In MAS, differentiated services are provided to applications with different claimed media-attributes. We design MAS so that an honest application gets compatible service while a deceptive application gets incompatible service. Therefore, MAS can provide incentives for multimedia applications to label their data media-attributes honestly and can allocate the network resources according to their attributes, thus protecting multimedia network transmission. Both the theoretical analysis and the experimental comparison support the protection that MAS offers to the multimedia network.

Chuanbin Liu, Youliang Tian, Hongtao Xie
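
The Prisoner's Dilemma framing in the abstract above can be made concrete with a small sketch that enumerates pure-strategy Nash equilibria of a two-player game. The payoff numbers and the action names `honest` and `greedy` are hypothetical and chosen only to have the classic dilemma structure; they are not taken from the paper, and the MAS mechanism itself is not modeled here.

```python
from itertools import product


def pure_nash_equilibria(payoffs):
    """Return all action profiles from which no player gains by deviating alone.

    payoffs maps (row_action, col_action) to (row_utility, col_utility).
    """
    row_actions = sorted({a for a, _ in payoffs})
    col_actions = sorted({b for _, b in payoffs})
    equilibria = []
    for a, b in product(row_actions, col_actions):
        u_row, u_col = payoffs[(a, b)]
        # Row player cannot do better by switching action while column stays put.
        row_ok = all(payoffs[(a2, b)][0] <= u_row for a2 in row_actions)
        # Column player cannot do better by switching action while row stays put.
        col_ok = all(payoffs[(a, b2)][1] <= u_col for b2 in col_actions)
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


# Hypothetical payoffs for two applications sharing a best-effort network.
BEST_EFFORT_GAME = {
    ("honest", "honest"): (3, 3),
    ("honest", "greedy"): (0, 5),
    ("greedy", "honest"): (5, 0),
    ("greedy", "greedy"): (1, 1),
}

equilibria = pure_nash_equilibria(BEST_EFFORT_GAME)
```

Under these illustrative payoffs the only equilibrium is mutual greed, even though both applications would be better off if both were honest. That gap is what a mechanism-design approach aims to close by changing the payoffs.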

### Rational Delegation Computing Using Information Theory and Game Theory Approach

Delegation computing is a calculation protocol between non-cooperative participants, and its results are influenced by the participants’ choice of behavior. The goal of this paper is to solve the problem of high communication overhead in traditional delegation computing schemes. Combining the advantages of information theory and game theory, we propose a rational delegation computing scheme, which guarantees the correctness of the calculation results through the participant utility function. First, by analyzing the participant behavior strategy, we design the game model, which includes the participant set, information set, behavior strategy set and utility function. Second, by combining the Nash equilibrium and the channel capacity limit in the game model, we construct a rational delegation computing scheme. Finally, we analyze and prove the scheme. When both the delegation party and the computing party choose the honesty strategy, their utility reaches the maximum, that is, the game as a whole reaches the Nash equilibrium state, and the calculation efficiency is also improved.

Qiuxian Li, Youliang Tian

### Multi-hop Interactive Cross-Modal Retrieval

Conventional representation learning based cross-modal retrieval approaches always represent the sentence with a global embedding feature, which easily neglects the local correlations between objects in the image and phrases in the sentence. In this paper, we present a novel Multi-hop Interactive Cross-modal Retrieval Model (MICRM), which interactively exploits the local correlations between images and words. We design a multi-hop interactive module to infer the high-order relevance between the image and the sentence. Experimental results on two benchmark datasets, MS-COCO and Flickr30K, demonstrate that our multi-hop interactive model performs significantly better than several competitive cross-modal retrieval methods.

Xuecheng Ning, Xiaoshan Yang, Changsheng Xu

### Browsing Visual Sentiment Datasets Using Psycholinguistic Groundings

Recent multimedia applications commonly use text and imagery from Social Media for tasks related to sentiment research. As such, there are various image datasets for sentiment research for popular classification tasks. However, there has been little research regarding the relationship between the sentiment of images and their annotations from a multi-modal standpoint. In this demonstration, we built a tool to visualize psycholinguistic groundings for a sentiment dataset. For each image, individual psycholinguistic ratings are computed from the image’s metadata. A sentiment-psycholinguistic spatial embedding is computed to show a clustering of images across different classes close to human perception. Our interactive browsing tool can visualize the data in various ways, highlighting different psycholinguistic groundings with heatmaps.

Marc A. Kastner, Ichiro Ide, Yasutomo Kawanishi, Takatsugu Hirayama, Daisuke Deguchi, Hiroshi Murase

### Framework Design for Multiplayer Motion Sensing Game in Mixture Reality

Mixed reality (MR) is becoming popular, but its application in entertainment is still limited due to the lack of intuitive and varied interactions between the user and other players. In this demonstration, we propose an MR multiplayer game framework, which allows the player to interact directly with other players through intuitive body postures and actions. Moreover, a body depth approximation method is designed to decrease the complexity of virtual content rendering without affecting the immersive fidelity while playing the game. Our framework uses deep learning models to achieve motion sensing, and a multiplayer MR interaction game containing a variety of actions is designed to validate the feasibility of the proposed framework.

Chih-Yao Chang, Bo-I Chuang, Chi-Chun Hsia, Wen-Cheng Chen, Min-Chun Hu

### Lyrics-Conditioned Neural Melody Generation

Generating melody from lyrics to compose a song has been a very interesting research topic in the area of artificial intelligence and music, which tries to model the generative relationship between lyrics and melody. In this demonstration paper, by exploiting a large music dataset with 12,197 pairs of English lyrics and melodies, we develop a lyrics-conditioned AI neural melody generation system that consists of three components: a lyrics encoder network, a melody generation network, and a MIDI sequence tuner. Most importantly, a Long Short-Term Memory (LSTM)-based melody generator conditioned on lyrics is trained by applying a generative adversarial network (GAN) to generate a pleasing and meaningful melody matching the given lyrics. Our demonstration illustrates the effectiveness of the proposed melody generation system.

Yi Yu, Florian Harscoët, Simon Canales, Gurunath Reddy M, Suhua Tang, Junjun Jiang

### A Web-Based Visualization Tool for 3D Spatial Coverage Measurement of Aerial Images

Drones are becoming popular in different domains, from personal to professional use. Drones are usually equipped with high-resolution cameras in addition to various sensors (e.g., GPS, accelerometers, and gyroscopes). Therefore, aerial images captured by drones are associated with spatial metadata that describe the spatial extent per image, referred to as aerial field-of-view (Aerial FOV). Aerial FOVs can be utilized to represent the visual coverage of a particular region with respect to various viewing directions at fine-grained levels (i.e., small cells composing the region). In this demo paper, we introduce a web tool for interactive visualization of a collection of aerial fields-of-view and instant measurement of their spatial coverage over a given 3D space. This tool is useful for several real-world monitoring applications that are based on aerial images to simulate the 3D spatial coverage of the collected visual data in order to analyze their adequacy.

Abdullah Alfarrarjeh, Zeyu Ma, Seon Ho Kim, Yeonsoo Park, Cyrus Shahabi

### An Attention Based Speaker-Independent Audio-Visual Deep Learning Model for Speech Enhancement

Speech enhancement aims to improve speech quality in noisy environments. While most speech enhancement methods use only audio data as input, adding video information can achieve better results. In this paper, we present an attention-based speaker-independent audio-visual deep learning model for single-channel speech enhancement. We apply both time-wise attention and spatial attention in the video feature extraction module to focus on more important features. Audio features and video features are then concatenated along the time dimension as the audio-visual features. The proposed video feature extraction module can be spliced onto the audio-only model without extensive modifications. The results show that the proposed method can achieve better results than recent audio-visual speech enhancement methods.

Zhongbo Sun, Yannan Wang, Li Cao

### DIME: An Online Tool for the Visual Comparison of Cross-modal Retrieval Models

Cross-modal retrieval relies on accurate models to retrieve relevant results for queries across modalities such as image, text, and video. In this paper, we build upon previous work by tackling the difficulty of quickly evaluating models both quantitatively and qualitatively. We present DIME (Dataset, Index, Model, Embedding), a modality-agnostic tool that handles multimodal datasets, trained models, and data preprocessors to support straightforward model comparison with a web browser graphical user interface. DIME inherently supports building modality-agnostic queryable indexes and extraction of relevant feature embeddings, and thus effectively doubles as an efficient cross-modal tool to explore and search through datasets.

Tony Zhao, Jaeyoung Choi, Gerald Friedland

### Real-Time Demonstration of Personal Audio and 3D Audio Rendering Using Line Array Systems

Control of sound fields using loudspeaker arrays has been attempted in many practical areas, such as 3D audio, active noise control, and personal audio. In this work, we demonstrate two real-time sound field control systems involving a line array of loudspeakers. The first one, a personal audio system, aims to reproduce two independent sound zones with different audio programs at the same time. By suppressing acoustic interference between two sound zones, the personal audio system allows users at different locations to enjoy independent sounds. In the second demonstration, active control of a spatial audio scene is presented. It has been found that the interaction between the radiation from a sound source and the surrounding environment is linked with many perceptual cues of spaciousness. In particular, the perceived stage width and distance are strongly related to the interaural cross-correlation and direct-to-reverberation ratio, which can be easily manipulated by changing the directivity of a loudspeaker array. The smooth transition of spaciousness is demonstrated by changing the shapes of multiple beam patterns radiated from the line array.

Jung-Woo Choi

### A CNN-Based Multi-scale Super-Resolution Architecture on FPGA for 4K/8K UHD Applications

In this paper, based on our previous work, we present a multi-scale super-resolution (SR) hardware (HW) architecture using a convolutional neural network (CNN), where up-scaling factors of 2, 3 and 4 are supported. In our dedicated multi-scale CNN-based SR HW, low-resolution (LR) input frames are processed line-by-line, and the number of convolutional filter parameters is significantly reduced by incorporating depth-wise separable convolutions with residual connections. For 3× and 4× up-scaling, the number of channels of the point-wise convolution layer before the pixel-shuffle layer is set to 9 and 16, respectively. Additionally, we propose an integrated timing generator that supports 3× and 4× up-scaling. For efficient HW implementation, we use a simple and effective quantization method with minimal peak signal-to-noise ratio (PSNR) degradation. Also, we propose a compression method to efficiently store intermediate feature map data to reduce the number of line memories used in HW. Our CNN-based SR HW implementation on the FPGA can generate 4K ultra high-definition (UHD) frames of higher PSNR at 60 fps, which have higher visual quality compared to conventional CNN-based SR methods that were trained and tested in software. The resources in our CNN-based SR HW can be shared across the up-scaling factors of 2, 3 and 4, so that it can be implemented to generate 8K UHD frames from 2K FHD input frames.

Yongwoo Kim, Jae-Seok Choi, Jaehyup Lee, Munchurl Kim
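
The channel counts of 9 and 16 quoted in the abstract above match what a pixel-shuffle (depth-to-space) layer needs to produce one output channel at scale r, namely r × r input channels. The following pure-Python sketch shows that rearrangement on nested lists. It assumes a single output channel and the common channel ordering in which channel index `dy * r + dx` fills sub-pixel offset `(dy, dx)`; the function name is hypothetical and this is not the authors' hardware implementation.

```python
def pixel_shuffle(channels, r):
    """Rearrange r*r low-resolution planes (each H x W) into one (H*r) x (W*r) plane."""
    if len(channels) != r * r:
        raise ValueError("expected r*r input channels for one output channel")
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0] * (w * r) for _ in range(h * r)]
    for c, plane in enumerate(channels):
        dy, dx = divmod(c, r)  # sub-pixel offset served by this channel
        for y in range(h):
            for x in range(w):
                out[y * r + dy][x * r + dx] = plane[y][x]
    return out


# 2x example: four 1x1 planes become one 2x2 block.
up2 = pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], r=2)

# 3x example: nine 1x2 planes (channel c filled with value c) become a 3x6 plane.
up3 = pixel_shuffle([[[c, c]] for c in range(9)], r=3)
```

For 4× up-scaling the same function would take 16 planes, which is consistent with the channel count stated in the abstract.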

### Effective Utilization of Hybrid Residual Modules in Deep Neural Networks for Super Resolution

Recently, Single-Image Super-Resolution (SISR) has attracted many researchers due to its numerous real-life applications in multiple domains. This paper focuses on efficient solutions for SISR with Hybrid Residual Modules (HRM). The proposed HRM allows the deep neural network to reconstruct very high quality super-resolved images with much lower computation compared to conventional SISR methods. In this paper, we first describe the technical details of our HRM in SISR and then introduce interesting applications of the proposed SISR method, such as surveillance camera systems, medical imaging, and astronomical imaging.

Abdul Muqeet, Sung-Ho Bae

### diveXplore 4.0: The ITEC Deep Interactive Video Exploration System at VBS2020

Having participated in the three most recent iterations of the annual Video Browser Showdown (VBS2017–VBS2019) as well as in both newly established Lifelog Search Challenges (LSC2018–LSC2019), the actively developed Deep Interactive Video Exploration (diveXplore) system combines a variety of content-based video analysis and processing strategies for interactively exploring large video archives. The system provides a user with browseable self-organizing feature maps, color filtering, semantic concept search utilizing deep neural networks as well as hand-drawn sketch search. The most recent version improves upon its predecessors by unifying deep concepts for facilitating and speeding up search, while significantly refactoring the user interface for increasing the overall system performance.

Andreas Leibetseder, Bernd Münzer, Jürgen Primus, Sabrina Kletz, Klaus Schoeffmann

### Combining Boolean and Multimedia Retrieval in vitrivr for Large-Scale Video Search

This paper presents the most recent additions to the vitrivr multimedia retrieval stack made in preparation for participation in the 9th Video Browser Showdown (VBS) in 2020. In addition to refining existing functionality and adding support for classical Boolean queries and metadata filters, we also completely replaced our storage engine $$\textsf {ADAM}_{pro}$$ with a new database called Cottontail DB. Furthermore, we have added support for scoring based on the temporal ordering of multiple video segments with respect to a query formulated by the user. Finally, we have also added a new object detection module based on Faster-RCNN and use the generated features for object instance search.

Loris Sauter, Mahnaz Amiri Parian, Ralph Gasser, Silvan Heller, Luca Rossetto, Heiko Schuldt

### An Interactive Video Search Platform for Multi-modal Retrieval with Advanced Concepts

The previous version of our retrieval system has shown some significant results in some retrieval tasks such as Lifelog’s moment retrieval tasks. In this paper, we adapt our platform to the Video Browser Showdown’s KIS and AVS tasks and present how our system performs in video search tasks. In addition to the smart features in our retrieval system that take advantage of the provided analysis data, we enhance the data with object color detection by employing Mask R-CNN and clustering. In this version of our search system, we try to extract the location information of the entities appearing in the videos and aim to exploit the spatial relationship between these entities. We also focus on designing efficient user interaction and a high-performance way to transfer data in the system to minimize the retrieval time.

Nguyen-Khang Le, Dieu-Hien Nguyen, Minh-Triet Tran

### VIREO @ Video Browser Showdown 2020

In this paper, we present the features implemented in the 4th version of the VIREO Video Search System (VIREO-VSS). In this version, we propose a sketch-based retrieval model, which allows the user to specify a video scene with objects and their basic properties, including color, size, and location. We further utilize the temporal relation between video frames to strengthen this retrieval model. For the text-based retrieval module, we supply speech and on-screen text for free-text search and upgrade the concept bank for concept search. The search interface is also redesigned with the novice user in mind. With the introduced system, we expect that VIREO-VSS can be a competitive participant in the Video Browser Showdown (VBS) 2020.

Phuong Anh Nguyen, Jiaxin Wu, Chong-Wah Ngo, Danny Francis, Benoit Huet

### VERGE in VBS 2020

This paper demonstrates VERGE, an interactive video retrieval engine for browsing a collection of images or videos and searching for specific content. The engine integrates a multitude of retrieval methodologies that include visual and textual searches and further capabilities such as fusion and reranking. All search options and results appear in a web application that aims at a friendly user experience.

Stelios Andreadis, Anastasia Moumtzidou, Konstantinos Apostolidis, Konstantinos Gkountakos, Damianos Galanopoulos, Emmanouil Michail, Ilias Gialampoukidis, Stefanos Vrochidis, Vasileios Mezaris, Ioannis Kompatsiaris

### VIRET at Video Browser Showdown 2020

During the last three years, the most successful systems at the Video Browser Showdown employed effective retrieval models where raw video data are automatically preprocessed in advance to extract semantic or low-level features of selected frames or shots. This enables users to express their search intents in the form of keywords, a sketch, a query example, or a combination of these. In this paper, we present new extensions to our interactive video retrieval system VIRET, which won the Video Browser Showdown in 2018 and achieved second place at the Video Browser Showdown 2019 and the Lifelog Search Challenge 2019. The new features of the system focus both on updates of retrieval models and on interface modifications to help users with query specification by means of informative visualizations.

Jakub Lokoč, Gregor Kovalčík, Tomáš Souček

### SOM-Hunter: Video Browsing with Relevance-to-SOM Feedback Loop

This paper presents a prototype video retrieval engine focusing on a simple known-item search workflow, where users initialize the search with a query and then use an iterative approach to explore a larger candidate set. Specifically, users gradually observe a sequence of displays and provide feedback to the system. The displays are dynamically created by a self organizing map that employs the scores based on the collected feedback, in order to provide a display matching the user preferences. In addition, users can inspect various other types of specialized displays for exploitation purposes, once promising candidates are found.

Miroslav Kratochvíl, Patrik Veselý, František Mejzlík, Jakub Lokoč

### Exquisitor at the Video Browser Showdown 2020

When browsing large video collections, human-in-the-loop systems are essential. The system should understand the semantic information need of the user and interactively help formulate queries to satisfy that information need based on data-driven methods. Full synergy between the interacting user and the system can only be obtained when the system learns from the user interactions while providing immediate response. Doing so with dynamically changing information needs for large scale multimodal collections is a challenging task. To push the boundary of current methods, we propose to apply the state of the art in interactive multimodal learning to the complex multimodal information needs posed by the Video Browser Showdown (VBS). To that end we adapt the Exquisitor system, a highly scalable interactive learning system. Exquisitor combines semantic features extracted from visual content and text to suggest relevant media items to the user, based on user relevance feedback on previously suggested items. In this paper, we briefly describe the Exquisitor system, and its first incarnation as a VBS entrant.

Björn Þór Jónsson, Omar Shahbaz Khan, Dennis C. Koelma, Stevan Rudinac, Marcel Worring, Jan Zahálka

### Deep Learning-Based Video Retrieval Using Object Relationships and Associated Audio Classes

This paper introduces a video retrieval tool for the 2020 Video Browser Showdown (VBS). The tool enhances the user’s video browsing experience by making full use of a video analysis database constructed prior to the Showdown. Deep learning based object detection, scene text detection, scene color detection, audio classification and relation detection with scene graph generation methods have been used to construct the data. The data is composed of visual, textual, and auditory information, broadening the scope of what a user can search for beyond visual information. In addition, the tool provides a simple and user-friendly interface that lets novice users adapt to the tool in little time.

Byoungjun Kim, Ji Yea Shim, Minho Park, Yong Man Ro

### IVIST: Interactive VIdeo Search Tool in VBS 2020

This paper presents a new video retrieval tool, Interactive VIdeo Search Tool (IVIST), which participates in the 2020 Video Browser Showdown (VBS). As a video retrieval tool, IVIST is equipped with suitable, high-performing functionalities such as object detection, dominant-color finding, scene-text recognition and text-image retrieval. These functionalities are constructed with various deep neural networks. By adopting these functionalities, IVIST performs well in finding the videos users are looking for. Furthermore, thanks to its user-friendly interface, IVIST is easy to use even for novice users. Although IVIST is developed to participate in VBS, we hope that it will be applied as a practical video retrieval tool in the future, dealing with actual video data on the Internet.

Sungjune Park, Jaeyub Song, Minho Park, Yong Man Ro

### Backmatter