The two-volume set LNCS 11961 and 11962 constitutes the thoroughly refereed proceedings of the 26th International Conference on MultiMedia Modeling, MMM 2020, held in Daejeon, South Korea, in January 2020.

Of the 171 submitted full research papers, 40 papers were selected for oral presentation and 46 for poster presentation; 28 special session papers were selected for oral presentation and 8 for poster presentation; in addition, 9 demonstration papers and 6 papers for the Video Browser Showdown 2020 were accepted. The papers of LNCS 11961 are organized in the following topical sections: audio and signal processing; coding and HVS; color processing and art; detection and classification; face; image processing; learning and knowledge representation; video processing; poster papers; the papers of LNCS 11962 are organized in the following topical sections: poster papers; AI-powered 3D vision; multimedia analytics: perspectives, tools and applications; multimedia datasets for repeatable experimentation; multi-modal affective computing of large-scale multimedia data; multimedia and multimodal analytics in the medical domain and pervasive environments; intelligent multimedia security; demo papers; and VBS papers.

### Correction to: MultiMedia Modeling

The original version of this book was revised. Due to a technical error, the first volume editor did not appear in the volumes of the MMM 2020 proceedings. This was corrected and the first volume editor was added.

Yong Man Ro, Wen-Huang Cheng, Junmo Kim, Wei-Ta Chu, Peng Cui, Jung-Woo Choi, Min-Chun Hu, Wesley De Neve

### Light Field Reconstruction Using Dynamically Generated Filters

Densely sampled light fields have already shown unique advantages in applications such as depth estimation, refocusing, and 3D presentation, but they are difficult and expensive to acquire. Commodity portable light field cameras, such as Lytro and Raytrix, are easy to carry and easy to operate. However, due to the camera design, there is a trade-off between spatial and angular resolution, which cannot both be sampled densely at the same time. In this paper, we present a novel learning-based light field reconstruction approach to increase the angular resolution of a sparsely sampled light field image. Our approach treats the reconstruction problem as a filtering operation on the sub-aperture images of the input light field and uses a deep neural network to estimate the filtering kernels for each sub-aperture image. Our network adopts a U-Net structure to extract feature maps from the input sub-aperture images and the angular coordinate of the novel view; a filter-generating component is then designed for kernel estimation. We compare our method with existing light field reconstruction methods with and without depth information. Experiments show that our method achieves much better results both visually and quantitatively.

Xiuxiu Jing, Yike Ma, Qiang Zhao, Ke Lyu, Feng Dai

### Speaker-Aware Speech Emotion Recognition by Fusing Amplitude and Phase Information

The use of a convolutional neural network (CNN) for extracting deep acoustic features from spectrograms has become one of the most commonly used methods for speech emotion recognition. In those studies, however, common amplitude information is chosen as input with no special attention to phase-related or speaker-related information. In this paper, we propose a multi-channel method employing amplitude and phase channels for speech emotion recognition. Two separate CNN channels are adopted to extract deep features from amplitude spectrograms and modified group delay (MGD) spectrograms. Then a concatenation layer is used to fuse the features. Furthermore, to gain more robust features, speaker information is considered in the stage of emotional feature extraction. Finally, the fused features that incorporate speaker-related information are fed into an extreme learning machine (ELM) to distinguish emotions. Experiments are conducted on the Emo-DB database to evaluate the proposed model. Results demonstrate an average F1 of 94.82%, which significantly outperforms the baseline CNN-ELM model based on amplitude-only spectrograms, with a 39.27% relative error reduction.

Lili Guo, Longbiao Wang, Jianwu Dang, Zhilei Liu, Haotian Guan
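The relative error reduction quoted in the abstract above can be related to absolute scores with a little arithmetic. The sketch below assumes that "error" means 100 minus the F1 score, which the abstract does not state explicitly; the function names are illustrative, and the implied baseline is only as reliable as that assumption.

```python
def relative_error_reduction(baseline_score: float, new_score: float) -> float:
    """Relative error reduction (in %) when error is taken as 100 - score."""
    baseline_error = 100.0 - baseline_score
    new_error = 100.0 - new_score
    return 100.0 * (baseline_error - new_error) / baseline_error


def implied_baseline_score(new_score: float, reduction_pct: float) -> float:
    """Invert the formula: baseline score implied by a new score and a reduction."""
    new_error = 100.0 - new_score
    baseline_error = new_error / (1.0 - reduction_pct / 100.0)
    return 100.0 - baseline_error


# Figures quoted in the abstract: 94.82% average F1, 39.27% relative reduction.
baseline = implied_baseline_score(94.82, 39.27)
print(f"implied baseline F1 under this assumption: {baseline:.2f}%")
```

The two functions are inverses of each other, so feeding the implied baseline back into `relative_error_reduction` recovers the quoted 39.27%.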

### Gen-Res-Net: A Novel Generative Model for Singing Voice Separation

Modeling in the time-frequency domain is the most common approach to singing voice separation, since frequency characteristics differ between sources. During the past few years, applying recurrent neural networks (RNNs) to series of split spectrograms has been the approach most widely adopted by researchers to tackle this problem. Recently, however, the U-net’s success has drawn the focus to treating the spectrogram as a 2-dimensional image with an auto-encoder structure, which indicates that some useful methods in image analysis may help solve this problem. Under this scenario, we propose a novel spectrogram-generative model to separate the two sources in the time-frequency domain, inspired by Residual blocks, Squeeze and Excitation blocks and WaveNet. We apply non-size-reducing Residual blocks together with Squeeze and Excitation blocks in the main stream to extract features of the input spectrogram, while gathering the output layers in a skip-connection structure as used in WaveNet. Experimental results on two datasets (MUSDB18 and CCMixer) show that our proposed network performs better than the current state-of-the-art approach working on spectrograms of mixtures, the deep U-net structure.

Congzhou Tian, Hangyu Li, Deshun Yang, Xiaoou Chen

### A Distinct Synthesizer Convolutional TasNet for Singing Voice Separation

Deep learning methods have been used for music source separation for several years and have proved to be very effective. Most of them choose the Fourier transform as the front-end process to obtain a spectrogram representation, which has its drawbacks. The spectrogram representation may be well suited for humans to understand sounds, but it is not necessarily the best representation for powerful neural networks performing singing voice separation. TasNet (Time-domain Audio Separation Network) has been proposed recently to solve monaural speech separation in the time domain by modeling each source as a weighted sum of a common set of basis signals. The fully convolutional TasNet proposed subsequently achieves great improvements in speech separation. In this paper, we first show that convolutional TasNet can also be used for singing voice separation and brings improvements on the DSD100 dataset in the singing voice separation task. Then, based on the fact that in singing voice separation the difference between the singing voice and the accompaniment is far more pronounced than the difference between the voices of two different people in speech separation, we employ separate sets of basis signals and separate encoder outputs for the singing voice and the accompaniment, respectively, which yields a further improved model, distinct synthesizer convolutional TasNet (ds-cTasNet).

Congzhou Tian, Deshun Yang, Xiaoou Chen
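The idea of modeling each source as a weighted sum of basis signals, and of giving the singing voice and the accompaniment separate bases, can be illustrated with a toy example. This is a minimal sketch with hand-picked, hypothetical bases and weights; in the networks described above both would be learned, and nothing here reproduces the actual architecture.

```python
from typing import List, Sequence


def synthesize(weights: Sequence[float], bases: Sequence[Sequence[float]]) -> List[float]:
    """Reconstruct one frame as a weighted sum of basis signals."""
    if len(weights) != len(bases):
        raise ValueError("need exactly one weight per basis signal")
    frame_len = len(bases[0])
    frame = [0.0] * frame_len
    for w, basis in zip(weights, bases):
        for t in range(frame_len):
            frame[t] += w * basis[t]
    return frame


# Hypothetical, hand-picked bases (a real model learns them from data).
voice_bases = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
accomp_bases = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]

# Hypothetical per-source weights, standing in for separate encoder outputs.
voice = synthesize([0.5, 0.25], voice_bases)
accompaniment = synthesize([0.1, 0.2], accomp_bases)

# In this toy setting the mixture is simply the sum of the two sources.
mixture = [v + a for v, a in zip(voice, accompaniment)]
print(voice, accompaniment, mixture)
```

Using a shared basis set would amount to passing the same `bases` argument for both sources; the separate lists above mirror the "distinct synthesizer" idea only at this schematic level.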

### Exploiting the Importance of Personalization When Selecting Music for Relaxation

Listening to music is not just a hobby or a leisure activity, but rather a way to achieve a specific emotional or psychological state, or even to better perform an activity, e.g., relaxation. Therefore, making the right choice of music for this purpose is fundamental. In the area of Music Information Retrieval (MIR), many works try to classify, in a general way, songs that are better suited to certain activities or contexts, but few works first ask whether personalization is an essential criterion for selecting songs in this context. Thus, in order to investigate whether personalization plays a vital role in this context, more specifically in relaxation, we: (i) analyze more than 60 thousand playlists created by more than 5 thousand users from the 8tracks platform; (ii) extract high- and low-level features from the songs of these playlists; (iii) create a user perception based on these features; (iv) identify user groups by their perceptions; (v) analyze the contrasts between these groups, comparing their most relevant features. Our results help to understand how personalization is essential in the context of relaxing music, paving the way for more informed MIR systems.

### An Efficient Encoding Method for Video Compositing in HEVC

Video compositing for compressed HEVC streams is in high demand in instant communication applications such as video chat. As a flexible scheme, pixel domain compositing involves the processes of completely decoding the streams, inserting the decoded video, and finally re-encoding the new composite video. The time consumption of the whole scheme comes almost entirely from the re-encoding process. In this paper, we propose an efficient encoding method to speed up the re-encoding process and improve encoding efficiency. The proposed method separately designs encoding for blocks inside the frame region covered by the inserted video and blocks in the non-inserted region, which overcomes numerous difficulties of utilizing information from the decoding process. Experimental results show that the proposed method achieves a PSNR increase of 0.33 dB, or a bit rate saving of 10.11% on average, compared with normal encoding using unmodified HM software. Furthermore, the encoding speed is 7.04 times that of the normal encoding method, equivalent to an average reduction of 85.8% in computational complexity.

Yunchang Li, Zhijie Huang, Jun Sun
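The two speed figures in the abstract above are two views of the same quantity: if encoding runs s times faster, the time spent falls by a fraction 1 - 1/s. A quick check, assuming that computational complexity is measured as encoding time (the function name is illustrative):

```python
def time_reduction_pct(speedup: float) -> float:
    """Percentage of time saved when a task runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 100.0 * (1.0 - 1.0 / speedup)


# Speedup quoted in the abstract: 7.04x.
print(round(time_reduction_pct(7.04), 1))  # prints 85.8
```

An average of per-sequence reductions need not equal the reduction computed from an average speedup, so exact agreement between the two reported figures is not guaranteed in general.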

There is a large amount of valuable video archives in Video Home System (VHS) format. However, due to their analog nature, their quality is often poor. Compared to high-definition television (HDTV), VHS video not only has a dull color appearance but also has a lower resolution and often appears blurry. In this paper, we focus on the problem of translating VHS video to HDTV video and have developed a solution based on a novel unsupervised multi-task adversarial learning model. Inspired by the success of the generative adversarial network (GAN) and CycleGAN, we employ cycle consistency loss, adversarial loss and perceptual loss together to learn a translation model. An important innovation of our work is the incorporation of a super-resolution model and a color transfer model that can solve the unsupervised multi-task problem. To our knowledge, this is the first work dedicated to the study of the relation between VHS and HDTV and the first computational solution to translate VHS to HDTV. We present experimental results to demonstrate the effectiveness of our solution qualitatively and quantitatively.

Hongming Luo, Guangsen Liao, Xianxu Hou, Bozhi Liu, Fei Zhou, Guoping Qiu

### Improving Just Noticeable Difference Model by Leveraging Temporal HVS Perception Characteristics

Temporal HVS characteristics are not fully exploited in conventional JND models. In this paper, we improve the spatio-temporal JND model by fully leveraging the temporal HVS characteristics. From the viewpoint of visual attention, we investigate two related factors, positive stimulus saliency and negative uncertainty. This paper measures the stimulus saliency according to two stimulus-driven parameters, relative motion and duration along the motion trajectory, and measures the uncertainty according to two uncertainty-driven parameters, global motion and residue intensity fluctuation. These four different parameters are measured with self-information and information entropy, and unified for fusion with homogeneity. As a result, a novel temporal JND adjustment weight model is proposed. Finally, we fuse the spatial JND model and temporal JND weight to form the spatio-temporal JND model. The experiment results verify that the proposed JND model yields significant performance improvement with much higher capability of distortion concealment compared to state-of-the-art JND profiles.

Haibing Yin, Yafen Xing, Guangjing Xia, Xiaofeng Huang, Chenggang Yan
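Self-information and entropy, the two measures named in the abstract above, have standard definitions. The sketch below applies them to a hypothetical histogram of quantized motion magnitudes; how the paper estimates the underlying probabilities is not stated in the abstract, so the histogram and the names are assumptions.

```python
import math
from typing import Sequence


def self_information(p: float) -> float:
    """Self-information in bits of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits: the expected self-information."""
    return sum(p * self_information(p) for p in probs if p > 0.0)


# Hypothetical histogram of quantized motion magnitudes in a frame.
counts = [8, 4, 2, 2]
total = sum(counts)
probs = [c / total for c in counts]

# Rare values carry more self-information, i.e. they are more surprising.
surprise = [self_information(p) for p in probs]
print(surprise, entropy(probs))
```

Read this way, a high self-information value marks an unexpected stimulus, while a high entropy marks an unpredictable distribution overall.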

### Down-Sampling Based Video Coding with Degradation-Aware Restoration-Reconstruction Deep Neural Network

Recently, deep learning techniques have shown remarkable progress in image/video super-resolution. These techniques can be employed in a video coding system to improve the quality of the decoded frames. However, unlike in conventional super-resolution work, the compression artifacts in the decoded frames must be taken into account. The straightforward solution is to apply artifact removal techniques before super-resolution. Nevertheless, some helpful features may be removed together with the artifacts, and the remaining artifacts can be exaggerated. To address these problems, we design an end-to-end restoration-reconstruction deep neural network (RR-DnCNN) using degradation-aware techniques. RR-DnCNN is applied to the down-sampling based video coding system. On the encoder side, the original video is down-sampled and compressed. On the decoder side, the decompressed down-sampled video is fed to the RR-DnCNN, which recovers the original video by removing the compression artifacts and performing super-resolution. Moreover, in order to enhance the network's learning capabilities, uncompressed low-resolution images/videos are utilized as ground truth. The experimental results show that our work can obtain over 8% BD-rate reduction compared to the standard H.265/HEVC. Furthermore, our method also performs better at reducing compression artifacts in subjective comparison. Our work is available at https://github.com/minhmanho/rrdncnn .

Minh-Man Ho, Gang He, Zheng Wang, Jinjia Zhou

### Beyond Literal Visual Modeling: Understanding Image Metaphor Based on Literal-Implied Concept Mapping

Existing multimedia content understanding tasks focus on modeling the literal semantics of multimedia documents. This study explores the possibility of understanding the implied meaning behind the literal semantics. Inspired by the human process of implied imagination, we introduce a three-step solution framework based on the mapping from literal to implied concepts by integrating external knowledge. Experiments on a self-collected metaphor image dataset validate its effectiveness in identifying accurate implied concepts for further metaphor understanding in a controlled environment.

Chengpeng Fu, Jinqiang Wang, Jitao Sang, Jian Yu, Changsheng Xu

### Deep Palette-Based Color Decomposition for Image Recoloring with Aesthetic Suggestion

Color editing is an important issue in image processing and graphic design. This paper presents a framework for image recoloring based on deep color decomposition, allowing users to achieve professional color editing through simple interactive operations. Different from existing methods that perform palette generation and color decomposition separately, our method directly generates the recolored images by a light-weight CNN. We first formulate the generation of the color palette as an unsupervised clustering problem, and employ a fully point-wise CNN to extract the most representative colors from the input image. In particular, a pixel scrambling strategy is adopted to map the continuous image color to a compact discrete palette space, helping the CNN focus on color-relevant features. Then, we devise a deep color decomposition network to obtain the projected weights of the input image on the basis colors of the generated palette space, and leverage them for image recoloring in a user-interactive manner. In addition, a novel aesthetic constraint derived from color harmony theory is proposed to augment the color reconstruction from user-specified colors, resulting in an aesthetically pleasing visual effect. Qualitative comparisons with existing methods demonstrate the effectiveness of our proposed method.

Zhengqing Li, Zhengjun Zha, Yang Cao

### On Creating Multimedia Interfaces for Hybrid Biological-Digital Art Installations

This paper discusses the application of real-time multimedia technologies to artworks that feature novel interfaces between human and non-human organisms (in this case plants and bacteria). Two projects are discussed: Microbial Sonorities, a real-time generative sound artwork based on bacterial voltages and machine learning, and PlantConnect, a real-time multimedia artwork that explores human-plant interaction via the human act of breathing, the bioelectrical and photosynthetic activity of plants, and computational intelligence to bring the two together. Part of larger investigations into alternative models for the creation of shared experiences and understanding with the natural world, these projects explore complexity and emergent phenomena by harnessing the material agency of non-human organisms and the capacity of emerging multimedia technologies as mediums for information transmission, communication and interconnectedness between the human and non-human. While primarily focusing on technical descriptions of these projects, this paper also hopes to open up dialog about how the combination of emerging multimedia technologies and the often aleatoric unpredictability that living organisms can exhibit can be beneficial to digital arts and entertainment applications.

Carlos Castellanos, Bello Bello, Hyeryeong Lee, Mungyu Lee, Yoo Seok Lee, In Seop Chang

### Image Captioning Based on Visual and Semantic Attention

Most existing image captioning methods use only the visual information of the image to guide caption generation and lack the guidance of effective scene semantic information; moreover, current visual attention mechanisms cannot adjust the focus intensity on the image. In this paper, we first propose an improved visual attention model. At each time step, we calculate the focus intensity coefficient of the attention mechanism from the context information of the model, and automatically adjust the focus intensity of the attention mechanism through this coefficient, so as to extract more accurate visual information from the image. In addition, we represent the scene semantic information of the image through topic words related to the image scene, and add them to the language model. We use an attention mechanism to determine the visual information and scene semantic information that the model attends to at each time step, and combine them to guide the model to generate more accurate and scene-specific captions. Finally, we evaluate our model on the MSCOCO dataset. The experimental results show that our approach can generate more accurate captions and outperforms many recent advanced models on various evaluation metrics.

Haiyang Wei, Zhixin Li, Canlong Zhang
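One common way to make an attention distribution sharper or flatter is to scale the scores before the softmax. The sketch below uses that device to illustrate what a focus intensity coefficient could do; the scaling form and the parameter name `focus` are assumptions for illustration, not the formulation used in the paper.

```python
import math
from typing import List, Sequence


def attention_weights(scores: Sequence[float], focus: float = 1.0) -> List[float]:
    """Softmax over focus * scores; a larger focus concentrates the weights."""
    scaled = [focus * s for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted sum of region feature vectors (the attended context vector)."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Hypothetical relevance scores and 2-D features for three image regions.
scores = [2.0, 1.0, 0.5]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

soft = attention_weights(scores, focus=0.5)   # diffuse attention
sharp = attention_weights(scores, focus=4.0)  # concentrated attention
context = attend(regions, sharp)
print(soft, sharp, context)
```

With a small `focus` the weights spread across regions; with a large one almost all weight goes to the highest-scoring region, which is the kind of adjustable behavior the abstract describes.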

### An Illumination Insensitive and Structure-Aware Image Color Layer Decomposition Method

Decomposing an image into a set of color layers can facilitate many image editing manipulations, but high-quality layering remains challenging. We propose a novel illumination-insensitive and structure-aware layer decomposition approach. To reduce the influence of non-uniform illumination on color appearance, we design a scheme that lets the decomposition work on the reflectance image output by intrinsic decomposition, rather than on the original image commonly used in previous work. To obtain fine layers, we leverage image-specific structure information and enforce it by encoding a structure-aware prior into a novel energy minimization formulation. The proposed optimization considers the fidelity, our structure-aware prior and permissible ranges simultaneously. We provide a solver for this optimization to obtain the final layers. Experiments demonstrate that our method can obtain finer layers compared to several state-of-the-art methods.

Wengang Cheng, Pengli Dou, Dengwen Zhou

### CartoonRenderer: An Instance-Based Multi-style Cartoon Image Translator

Instance-based photo cartoonization is a challenging image stylization task that aims at transforming realistic photos into cartoon-style images while preserving the semantic contents of the photos. State-of-the-art deep neural network (DNN) methods still fail to produce satisfactory results with input photos in the wild, especially for photos that have high contrast and are full of rich textures. This is because cartoon-style images tend to have smooth color regions and emphasized edges, which conflict with realistic photos that require clear semantic contents, i.e., textures, shapes, etc. Previous methods have difficulty in satisfying cartoon-style textures and preserving semantic contents at the same time. In this work, we propose a novel “CartoonRenderer” framework which utilizes a single trained model to generate multiple cartoon styles. In a nutshell, our method maps a photo into a feature model and renders the feature model back into image space. In particular, cartoonization is achieved by conducting transformation manipulations in the feature space with our proposed Soft-AdaIN. Extensive experimental results show that our method produces higher-quality cartoon-style images than prior art, with accurate semantic content preservation. In addition, due to the decoupling of the whole generating process into “Modeling-Coordinating-Rendering” parts, our method can easily process higher-resolution photos, which is intractable for existing methods.

Yugang Chen, Muchun Chen, Chaoyue Song, Bingbing Ni

### Multi-condition Place Generator for Robust Place Recognition

As an image retrieval task, visual place recognition (VPR) encounters two technical challenges: appearance variations resulting from external environment changes and the lack of cross-domain paired training data. To overcome these challenges, a multi-condition place generator (MPG) is introduced for data generation. The objective of MPG is two-fold: (1) synthesizing realistic place samples corresponding to multiple conditions; (2) preserving the place identity information during the generation procedure. While MPG smooths the appearance disparities under various conditions, it also suffers from image distortion. For this reason, we propose the relative quality based triplet (RQT) loss by reshaping the standard triplet loss such that it down-weights the loss assigned to low-quality images. By taking advantage of the innovations mentioned above, a condition-invariant VPR model is trained without labeled training data. Comprehensive experiments show that our method outperforms state-of-the-art algorithms by a large margin on several challenging benchmarks.

Yiting Cheng, Yankai Wang, Lizhe Qi, Wenqiang Zhang
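The idea of down-weighting the triplet loss for low-quality generated images can be sketched as follows. The abstract does not give the exact form of the RQT loss, so the multiplicative `quality` weight in [0, 1] below is purely an assumed stand-in, and all names are hypothetical.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin: float = 0.5) -> float:
    """Standard triplet loss: pull the positive closer than the negative."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def quality_weighted_triplet_loss(anchor, positive, negative,
                                  quality: float, margin: float = 0.5) -> float:
    """Triplet loss scaled by an assumed quality score of the generated image."""
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be in [0, 1]")
    return quality * triplet_loss(anchor, positive, negative, margin)


# Hypothetical embeddings: the positive is far from the anchor, the negative near.
anchor, positive, negative = [0.0, 0.0], [3.0, 4.0], [0.0, 1.0]

plain = triplet_loss(anchor, positive, negative)
distorted = quality_weighted_triplet_loss(anchor, positive, negative, quality=0.5)
print(plain, distorted)
```

A distorted synthetic sample thus contributes a smaller gradient signal than a clean one, which is the effect the abstract attributes to the reshaped loss.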

### Guided Refine-Head for Object Detection

Lingyun Zeng, You Song, Wenhai Wang

### Towards Accurate Panel Detection in Manga: A Combined Effort of CNN and Heuristics

Panels are the fundamental elements of manga pages, and hence their detection serves as the basis of high-level manga content understanding. Existing panel detection methods can be categorized into heuristic-based methods and CNN-based (Convolutional Neural Network-based) ones. Although the former can accurately localize panels, they cannot handle elaborate panels well and require considerable effort to hand-craft rules for every new hard case. In contrast, detection results of CNN-based methods can be rough and inaccurate. We utilize CNN object detectors to propose coarse guide panels, then use heuristics to propose panel candidates, and finally optimize an energy function to select the most plausible candidates. The CNN ensures roughly localized detection of almost all kinds of panels, while the follow-up procedure refines the detection results and minimizes the margin between detected panels and ground truth with the help of heuristics and energy minimization. Experimental results show that the proposed method surpasses previous methods in panel detection F1-score and page accuracy.

Yafeng Zhou, Yongtao Wang, Zheqi He, Zhi Tang, Ching Y. Suen

### Subclass Deep Neural Networks: Re-enabling Neglected Classes in Deep Network Training for Multimedia Classification

During minibatch gradient-based optimization, the contribution of observations to the updating of the deep neural network’s (DNN’s) weights for enhancing the discrimination of certain classes can be small, despite the fact that these classes may still have a large generalization error. This happens, for instance, due to overfitting, i.e. to classes whose error in the training set is negligible, or simply when the contributions of the misclassified observations to the updating of the weights associated with these classes cancel out. To alleviate this problem, a new criterion is proposed for identifying the so-called “neglected” classes during the training of DNNs, i.e. the classes whose optimization stalls early in the training procedure. Moreover, based on this criterion, a novel cost function is proposed that extends the cross-entropy loss using subclass partitions to boost the generalization performance of the neglected classes. In this way, the network is guided to emphasize the extraction of features that are discriminant for the classes that are prone to being neglected during the optimization procedure. The proposed framework can be easily applied to improve the performance of various DNN architectures. Experiments on several publicly available benchmarks, including the large-scale YouTube-8M (YT8M) video dataset, show the efficacy of the proposed method. Source code is publicly available at https://github.com/bmezaris/subclass_deep_neural_networks .

Nikolaos Gkalelis, Vasileios Mezaris

### Automatic Material Classification Using Thermal Finger Impression

Natural surfaces offer the opportunity to provide augmented reality interactions in everyday environments without the use of cumbersome body-mounted equipment. One of the key techniques for detecting user interactions with natural surfaces is thermal imaging, which captures the body heat transmitted onto the surface. A major challenge for these systems is detecting user swipe pressure on different material surfaces with high accuracy. This is because the amount of heat transferred from the user's body to a natural surface depends on the thermal properties of the material. If the surface material type is known, these systems can use a material-specific pressure classifier to improve the detection accuracy. In this work, we address this problem by proposing a novel approach that can detect the material type from a user’s thermal finger impression on a surface. Our technique requires the user to touch a surface with a finger for 2 s. The recorded heat dissipation time series of the thermal finger impression is then analyzed in a classification framework for material identification. We studied the interaction of 15 users on 7 different material types, and our algorithm achieves 74.65% material classification accuracy on the test data in a user-independent manner.

Jacob Gately, Ying Liang, Matthew Kolessar Wright, Natasha Kholgade Banerjee, Sean Banerjee, Soumyabrata Dey

### Face Attributes Recognition Based on One-Way Inferential Correlation Between Attributes

Attributes recognition for faces in the wild is attracting increasing attention with the rapid development of computer vision. Most prior work tends to apply a separate model for a single attribute or for attributes in the same region, which easily loses the correlation information between attributes. Correlation (e.g., one-way inferential correlation) between face attributes, which is neglected by many studies, contributes to better performance in face attributes recognition. In this paper, we propose a face attributes recognition model based on one-way inferential correlation (OIR) between face attributes (e.g., the inferential correlation from goatee to gender). To that end, we propose a method to find such correlation based on the data imbalance of each attribute, and design an OIR-related attributes classifier using such correlation. Furthermore, we cut the face region into multiple parts according to the category of attributes, and use a novel approach to face feature extraction for all regional parts via transfer learning focusing on multiple neural layers. Experimental evaluations on a benchmark with multiple face attributes show the effectiveness of our proposed model in terms of recognition accuracy and computational cost.

Hongkong Ge, Jiayuan Dong, Liyan Zhang

### Eulerian Motion Based 3DCNN Architecture for Facial Micro-Expression Recognition

Facial micro-expressions are fast and subtle muscular movements, which typically reveal the underlying mental state of an individual. Due to the low intensity and short duration of micro-expressions, micro-expression recognition is a huge challenge. Our method adopts a new pre-processing technique based on Eulerian video magnification (EVM) for micro-expression recognition. Further, we propose a micro-expression recognition framework based on the simple yet effective Eulerian motion-based 3D convolution network (EM-C3D). First, Eulerian motion feature maps are extracted by employing a temporal filtering approach at multiple spatial scales; then the multi-frame Eulerian motion feature maps are directly fed into the 3D convolution network with a global attention module (GAM) to encode rich spatiotemporal information, instead of being added to the raw images. Our algorithm achieves a state-of-the-art result of 69.76% accuracy and a 65.75% recall rate on the CASME II dataset, surpassing all baselines. Cross-domain experiments are also performed to verify the robustness of the algorithm.

Yahui Wang, Huimin Ma, Xinpeng Xing, Zeyu Pan

### Emotion Recognition with Facial Landmark Heatmaps

Facial expression recognition is a very challenging problem and has attracted increasing attention from researchers. In this paper, considering that facial expression recognition is closely related to the features of key facial regions, we propose a facial expression recognition network that explicitly utilizes landmark heatmap information to precisely capture the most discriminative features. In addition to directly adding the information of facial fiducial points in the form of landmark heatmaps, we also propose an end-to-end network structure, the heatmap aiding emotion network (HAE-Net), which embeds a landmark detection module based on the stacked hourglass network into the facial expression recognition network. Experiments on the CK+, RAF and AffectNet databases show that our method achieves better results than state-of-the-art methods, which demonstrates that adding landmark information, as well as joint training of landmark detection and expression recognition, is beneficial for improving recognition performance.

Siyi Mo, Wenming Yang, Guijin Wang, Qingmin Liao

### One-Shot Face Recognition with Feature Rectification via Adversarial Learning

One-shot face recognition has attracted extensive attention for its ability to recognize persons at just one glance. With only one training sample, which cannot represent intra-class variance adequately, one-shot classes have poor generalization ability, and it is difficult to obtain appropriate classification weights. In this paper, we explore an inherent relationship between features and classification weights. In detail, we propose a feature rectification generative adversarial network (FR-GAN), which is able to rectify features to be closer to the corresponding classification weights by considering existing classification weight information. With one model, we achieve two purposes: without the fine-tuning via back propagation used by previous CNN approaches, which is time consuming and computationally expensive, FR-GAN can not only (1) generate classification weights for new classes using training data, but also (2) achieve a more discriminative test feature representation. The experimental results demonstrate the remarkable performance of our proposed method: on the MS-Celeb-1M one-shot benchmark, our method achieves 93.12% coverage at 99% precision with the introduction of novel classes and maintains a high accuracy of 99.80% for base classes, surpassing most of the previous approaches based on fine-tuning.

Jianli Zhou, Jun Chen, Chao Liang, Jin Chen

### Visual Sentiment Analysis by Leveraging Local Regions and Human Faces

Visual sentiment analysis (VSA) is a challenging task that attracts wide attention from researchers for its great application potential. Existing works on VSA mostly extract global representations of images for sentiment prediction, ignoring the different contributions of local regions. Some recent studies analyze local regions separately and improve sentiment prediction performance. However, most of them either treat regions equally in the feature fusion process, which ignores their distinct contributions, or use a global attention map whose performance is easily degraded by noise from non-emotional regions. In this paper, to solve these problems, we propose an end-to-end deep framework that effectively exploits the contributions of local regions to VSA. Specifically, a Sentiment Region Attention (SRA) module is proposed to estimate the contributions of local regions with respect to the global image sentiment. Features of these regions are then reweighted and fused according to their estimated contributions. Moreover, since image sentiment is usually closely related to the humans appearing in the image, we also propose to model the contribution of human faces as a special local region for sentiment prediction. Experimental results on publicly available and widely used VSA datasets demonstrate that our method outperforms state-of-the-art algorithms.

Ruolin Zheng, Weixin Li, Yunhong Wang

### Prediction-Error Value Ordering for High-Fidelity Reversible Data Hiding

Prediction-error expansion (PEE) is the most widely utilized reversible data hiding (RDH) technique. However, the performance of PEE is far from optimal since the correlations among prediction-errors are not fully exploited. To enhance the embedding performance of PEE, a new RDH method named prediction-error value ordering (PEVO) is proposed in this paper. The main idea of PEVO is to exploit the inter-correlations of prediction-errors by combining PEE with the recent RDH technique of pixel value ordering. Specifically, the prediction-errors within an image block are first sorted, and then the maximum and minimum prediction-errors of the block are predicted and modified for data embedding. With the proposed approach, the image redundancy is better exploited and promising embedding performance is achieved. Experimental results demonstrate that the proposed PEVO embedding method outperforms PEE-based methods and some other state-of-the-art works.

Tong Zhang, Xiaolong Li, Wenfa Qi, Zongming Guo
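
The sort-then-modify step described in the abstract can be illustrated with the classic pixel-value-ordering rule applied to a block of integers. This is only a minimal sketch under stated assumptions: the abstract does not give PEVO's exact embedding rule, so the code below uses the well-known PVO rule on the block maximum only (difference 1 carries a bit, larger differences are shifted, difference 0 is left alone), and the function names `embed_max` and `extract_max` are hypothetical.

```python
def _top_two(values):
    """Indices of the largest and second-largest entries (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    return order[-1], order[-2]


def embed_max(errors, bit):
    """Try to embed one bit into the largest value of a block.

    Assumption: classic PVO rule, not necessarily the rule used by PEVO.
    Returns (marked_block, embedded) where `embedded` tells whether `bit` was used.
    """
    i_max, i_second = _top_two(errors)
    diff = errors[i_max] - errors[i_second]
    marked = list(errors)
    if diff == 1:            # expandable bin: carries the payload bit
        marked[i_max] += bit
        return marked, True
    if diff > 1:             # shifted bin: moved by 1 to keep the map invertible
        marked[i_max] += 1
    return marked, False     # diff == 0 is left untouched


def extract_max(marked):
    """Invert embed_max. Returns (original_block, bit), with bit None if none was embedded."""
    i_max, i_second = _top_two(marked)
    diff = marked[i_max] - marked[i_second]
    restored = list(marked)
    if diff == 1:
        return restored, 0
    if diff == 2:
        restored[i_max] -= 1
        return restored, 1
    if diff > 2:
        restored[i_max] -= 1
    return restored, None
```

Because the maximum only ever grows, the sorted order of the block is unchanged by embedding, which is what lets the decoder find the same two positions and restore the block exactly.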

### Classroom Attention Analysis Based on Multiple Euler Angles Constraint and Head Pose Estimation

Classroom attention analysis aims to capture rich semantic information to analyze how students are reacting to the lecture. However, constructing a uniform attention model of students in the classroom is challenging. Each student is an individual, and it is hard to make a unified judgment. The orientation of the head reflects the direction of attention, but changes in posture and spatial position can interfere with the estimated direction of attention. To address these issues, this paper proposes a scoring module that converts head Euler angles into classroom attention scores. This module takes the head Euler angles in three directions as input and introduces spatial information to correct the angles. The key idea of the proposed method lies in introducing the mutual constraint of multiple Euler angles together with head spatial information, aiming to make the attention model less susceptible to differences in head information. The mutual constraint of multiple Euler angles provides more accurate head information, while the head spatial information is used to correct the angles. Extensive experiments using classroom video data demonstrate that the proposed method achieves more accurate results.

Xin Xu, Xin Teng

### Multi-branch Body Region Alignment Network for Person Re-identification

Person re-identification (Re-ID) aims to identify images of the same person in a gallery set captured across different cameras. Human pose variations, background clutter and misalignment of detected human images pose challenges for Re-ID tasks. To deal with these issues, we propose a Multi-branch Body Region Alignment Network (MBRAN) to learn discriminative representations for person Re-ID. It consists of two modules, i.e., body region extraction and feature learning. The body region extraction module utilizes a single-person pose estimation method to estimate human keypoints and obtain three body regions. In the feature learning module, four global or local branch networks share base layers and are designed to learn feature representations on the three overlapping body regions and the global image. Extensive experiments indicate that our method outperforms several state-of-the-art methods on two mainstream person Re-ID datasets.

Han Fang, Jun Chen, Qi Tian

### DeepStroke: Understanding Glyph Structure with Semantic Segmentation and Tabu Search

Glyphs in many writing systems (e.g., Chinese) are composed of a sequence of strokes written in a specific order. Glyph structure interpretation (i.e., stroke extraction) is one of the most important processing steps in many tasks, including aesthetic quality evaluation, handwriting synthesis and character recognition. However, existing methods that rely heavily on accurate shape matching are not only time-consuming but also unsatisfactory in stroke extraction performance. In this paper, we propose a novel method based on semantic segmentation and tabu search to interpret the structure of Chinese glyphs. Specifically, we first employ an improved Fully Convolutional Network (FCN), DeepStroke, to extract strokes, and then use tabu search to obtain the order in which these strokes are drawn. We also build the Chinese Character Stroke Segmentation Dataset (CCSSD), consisting of 67630 character images that are evenly distributed across 10 different font styles. This dataset provides a benchmark for both stroke extraction and semantic segmentation tasks. Experimental results demonstrate the effectiveness and efficiency of our method and validate its superiority over the state of the art.

Wenguang Wang, Zhouhui Lian, Yingmin Tang, Jianguo Xiao

### 3D Spatial Coverage Measurement of Aerial Images

Unmanned aerial vehicles (UAVs) such as drones are becoming increasingly prevalent in both daily life (e.g., event coverage, tourism) and critical situations (e.g., disaster management, military operations), generating an unprecedented number of aerial images and videos. UAVs are usually equipped with various sensors (e.g., GPS, accelerometers and gyroscopes) and thus provide sufficient spatial metadata to describe the spatial extent (referred to as the aerial field-of-view) of recorded imagery. Such spatial metadata can be used efficiently to answer a fundamental question: how well does a collection of aerial imagery cover a certain area spatially? Answering it requires evaluating the adequacy of the collected aerial imagery and estimating its sufficiency. This paper provides an answer to such questions by introducing 3D spatial coverage measurement models that collectively quantify the spatial and directional coverage of a geo-tagged aerial image dataset. Through experimental evaluation using real datasets, the paper demonstrates that the proposed models can be implemented with highly efficient computation of 3D space geometry.

Abdullah Alfarrarjeh, Zeyu Ma, Seon Ho Kim, Cyrus Shahabi

### Instance Image Retrieval with Generative Adversarial Training

While generative adversarial training has become a promising technology for many computer vision tasks, especially in the image processing domain, few works so far have applied it to instance-level image retrieval. In this paper, we propose an instance-level image retrieval method with generative adversarial training (ILRGAN). In this proposal, adversarial training is adopted in the retrieval procedure. Both the generator and the discriminator are redesigned for the retrieval task: the generator tries to retrieve similar images and passes them to the discriminator, and the discriminator tries to pick out the dissimilar images among those retrieved and then passes its decision back to the generator. The generator and discriminator play a min-max game until the generator retrieves images among which the discriminator can no longer identify dissimilar ones. Experiments on four widely used databases show that adversarial training works for instance-level image retrieval and that the proposed ILRGAN achieves promising retrieval performance.

Hongkai Li, Cong Bai, Ling Huang, Yugang Jiang, Shengyong Chen

### An Effective Way to Boost Black-Box Adversarial Attack

Deep neural networks (DNNs) are vulnerable to adversarial examples. Generally speaking, adversarial examples are crafted by adding a small-magnitude perturbation to input samples; the perturbation hardly affects human observers’ decisions but leads well-trained models to misclassify. Most existing iterative adversarial attack methods suffer from low success rates when fooling models in a black-box manner. We find that this is because perturbations neutralize each other during the iterative process. To address this issue, we propose a novel boosted iterative method to effectively improve success rates. We conduct experiments on the ImageNet dataset with five normally trained classification models. The experimental results show that our proposed strategy can significantly improve the success rates of fooling models in a black-box manner. Furthermore, it also outperforms the momentum iterative method (MI-FGSM), which won first place in the NeurIPS Non-targeted Adversarial Attack and Targeted Adversarial Attack competitions.

Xinjie Feng, Hongxun Yao, Wenbin Che, Shengping Zhang
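
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of the momentum iterative fast gradient sign method on a toy logistic model. It does not reproduce the boosted method proposed in the paper, whose details are not given in the abstract. The toy model, the helper `loss_gradient`, and all parameter values are assumptions chosen for illustration.

```python
import math


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def loss_gradient(x, w, y):
    """Gradient w.r.t. x of the logistic loss log(1 + exp(-y * w.x)), with y in {-1, +1}.

    Toy stand-in (assumption) for back-propagating through a real classifier.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def mi_fgsm(x, w, y, eps=0.3, steps=10, mu=1.0):
    """Momentum iterative FGSM under an L-infinity budget `eps`."""
    alpha = eps / steps                      # per-step size
    momentum = [0.0] * len(x)
    adv = list(x)
    for _ in range(steps):
        grad = loss_gradient(adv, w, y)
        l1 = sum(abs(g) for g in grad) or 1.0
        # Accumulate the L1-normalised gradient into the momentum term.
        momentum = [mu * m + g / l1 for m, g in zip(momentum, grad)]
        # Ascend the loss along the sign of the accumulated direction.
        adv = [a + alpha * sign(m) for a, m in zip(adv, momentum)]
        # Project back into the eps-ball around the clean input.
        adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(adv, x)]
    return adv
```

The momentum term is what keeps successive steps from cancelling one another, which is the failure mode the abstract attributes to plain iterative attacks.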

### Crowd Knowledge Enhanced Multimodal Conversational Assistant in Travel Domain

We present a new solution towards building a crowd knowledge enhanced multimodal conversational system for travel. It aims to assist users in completing various travel-related tasks, such as searching for restaurants or things to do, through multimodal conversation involving both text and images. To achieve this goal, we ground this research in a combination of multimodal understanding and recommendation techniques, which explores the possibility of a more convenient information seeking paradigm. Specifically, we build the system in a modular manner, where each module is enriched with crowd knowledge from social sites. To the best of our knowledge, this is the first work that attempts to build intelligent multimodal conversational systems for travel, and it takes an important step towards developing human-like assistants for completing daily life tasks. Several current challenges are also pointed out as our future directions.

Lizi Liao, Lyndon Kennedy, Lynn Wilcox, Tat-Seng Chua

### Improved Model Structure with Cosine Margin OIM Loss for End-to-End Person Search

End-to-end person search is a novel task that integrates pedestrian detection and person re-identification (re-ID) into a joint optimization framework. However, the pedestrian features learned by most existing methods are not discriminative enough, owing to the potential adverse interaction between the detection and re-ID tasks and the limited discriminative power of the re-ID loss. To this end, we propose an Improved Model Structure (IMS) with a novel re-ID loss function called Cosine Margin Online Instance Matching (CM-OIM) loss. Firstly, we design a model structure more suitable for person search, which alleviates the adverse interaction between the detection and re-ID parts by reasonably decreasing the number of network layers they share. Then, we conduct a full investigation of the weight of the re-ID loss, which we argue plays an important role in end-to-end person search models. Finally, we improve the Online Instance Matching (OIM) loss by adopting a more robust online update strategy and introducing a cosine margin into it to increase the intra-class compactness of the learned features. Extensive experiments on two challenging datasets, CUHK-SYSU and PRW, demonstrate that our approach outperforms the state of the art.

Haoran Chen, Minghua Zhu, Xuesong Cai, Jufeng Luo, Yunzhou Qiu
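
To make the idea of adding a cosine margin to an instance-matching loss concrete, here is a minimal sketch of a margin-penalised softmax over a lookup table of identity prototypes. The exact CM-OIM formulation is not given in the abstract, so this is an assumption-laden illustration: it omits OIM's queue of unlabeled identities and its online table update, and the names `cosine_margin_loss`, `scale` and `margin` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def cosine_margin_loss(feature, table, target, scale=10.0, margin=0.25):
    """Cross-entropy over scaled cosine similarities, with a margin on the target.

    feature: embedding of one detected person.
    table:   one prototype vector per labeled identity (assumed lookup table).
    target:  index of the true identity in `table`.
    """
    f = l2_normalize(feature)
    logits = []
    for j, proto in enumerate(table):
        cos = sum(a * b for a, b in zip(f, l2_normalize(proto)))
        if j == target:
            cos -= margin        # the true class must win by at least `margin`
        logits.append(scale * cos)
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[target]
```

Subtracting the margin only from the target similarity raises the loss for features that are merely "closest" to their prototype, which pushes same-identity features to cluster more tightly.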

### Effective Barcode Hunter via Semantic Segmentation in the Wild

Barcodes are widely used for product identification in many scenarios. However, locating them in product images is challenging. Partial occlusion, distortion, darkness, or targets that are too small to recognize often add to the difficulty for conventional methods. In this paper, we introduce a large-scale, diverse barcode dataset and adopt a deep learning-based semantic segmentation approach to address these problems. Specifically, we use an efficient method to synthesize 30000 well-annotated images containing diverse barcode labels, yielding Barcode-30k, a large-scale dataset of barcodes in the wild with accurate pixel-level annotations. Moreover, to locate barcodes more precisely, we further propose an effective barcode hunter, BarcodeNet. It is a semantic segmentation model based on a CNN (Convolutional Neural Network) and is mainly formed of two novel modules, the Prior Pyramid Pooling Module (P3M) and the Pyramid Refine Module (PRM). Additional ablation studies further demonstrate the effectiveness of BarcodeNet, which yields a high mIoU of 95.36% on the proposed synthetic Barcode-30k validation set. To prove the practical value of the whole system, we test BarcodeNet, trained on the Barcode-30k training set, on a manually annotated testing set collected only from cameras; it achieves an mIoU of 90.3%, which is a very accurate result for practical applications.

Feng Ni, Xixin Cao
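
The mIoU figures quoted in the abstract refer to mean intersection-over-union, the standard segmentation metric. As a reminder of how it is computed, here is a minimal sketch over flattened label maps. The function name `mean_iou` and the convention of skipping classes absent from both maps are assumptions, and the authors' evaluation script may differ in such details.

```python
def mean_iou(pred, truth, num_classes):
    """Mean intersection-over-union between two flattened label maps.

    pred, truth: equal-length sequences of integer class labels, one per pixel.
    Classes that appear in neither map are skipped (an assumed convention).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Usage: class 0 = background, class 1 = barcode.
prediction = [0, 0, 1, 1]
annotation = [0, 1, 1, 1]
score = mean_iou(prediction, annotation, num_classes=2)  # (1/2 + 2/3) / 2
```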

### Wonderful Clips of Playing Basketball: A Database for Localizing Wonderful Actions

Video highlight detection, or wonderful clip localization, aims at automatically discovering interesting clips in untrimmed videos and can be applied to a variety of real-world scenarios. To support its study, a video dataset of Wonderful Clips of Playing Basketball (WCPB) is developed in this work. The Segment-Convolutional Neural Network (S-CNN), a state-of-the-art model for temporal action localization, is adopted to localize wonderful clips, and a two-stream S-CNN is designed that outperforms the original S-CNN on WCPB. The WCPB dataset presents the specific meaning of wonderful clips and annotations in playing basketball and enables the measurement of performance and progress in other realistic scenarios.

Qinyu Li, Lijun Chen, Hanli Wang, Xianhui Liu

### Structural Pyramid Network for Cascaded Optical Flow Estimation

To achieve a better balance between accuracy and computational complexity in optical flow estimation, a Structural Pyramid Network (StruPyNet) is designed to combine structural pyramid processing and feature pyramid processing. To distribute parameters and computations efficiently among all structural pyramid levels, the proposed StruPyNet flexibly cascades more small flow estimators at lower structural pyramid levels and fewer small flow estimators at higher structural pyramid levels. The greater focus on low-resolution feature matching facilitates large-motion flow estimation and background flow estimation. In addition, residual flow learning, feature warping and cost volume construction are repeatedly performed by the cascaded small flow estimators, which benefits training and estimation. Compared with state-of-the-art convolutional neural network-based methods, StruPyNet achieves better performance in most cases. The model size of StruPyNet is 95.9% smaller than that of FlowNet2, and StruPyNet runs 3 frames per second faster than FlowNet2. Moreover, the experimental results on several benchmark datasets demonstrate the effectiveness of StruPyNet, which performs particularly well for large-motion estimation.

Zefeng Sun, Hanli Wang, Yun Yi, Qinyu Li

### Real-Time Multiple Pedestrians Tracking in Multi-camera System

This paper proposes a novel 3D real-time multi-view multi-target tracking framework featuring a global-to-local tracking-by-detection model and a 3D tracklet-to-tracklet data association scheme. To enable real-time performance, the former maximizes the utilization of temporal differences between consecutive frames, resulting in a great reduction of the average frame-wise detection time. Meanwhile, the latter accurately performs multi-target data association across multiple views to calculate the 3D trajectories of tracked objects. Comprehensive experiments on the PETS-09 dataset demonstrate the outstanding performance of the proposed method in terms of efficiency and accuracy in multi-target 3D trajectory tracking tasks, compared to state-of-the-art methods.

Muchun Chen, Yugang Chen, Truong Tan Loc, Bingbing Ni

### Learning Multi-feature Based Spatially Regularized and Scale Adaptive Correlation Filters for Visual Tracking

Visual object tracking is a challenging problem in computer vision. Although correlation filter-based trackers have achieved competitive results in both accuracy and robustness in recent years, there is still a need to improve overall tracking performance. In this paper, to tackle the problems caused by the Spatially Regularized Discriminative Correlation Filter (SRDCF), we suggest a single-sample-integrated training scheme that utilizes information from the previous frames and the current frame to improve the robustness of training samples. Moreover, manually designed features and deep convolutional features are integrated to further boost the overall tracking capacity. To optimize the translation filter, we develop an alternating direction method of multipliers (ADMM) algorithm. In addition, we introduce a scale-adaptive filter to directly learn the appearance changes induced by variations in the target scale. Extensive empirical evaluations on the TB-50, TB-100 and OTB-2013 datasets demonstrate that the proposed tracker is very promising for various challenging scenarios.

Ying She, Yang Yi

### Unsupervised Video Summarization via Attention-Driven Adversarial Learning

This paper presents a new video summarization approach that integrates an attention mechanism to identify the significant parts of the video and is trained in an unsupervised manner via generative adversarial learning. Starting from the SUM-GAN model, we first develop an improved version of it (called SUM-GAN-sl) that has a significantly reduced number of learned parameters, performs incremental training of the model’s components, and applies a stepwise label-based strategy for updating the adversarial part. Subsequently, we introduce an attention mechanism to SUM-GAN-sl in two ways: (i) by integrating an attention layer within the variational auto-encoder (VAE) of the architecture (SUM-GAN-VAAE), and (ii) by replacing the VAE with a deterministic attention auto-encoder (SUM-GAN-AAE). Experimental evaluation on two datasets (SumMe and TVSum) documents the contribution of the attention auto-encoder to faster and more stable training of the model, resulting in a significant performance improvement with respect to the original model and demonstrating the competitiveness of the proposed SUM-GAN-AAE against the state of the art (software publicly available at: https://github.com/e-apostolidis/SUM-GAN-AAE).

Evlampios Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, Vasileios Mezaris, Ioannis Patras

### Efficient HEVC Downscale Transcoding Based on Coding Unit Information Mapping

In this paper, a novel method that utilizes the information of coding unit (CU) from source video to accelerate the downscale transcoding process for High Efficiency Video Coding (HEVC) is proposed. Specifically, the CU depth and prediction mode information are first extracted into matrices in block level according to the decoded source video. Then we use the matrices to predict CU depth and prediction mode at CU level in the target video. Finally, some effective rules are introduced to determine CU partition and prediction mode based on the prediction. Our approach supports the spatial downscale transcoding with any spatial resolution downscaling ratio. Experiments show that the proposed method can achieve an average time reduction of 59.3% compared to the reference HEVC encoder, with a relatively small Bjøntegaard Delta Bit rate (BDBR) increment on average. Moreover, our approach is also competitive compared to the state-of-the-art spatial resolution downscale transcoding methods for HEVC.

Zhijie Huang, Yunchang Li, Jun Sun

### Fine-Grain Level Sports Video Search Engine

Helping people find relevant video content of interest in massive collections of sports videos has become an urgent demand. We have designed and developed a sports video search engine based on a distributed architecture, which aims to provide users with content-based video analysis and retrieval services. In the sports video search engine, we focus on event detection, highlights analysis and image retrieval. Our work has several advantages: (I) For event detection, a CNN and an RNN are used to extract features and integrate dynamic information, and a new sliding window model is used for multi-length event detection. (II) For highlights analysis, an improved method based on a self-adapting dual threshold and dominant color percentage is used to detect shot boundaries, and an affect arousal method is used for highlights extraction. (III) For image indexing and retrieval, a hyper-spherical soft assignment method is proposed to generate image descriptors, enhanced residual vector quantization is presented to construct a multi-inverted index, and two adaptive retrieval methods based on hyper-spherical filtration are used to improve time efficiency. (IV) All of the above algorithms are implemented on the distributed platform that we developed for massive video data processing.

Zikai Song, Junqing Yu, Hengyou Cai, Yangliu Hu, Yi-Ping Phoebe Chen

### The Korean Sign Language Dataset for Action Recognition

Recently, computer vision technologies have shown excellent performance in complex tasks such as behavior recognition. Accordingly, several studies have proposed datasets for behavior recognition, including sign language recognition. In many countries, researchers are carrying out studies to automatically recognize and interpret sign language to facilitate communication with deaf people. However, there is not yet a dataset for recognizing the sign language used in Korea, and research on this topic is insufficient. Since sign language varies from country to country, it is valuable to build a dataset for Korean sign language. Therefore, this paper proposes a dataset of videos of isolated signs from Korean sign language that can also be used for behavior recognition using deep learning. We present the Korean Sign Language (KSL) dataset. The dataset is composed of video clips of 77 Korean sign language words performed by 20 deaf people. We train and evaluate deep learning networks that have recently achieved excellent performance in the behavior recognition task on this dataset. We have also confirmed, through a deconvolution-based visualization method, that the deep learning network fully understands the characteristics of the dataset.

Seunghan Yang, Seungjun Jung, Heekwang Kang, Changick Kim

### SEE-LPR: A Semantic Segmentation Based End-to-End System for Unconstrained License Plate Detection and Recognition

Most previous works regard License Plate detection and Recognition (LPR) as two or more separate tasks, which often leads to error accumulation and low efficiency. Recently, several new studies have used end-to-end training to overcome these problems and achieve better results. However, challenges such as misalignment and variable-length or multi-language LPs still exist. In this paper, we propose SEE-LPR, a novel Semantic segmentation based End-to-End multilingual LPR system, to address these challenges. Our system has four components: a convolution backbone, LP capture, LP alignment, and LP recognition. Specifically, LP alignment connects LP capture and LP recognition, which allows the gradient to back-propagate through the whole network and enables the system to handle oblique LPs. The Connectionist Temporal Classification (CTC) module used in LP recognition enables our system to handle variable-length or multi-language LPs. Comparative studies on several challenging benchmark datasets show that the proposed SEE-LPR system significantly outperforms state-of-the-art systems in both accuracy and efficiency.

Dongqi Tang, Hao Kong, Xi Meng, Ruo-Ze Liu, Tong Lu
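
The reason a CTC output layer copes with variable-length plates is that decoding collapses a fixed number of per-frame predictions into a shorter string. Below is a minimal sketch of greedy (best-path) CTC decoding. The alphabet, the blank index and the function name `ctc_greedy_decode` are assumptions for illustration and are not taken from SEE-LPR.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Best-path CTC decoding.

    frame_scores: one list of per-class scores for each time step.
    alphabet:     symbol for each class index; alphabet[blank] is never emitted.
    """
    # 1. Pick the highest-scoring class at every time step.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2. Merge runs of the same class into a single occurrence.
    collapsed = [label for label, _ in groupby(best)]
    # 3. Drop blanks; a blank between two identical labels keeps both of them.
    return "".join(alphabet[label] for label in collapsed if label != blank)


def one_hot(index, size):
    """Helper for the usage example: a score vector that peaks at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Usage: eight frames decode to a four-character string.
symbols = ["-", "A", "B", "7"]
frames = [one_hot(i, 4) for i in (1, 1, 0, 1, 2, 2, 0, 3)]
plate = ctc_greedy_decode(frames, symbols)
```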

### Action Co-localization in an Untrimmed Video by Graph Neural Networks

We present an efficient approach for action co-localization in an untrimmed video by exploiting contextual and temporal features from multiple action proposals. Most existing action localization methods focus on each individual action instance without accounting for the correlations among instances. To exploit such correlations, we propose the Graph-based Temporal Action Co-Localization (G-TACL) method, which aggregates contextual features from multiple action proposals to assist temporal localization. This aggregation procedure is achieved with Graph Neural Networks whose nodes are initialized with the action proposal representations. In addition, a multi-level consistency evaluator is proposed to measure the similarity between any two proposals; it summarizes low-level temporal coincidences, feature vector dot products and high-level contextual feature similarities. Subsequently, these nodes are iteratively updated with a Gated Recurrent Unit (GRU), and the obtained node features are used to regress the temporal boundaries of the action proposals and finally to localize the action instances. Experiments on the THUMOS’14 and MEXaction2 datasets demonstrate the efficacy of our proposed method.

Changbo Zhai, Le Wang, Qilin Zhang, Zhanning Gao, Zhenxing Niu, Nanning Zheng, Gang Hua

### A Novel Attention Enhanced Dense Network for Image Super-Resolution

Deep convolutional neural networks (CNNs) have recently achieved impressive performance in image super-resolution (SR). However, they usually treat spatial features and channel-wise features indiscriminately and fail to take full advantage of hierarchical features, which restricts their adaptive ability. To address these issues, we propose a novel attention enhanced dense network (AEDN) that adaptively recalibrates each kernel and feature for different inputs by integrating both spatial attention (SA) and channel attention (CA) modules into the proposed network. In experiments, we explore the effect of the attention mechanism and present quantitative and qualitative evaluations, whose results show that the proposed AEDN outperforms state-of-the-art methods by effectively suppressing artifacts and faithfully recovering more high-frequency image details.

Zhong-Han Niu, Yang-Hao Zhou, Yu-Bin Yang, Jian-Cong Fan

### Marine Biometric Recognition Algorithm Based on YOLOv3-GAN Network

With the rise of the marine ranching field, object recognition applications on underwater catching robots have become increasingly popular. However, due to uneven underwater lighting, underwater images are prone to problems such as color distortion and underexposure, which seriously affect the accuracy of underwater object recognition. In this work, we propose a marine biometric recognition algorithm based on a YOLOv3-GAN network, which jointly optimizes the image enhancement loss (LossGAN) and the classification and location loss (LossYOLO) during training. This differs from traditional underwater object recognition approaches, which usually consider image enhancement and object detection separately. Consequently, our proposed algorithm is more powerful in marine biological identification. Moreover, in the detection part of the network, the anchor boxes are further modified by clustering object box sizes with the k-means method, which brings the anchor boxes more in line with object sizes. The experimental results demonstrate that the mean Average Precision (mAP) and the Recall of the YOLOv3-GAN network are more than 6.4% and 4.8% higher, respectively, than those of the YOLOv3 network. In addition, the image enhancement part of the YOLOv3-GAN network can provide high-quality images that benefit other applications in the marine surveillance field.

Ping Liu, Hongbo Yang, Jingnan Fu
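
The anchor-box step mentioned in the abstract can be sketched as k-means over box widths and heights. The abstract does not state which distance the authors use; the sketch below assumes the IoU-based assignment popularised by the YOLO family, with a deterministic initialisation chosen only to keep the example reproducible. The names `iou_wh` and `kmeans_anchors` are hypothetical.

```python
def iou_wh(box, anchor):
    """IoU of two (width, height) boxes aligned at a common corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    return inter / (box[0] * box[1] + anchor[0] * anchor[1] - inter)


def kmeans_anchors(boxes, k, iterations=50):
    """Cluster (width, height) pairs into k anchors, assigning by highest IoU.

    Assumption: anchors start from k boxes evenly spaced in area-sorted order.
    """
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    anchors = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda j: iou_wh(box, anchors[j]))
            clusters[best].append(box)
        # New anchor = mean width and height of its cluster (kept as-is if empty).
        updated = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c)) if c else a
            for c, a in zip(clusters, anchors)
        ]
        if updated == anchors:   # assignments have stabilised
            break
        anchors = updated
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Usage with six hypothetical ground-truth box sizes (in pixels).
sizes = [(10, 10), (12, 11), (11, 12), (50, 60), (55, 58), (52, 62)]
small, large = kmeans_anchors(sizes, k=2)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, since a fixed pixel error matters far more for a small object than for a large one.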

### Multi-scale Spatial Location Preference for Semantic Segmentation

This paper proposes a semantic segmentation network that addresses the problem of adaptive segmentation for objects of different sizes. In this work, ResNetV2-50 is first used to extract object features, and these features are then fed into a reconstructed feature pyramid network (FPN), which includes a multi-scale preference (MSP) module and a multi-location preference (MLP) module. To handle objects of different sizes, the receptive fields of kernels need to be adjusted. The MSP module concatenates feature maps with different receptive fields and then combines them with the SE block from SE-Net to obtain scale-wise dependencies. In this way, not only can multi-scale information be adaptively encoded in feature maps with different levels of preference, but multi-scale spatial information can also be provided to the MLP module. The MLP module combines, with preference, the channels containing more accurate spatial location information to replace the traditional nearest-neighbor interpolation upsampling in FPN. Finally, the weighted channels are equipped with scale-wise information as well as more accurate spatial location information and yield precise semantic predictions for objects of different sizes. We demonstrate the effectiveness of the proposed solutions on the Cityscapes and PASCAL VOC 2012 semantic image segmentation datasets, where our methods achieve comparable or higher performance.

Qiuyuan Han, Jin Zheng

### HRTF Representation with Convolutional Auto-encoder

The head-related transfer function (HRTF) can be considered a filter that describes how a sound from an arbitrary spatial direction is transferred to the listener’s eardrums. HRTFs can be used to synthesize vivid virtual 3D sound that seems to come from any spatial location, which gives them an important role in 3D audio technology. However, the complexity and variation of the auditory cues inherent in HRTFs make it difficult to set up an accurate mathematical model with conventional methods. In this paper, we put forward an HRTF representation model based on a convolutional auto-encoder (CAE), a type of auto-encoder that contains convolutional layers in the encoder part and deconvolution layers in the decoder part. The experimental evaluation on the ARI HRTF database shows that the proposed model provides very good results on dimensionality reduction of HRTFs.

Wei Chen, Ruimin Hu, Xiaochen Wang, Dengshi Li

### Unsupervised Feature Propagation for Fast Video Object Detection Using Generative Adversarial Networks

We propose an unsupervised Feature Propagation Generative Adversarial Network (denoted as FPGAN) for fast video object detection in this paper. In our video object detector, we detect objects on sparse key frames using the pre-trained state-of-the-art object detector R-FCN and propagate CNN features to adjacent frames for fast detection via a light-weight transformation network. To learn the feature propagation network, we make full use of unlabeled video data and employ generative adversarial networks in model training. Specifically, in FPGAN, the generator is the feature propagation network, and the discriminator employs second-order temporal coherence and 3D ConvNets to distinguish between predicted and “ground truth” CNN features. In addition, a Euclidean distance loss provided by the pre-trained image object detector is also adopted to jointly supervise the learning. Our method does not need any human labelling of videos. Experiments on the large-scale ImageNet VID dataset demonstrate the effectiveness of our method.

Xuan Zhang, Guangxing Han, Wenduo He

### OmniEyes: Analysis and Synthesis of Artistically Painted Eyes

Faces in artistic paintings most often contain the same elements (eyes, nose, mouth...) as faces in the real world; however, they are not a photo-realistic transfer of physical visual content. The creative nuances that artists introduce in their work act as interference when facial detection models are used in the artistic domain. In this work we introduce models that can accurately detect, classify and conditionally generate artistically painted eyes in portrait paintings. In addition, we introduce the OmniEyes Dataset that captures the essence of painted eyes with annotated patches from 250K artistic paintings and their metadata. We evaluate our approach in inpainting, out-of-context eye generation and classification on portrait paintings from the OmniArt dataset. We conduct a user case study to further study the quality of our generated samples, assess their aesthetic aspects and provide quantitative and qualitative results for our model’s performance.

Gjorgji Strezoski, Rogier Knoester, Nanne van Noord, Marcel Worring

### LDSNE: Learning Structural Network Embeddings by Encoding Local Distances

Network embedding algorithms learn low-dimensional features from the relationships and attributes of networks. The basic principle of these algorithms is to preserve the similarities in the original networks as much as possible. However, existing algorithms are not expressive enough for structural identity similarities. Therefore, we propose LDSNE, a novel algorithm for learning structural representations in both directed and undirected networks. Networks are first mapped into a proximity-based low-dimensional space. Then, structural embeddings are extracted by encoding local space distances. Empirical results demonstrate that our algorithm can obtain multiple types of representations and outperforms other state-of-the-art methods.

Xiyue Gao, Jun Chen, Jing Yao, Wenqian Zhu

### FurcaNeXt: End-to-End Monaural Speech Separation with Dynamic Gated Dilated Temporal Convolutional Networks

Deep dilated temporal convolutional networks (TCN) have proven to be very effective in sequence modeling. In this paper we propose several improvements of TCN for an end-to-end approach to monaural speech separation, which consist of (1) multi-scale dynamic weighted gated TCN with a pyramidal structure (FurcaPy), (2) gated TCN with intra-parallel convolutional components (FurcaPa), (3) weight-shared multi-scale gated TCN (FurcaSh) and (4) dilated TCN with gated subtractive-convolutional component (FurcaSu). All these networks take the mixed utterance of two speakers and map it to two separated utterances, where each utterance contains only one speaker’s voice. For the objective, we propose to train the networks by directly optimizing utterance-level signal-to-distortion ratio (SDR) in a permutation invariant training (PIT) style. Our experiments on the public WSJ0-2mix data corpus result in 18.4 dB SDR improvement, which shows that our proposed networks can lead to performance improvement on the speaker separation task.

Liwen Zhang, Ziqiang Shi, Jiqing Han, Anyan Shi, Ding Ma
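
The training objective described in the FurcaNeXt abstract, utterance-level SDR optimized in a permutation invariant style, can be illustrated with a minimal sketch. It assumes the plain definition SDR = 10·log10(‖s‖² / ‖s − ŝ‖²); the abstract does not say which SDR variant the authors use, and the function names `sdr_db` and `pit_sdr` are hypothetical.

```python
import math
from itertools import permutations


def sdr_db(reference, estimate, eps=1e-12):
    """Plain signal-to-distortion ratio in dB (an assumed, simplified definition)."""
    signal = sum(r * r for r in reference)
    distortion = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (distortion + eps))


def pit_sdr(references, estimates):
    """Try every assignment of network outputs to speakers and keep the best one.

    Returns (mean SDR in dB, permutation), where permutation[i] is the index
    of the estimate assigned to reference speaker i.
    """
    best_score, best_perm = -math.inf, None
    for perm in permutations(range(len(estimates))):
        score = sum(
            sdr_db(references[i], estimates[p]) for i, p in enumerate(perm)
        ) / len(references)
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm


# Two toy "utterances"; the estimates come out in swapped order.
refs = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
ests = [[0.0, 0.9, 0.0, 1.1], [1.1, 0.0, 0.9, 0.0]]
score, perm = pit_sdr(refs, ests)
print(perm, round(score, 2))
```

In training, the negative of the best-permutation SDR would serve as the loss, so the network is not penalized for emitting the speakers in a different order than the labels.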

### Multi-step Coding Structure of Spatial Audio Object Coding

Spatial audio object coding (SAOC) is an effective method which compresses multiple audio objects and provides flexibility for personalized rendering in interactive services. It divides each frame signal into 28 sub-bands and extracts one set of object spatial parameters for each sub-band. In this way, objects can be coded into a downmix signal and a few parameters. However, using the same parameters within one sub-band causes frequency aliasing distortion, which seriously impacts the listening experience. Existing studies to improve SAOC cannot guarantee that all audio objects can be decoded well. This paper describes a new multi-step object coding structure to efficiently calculate the residual of each object as additional side information to compensate for the aliasing distortion of each object. In this multi-step structure, a sorting strategy based on the sub-band energy of each object is proposed to determine which audio object should be encoded in each step, because the object encoding order affects the final decoded quality. Singular value decomposition (SVD) is used to reduce the increase in bit-rate due to the added side information. The experimental results show that the proposed method performs better than SAOC and SAOC-TSC, and that each object can be decoded well with respect to the bit-rate and the sound quality.

Chenhao Hu, Ruimin Hu, Xiaochen Wang, Tingzhao Wu, Dengshi Li

### Thermal Face Recognition Based on Transformation by Residual U-Net and Pixel Shuffle Upsampling

We present a thermal face recognition system that first transforms the given face in the thermal spectrum into the visible spectrum, and then recognizes the transformed face by matching it with the face gallery. To achieve high-fidelity transformation, the U-Net structure with a residual network backbone is developed for generating visible face images from thermal face images. Our work mainly improves upon previous works on the Nagoya University thermal face dataset. In the evaluation, we show that the rank-1 recognition accuracy can be improved by more than 10%. The improvement on visual quality of transformed faces is also measured in terms of PSNR (with 0.36 dB improvement) and SSIM (with 0.07 improvement).

Soumya Chatterjee, Wei-Ta Chu
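
Two ingredients named in the thermal face recognition entry, pixel shuffle upsampling (from the title) and PSNR (from the evaluation), can be sketched in a few lines. This is an illustration on nested Python lists, not the authors' implementation; the channel ordering follows the common sub-pixel convolution convention, which is an assumption here, and the names `pixel_shuffle` and `psnr` are hypothetical.

```python
import math


def pixel_shuffle(x, r):
    """Rearrange nested lists of shape (C*r*r, H, W) into (C, H*r, W*r)."""
    channels_in, height, width = len(x), len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [
        [[0.0] * (width * r) for _ in range(height * r)]
        for _ in range(channels_out)
    ]
    for c in range(channels_out):
        for i in range(r):
            for j in range(r):
                # Each group of r*r input channels fills one r-by-r output block.
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


# Four 1x1 channels become one 2x2 channel when r = 2.
print(pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], 2))
```

Pixel shuffle trades channel depth for spatial resolution, so the upsampling itself has no learned weights; the preceding convolution supplies the r² channels per output channel.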

### K-SVD Based Point Cloud Coding for RGB-D Video Compression Using 3D Super-Point Clustering

In this paper, we present a novel 3D structure-aware RGB-D video compression scheme, which applies the proposed 3D super-point clustering to partition the super-points in a colored point cloud, generated from an RGB-D image, into centroid and non-centroid super-point datasets. A super-point is a set of 3D points which are characterized by similar feature vectors. Given an input RGB-D frame, the camera parameters are first used to generate a colored point cloud, which is segmented into multiple super-points using our multiple principal plane analysis (MPPA). These super-points are then grouped into multiple clusters, each of them characterized by a centroid super-point. Next, the median feature vectors of super-points are represented by K singular value decomposition (K-SVD) based sparse codes. Within a super-point cluster, the sparse codes of the median feature vectors are very similar, and thus the redundant information among them is easy to remove by the subsequent entropy coding. For each super-point, the residual super-point is computed by subtracting the feature vectors inside from the reconstructed median feature vector. These residual feature vectors are also collected and coded using the K-SVD based sparse coding to enhance the quality of the compressed point cloud. This process results in a multiple description coding scheme for 3D point cloud compression. Finally, the compressed point cloud is projected to the 2D image space to obtain the compressed RGB-D image. Experiments demonstrate the effectiveness of our approach, which attains better performance than the current state-of-the-art point cloud compression methods.

Shyi-Chyi Cheng, Ting-Lan Lin, Ping-Yuan Tseng

### Resolution Booster: Global Structure Preserving Stitching Method for Ultra-High Resolution Image Translation

Current image translation networks (for instance image-to-image translation, style transfer, etc.) have strict limitations on input image resolution due to high spatial complexity, which leaves a wide gap to their usage in practice. In this paper we propose a novel patch-based auxiliary architecture, called Resolution Booster, to endow a trained image translation network with the ability to process ultra-high resolution images. Different from previous methods which compute the results with the entire image, our network processes a resized global image at low resolution as well as high-resolution local patches to save memory. To increase the quality of the generated image, we exploit the rough global information with a global branch and high-resolution information with a local branch, then combine the results with a designed reconstruction network. A joint global/local stitching result is then produced. Our network can be flexibly deployed on any existing image translation method to enable it to process larger images. Experimental results show both the capability of processing much higher resolution images without decreasing the generation quality compared with baseline methods and the generality of our model for flexible deployment.

Siying Zhai, Xiwei Hu, Xuanhong Chen, Bingbing Ni, Wenjun Zhang

### Cross Fusion for Egocentric Interactive Action Recognition

The characteristics of egocentric interactive videos, which include heavy ego-motion, frequent viewpoint changes and multiple types of activities, hinder the action recognition methods of third-person vision from obtaining satisfactory results. In this paper, we introduce an effective architecture with two branches and a cross fusion method for action recognition in egocentric interactive vision. The two branches are responsible for modeling the information from observers and inter-actors respectively, and each branch is designed based on multimodal multi-stream C3D networks. We leverage cross fusion to establish effective linkages between the two branches, which aims to reduce redundant information and fuse complementary features. In addition, we propose variable sampling to obtain discriminative snippets for training. Experimental results demonstrate that the proposed architecture achieves superior performance over several state-of-the-art methods on two benchmarks.

Haiyu Jiang, Yan Song, Jiang He, Xiangbo Shu

### Improving Brain Tumor Segmentation with Dilated Pseudo-3D Convolution and Multi-direction Fusion

Convolutional neural networks have shown their dominance in many computer vision tasks and have been broadly used for medical image analysis. Unlike traditional image-based tasks, medical data is often in 3D form. 2D networks designed for images show poor performance and efficiency on these tasks. Although 3D networks work better, their computation and memory costs are rather high. To solve this problem, we decompose 3D convolution to decouple volumetric information, in the same way human experts treat volume data. Inspired by the concept of three medically-defined planes, we further propose a Multi-direction Fusion (MF) module, using three branches of this factorized 3D convolution in parallel to simultaneously extract features from three different directions and assemble them together. Moreover, we suggest introducing dilated convolution to preserve resolution and enlarge the receptive field for segmentation. The network with the proposed modules (MFNet) achieves competitive performance with other state-of-the-art methods on the BraTS 2018 brain tumor segmentation task and is much more light-weight. We believe this is an effective and efficient way for volume-based medical segmentation.

Sun’ao Liu, Hai Xu, Yizhi Liu, Hongtao Xie
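
A short arithmetic sketch shows why factorizing a 3D convolution saves weights and how dilation enlarges the receptive field, the two ideas in the brain tumor segmentation entry. It assumes one common pseudo-3D form, a k×k×1 in-plane convolution followed by a 1×1×k convolution with the same channel width; the abstract does not specify the authors' exact factorization, and all function names are hypothetical.

```python
def conv3d_params(k, c_in, c_out):
    """Weight count of a full k x k x k 3D convolution (bias ignored)."""
    return k * k * k * c_in * c_out


def pseudo3d_params(k, c_in, c_out):
    """Weight count of a k x k x 1 convolution followed by a 1 x 1 x k one.

    Assumes the intermediate feature map also has c_out channels.
    """
    in_plane = k * k * c_in * c_out
    through_plane = k * c_out * c_out
    return in_plane + through_plane


def effective_kernel(k, dilation):
    """Extent covered along one axis by a dilated kernel of size k."""
    return dilation * (k - 1) + 1


full = conv3d_params(3, 32, 32)
factorized = pseudo3d_params(3, 32, 32)
print(full, factorized, round(factorized / full, 3))
print(effective_kernel(3, 2))
```

With k = 3 and equal channel widths, the factorized form needs 12 weight slices per channel pair instead of 27, and a dilation of 2 stretches a 3-tap kernel over 5 voxels without adding weights.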

### Texture-Based Fast CU Size Decision and Intra Mode Decision Algorithm for VVC

Versatile Video Coding (VVC) is the next generation video coding standard. To improve coding efficiency over HEVC/H.265, its intra coding complexity increases significantly. The excessive encoding time makes real-time coding and hardware implementation difficult. To tackle this urgent problem, a texture-based fast CU size decision and intra mode decision algorithm is proposed in this paper. The contributions of the proposed algorithm include two aspects. (1) Aiming at the latest adopted QTMT block partitioning scheme in VVC, some redundant splitting trees and directions are skipped according to texture characteristics and differences between sub-parts of the CU. (2) A fast intra mode decision scheme is proposed which considers complexity and texture characteristics. Some hierarchical modifications are applied, including reducing the number of checked Intra Prediction Modes (IPM) and candidate modes in the Rough Modes Decision (RMD) and RD check stages respectively. Compared with the latest VVC reference software VTM-4.2, the proposed algorithm can achieve approximately 46% encoding time saving on average with only 0.91% BD-RATE increase or 0.046 BD-PSNR decrease.

Jian Cao, Na Tang, Jun Wang, Fan Liang

### An Efficient Hierarchical Near-Duplicate Video Detection Algorithm Based on Deep Semantic Features

With the rapid development of the Internet and multimedia technology, the amount of multimedia data on the Internet is escalating exponentially, which has drawn considerable research attention to Near-Duplicate Video Detection (NDVD). Motivated by the excellent performance of Convolutional Neural Networks (CNNs) in image classification, we bring the powerful discrimination ability of the CNN model to the NDVD system and propose a hierarchical detection method based on deep semantic features derived from the CNN models. The original CNN features are first extracted from the video frames, and then a semantic descriptor and a labels descriptor are obtained respectively based on the basic content unit. Finally, a hierarchical matching scheme is proposed to promote fast near-duplicate video detection. The proposed approach has been tested on the widely used CC_WEB_VIDEO dataset, and has achieved state-of-the-art results with a mean Average Precision (mAP) of 0.977.

Siying Liang, Ping Wang
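
The mean Average Precision figure reported for the near-duplicate video detection entry can be read against the standard retrieval definition, sketched below. This assumes every relevant item appears somewhere in each ranked list; the exact evaluation protocol used on CC_WEB_VIDEO is not given in the abstract, and the function names are hypothetical.

```python
def average_precision(relevance):
    """Average precision for one ranked list of 0/1 relevance flags."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


# Query 1 retrieves near-duplicates at ranks 1 and 3; query 2 at rank 2.
queries = [[1, 0, 1, 0], [0, 1]]
print(round(mean_average_precision(queries), 4))
```

A value near 1.0 means that near-duplicates are ranked ahead of almost all unrelated videos for almost every query.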

### Meta Transfer Learning for Adaptive Vehicle Tracking in UAV Videos

Vehicle tracking in UAV videos is still under-explored with deep learning methods due to the lack of well-labeled datasets. The challenges mainly come from the fact that the UAV view has much wider and more changeable landscapes, which hinders the labeling task. In this paper, we propose a meta transfer learning method for adaptive vehicle tracking in UAV videos (MTAVT), which transfers the common features across landscapes, so that it can avoid over-fitting with the limited scale of the dataset. Our MTAVT consists of two critical components: a meta learner and a transfer learner. Specifically, the meta learner is employed to adaptively learn the models to extract the shared features between ground and drone views. The transfer learner is used to learn the domain-shifted features from ground-view to drone-view datasets by optimizing the ground-view models. We further seamlessly incorporate an exemplar-memory curriculum into meta learning by leveraging the memorized models, which serves as the training guidance for sequential sampling. Hence, this curriculum can enforce the meta learner to adapt to the new sequences in the drone-view datasets without losing the previously learned knowledge. Meanwhile, we simplify and stabilize the higher-order gradient training criteria for meta learning by exploring curriculum learning in multiple stages with various domains. We conduct extensive experiments and ablation studies on four public benchmarks and an evaluation dataset from YouTube (to release soon). All the experiments demonstrate that our MTAVT has superior advantages over state-of-the-art methods in terms of accuracy, robustness, and versatility.

Wenfeng Song, Shuai Li, Yuting Guo, Shaoqi Li, Aimin Hao, Hong Qin, Qinping Zhao

### Adversarial Query-by-Image Video Retrieval Based on Attention Mechanism

Query-by-image video retrieval (QBIVR) is a difficult feature matching task across different modalities. More and more retrieval tasks require indexing the videos containing the activities in the image, which makes extracting meaningful spatio-temporal video features crucial. In this paper, we propose an approach based on adversarial learning, termed the Adversarial Image-to-Video (AIV) approach. To capture the temporal pattern of videos, we utilize temporal regions likely to contain activities via fully-convolutional 3D ConvNet features, and then obtain the video bag features by 3D RoI Pooling. To solve the mismatch issue with image vector features and identify the importance of information for videos, we add a Multiple Instance Learning (MIL) module to assign different weights to each piece of temporal information in video bags. Moreover, we utilize the triplet loss to distinguish different semantic categories and support intraclass variability of images and videos. Specifically, our AIV proposes a modality loss as an adversary to the triplet loss in the adversarial learning. The interplay between the two losses jointly bridges the domain gap across different modalities. Extensive experiments on two widely used datasets verify the effectiveness of our proposed methods as compared with other methods.

Ruicong Xu, Li Niu, Liqing Zhang
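
The triplet loss mentioned in the query-by-image video retrieval entry has a standard margin-based form, sketched here for plain feature vectors. Euclidean distance and a margin of 1.0 are assumptions chosen for illustration; the abstract gives neither, and the adversarial modality loss is not sketched because the abstract provides no formula for it.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = math.dist(anchor, positive)  # same-category pair
    d_neg = math.dist(anchor, negative)  # different-category pair
    return max(0.0, d_pos - d_neg + margin)


image_feature = [0.0, 0.0]      # query image embedding (anchor)
matching_video = [0.0, 1.0]     # video of the same activity
far_video = [3.0, 4.0]          # different activity, already far away
near_video = [0.0, 1.5]         # different activity, still too close

print(triplet_loss(image_feature, matching_video, far_video))
print(triplet_loss(image_feature, matching_video, near_video))
```

The loss is zero once the non-matching video is farther from the query than the matching one by at least the margin, so only the hard cases contribute gradient.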

### Joint Sketch-Attribute Learning for Fine-Grained Face Synthesis

The photorealism of synthetic face images has been significantly improved by generative adversarial networks (GANs). Besides realism, more accurate control over the properties of face images is desired. While sketches convey the desired shapes, attributes describe appearance. However, it remains challenging to jointly exploit sketches and attributes, which are in different modalities, to generate high-resolution photorealistic face images. In this paper, we propose a novel joint sketch-attribute learning approach to synthesize photo-realistic face images with conditional GANs. A hybrid generator is proposed to learn a unified embedding of shape from sketches and appearance from attributes for synthesizing images. We propose an attribute modulation module, which transfers user-preferred attributes to reinforce the sketch representation with appearance details. Using the proposed approach, users can flexibly manipulate the desired shape and appearance of synthesized face images with fine-grained control. We conducted extensive experiments on the CelebA-HQ dataset [16]. The experimental results demonstrate the effectiveness of the proposed approach.

Binxin Yang, Xuejin Chen, Richang Hong, Zihan Chen, Yuhang Li, Zheng-Jun Zha

### High Accuracy Perceptual Video Hashing via Low-Rank Decomposition and DWT

In this work, we propose a novel robust video hashing algorithm with high accuracy. The proposed algorithm generates a fix-up hash via low-rank and sparse decomposition and discrete wavelet transform (DWT). Specifically, the input video is converted to a randomly normalized video with a logistic map, and then content-based feature matrices are extracted from the randomly normalized video with low-rank and sparse decomposition. Finally, data compression with the LL sub-band of the 2D-DWT is applied to the feature matrices, and statistical properties of the DWT coefficients are quantized to derive a compact video hash. Experiments with 4760 videos are carried out to validate the efficiency of the proposed video hashing. The results show that the proposed video hashing is robust to many digital operations and reaches good discrimination. Receiver operating characteristic (ROC) curve comparisons indicate that the proposed video hashing achieves more desirable classification performance between robustness and discrimination than some existing algorithms.

Lv Chen, Dengpan Ye, Shunzhi Jiang
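
Two building blocks named in the video hashing entry, the logistic map and the LL sub-band of a 2D DWT, can be sketched as follows. The Haar wavelet and the map parameter values are assumptions (the abstract names neither), and how the authors apply the logistic map to normalize the video is not described, so only the map itself is shown. Function names are hypothetical.

```python
def logistic_sequence(x0, r, n):
    """First n values of the logistic map x_{k+1} = r * x_k * (1 - x_k)."""
    values, x = [], x0
    for _ in range(n):
        values.append(x)
        x = r * x * (1.0 - x)
    return values


def haar_ll(matrix):
    """LL sub-band of a one-level orthonormal 2D Haar transform.

    Each output coefficient is the sum of a 2x2 block divided by 2,
    so the result has half the rows and half the columns.
    """
    rows, cols = len(matrix), len(matrix[0])
    if rows % 2 or cols % 2:
        raise ValueError("matrix dimensions must be even")
    return [
        [
            (matrix[i][j] + matrix[i][j + 1]
             + matrix[i + 1][j] + matrix[i + 1][j + 1]) / 2.0
            for j in range(0, cols, 2)
        ]
        for i in range(0, rows, 2)
    ]


print(logistic_sequence(0.3, 3.9, 4))   # chaotic regime, seed acts like a key
print(haar_ll([[1, 2], [3, 4]]))
```

Keeping only the LL sub-band quarters the number of coefficients per level while retaining the coarse structure, which is why it suits a compact, perturbation-tolerant hash.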

### HMM-Based Person Re-identification in Large-Scale Open Scenario

This paper aims to tackle person re-identification (person re-ID) in a large-scale open scenario, which differs from conventional person re-ID tasks but is significant for some real suspect investigation cases. In the large-scale open scenario, the image background and person appearance may change immensely. A large number of irrelevant pedestrians appear in urban surveillance systems, some of whom may have a very similar appearance to the target person. Existing methods utilize only surveillance video information, which cannot solve the problem well due to the above challenges. In this paper, we observe that pedestrians’ paths from multiple spaces (such as surveillance space and geospatial space) are matched due to temporal-spatial consistency. Moreover, people have their unique behavior paths due to differences in individual behavior. Inspired by these two observations, we propose to use the association relationship of paths from surveillance space and geospatial space to solve person re-ID in the large-scale open scenario. A Hidden Markov Model based Path Association (HMM-PA) framework is presented to jointly analyze the image path and the geospatial path. In addition, according to our research scenario, we manually annotate path descriptions on two large-scale public re-ID datasets, termed Duke-PDD and Market-PDD. Comprehensive experiments on these two datasets show that the proposed HMM-PA outperforms the state-of-the-art methods.

Dongyang Li, Ruimin Hu, Wenxin Huang, Xiaochen Wang, Dengshi Li, Fei Zheng

### No Reference Image Quality Assessment by Information Decomposition

No reference (NR) image quality assessment (IQA) aims to automatically assess image quality as it would be perceived by humans, without reference images. Currently, almost all state-of-the-art NR IQA approaches are trained and tested on databases of synthetically distorted images. Synthetically distorted images are usually produced by superimposing one or several common distortions on a clean image, but authentically distorted images are often simultaneously contaminated by several unknown distortions. Therefore, the performance of most IQA approaches drops greatly on authentically distorted images. Recent research on the human brain demonstrates that the human visual system (HVS) perceives image scenes by predicting the primary information and avoiding residual uncertainty. According to this theory, a new and robust NR IQA approach is proposed in this paper. With the proposed approach, the distorted image is decomposed into an orderly part and a disorderly part, which are processed separately as its primary information and uncertainty information. Global features of the distorted image are also calculated to describe the overall image contents. Experimental results on synthetically and authentically distorted image databases demonstrate that the proposed approach makes great progress in IQA performance.

Junchen Deng, Ci Wang, Shiqi Liu