
2022 | Book

MultiMedia Modeling

28th International Conference, MMM 2022, Phu Quoc, Vietnam, June 6–10, 2022, Proceedings, Part I

Editors: Björn Þór Jónsson, Cathal Gurrin, Minh-Triet Tran, Duc-Tien Dang-Nguyen, Anita Min-Chun Hu, Binh Huynh Thi Thanh, Benoit Huet

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

The two-volume set LNCS 13141 and LNCS 13142 constitutes the proceedings of the 28th International Conference on MultiMedia Modeling, MMM 2022, which took place in Phu Quoc, Vietnam, during June 6–10, 2022.

The 107 papers presented in these proceedings were carefully reviewed and selected from a total of 212 submissions. They focus on topics related to multimedia content analysis; multimedia signal processing and communications; and multimedia applications and services.

Table of Contents

Frontmatter
Correction to: Real-Time FPGA Design for OMP Targeting 8K Image Reconstruction
Jiayao Xu, Chen Fu, Zhiqiang Zhang, Jinjia Zhou

Best Paper Session

Frontmatter
Real-time Detection of Tiny Objects Based on a Weighted Bi-directional FPN

Tiny object detection is an important and challenging subfield of object detection. Many of its applications (e.g., human tracking and marine rescue) impose tight detection time constraints: two-stage object detectors are too slow to meet real-time requirements, whereas one-stage object detectors lack sufficient detection accuracy. Consequently, enhancing the detection accuracy of one-stage object detectors has become essential for real-time tiny object detection. This work presents a novel model for real-time tiny object detection based on the one-stage object detector YOLOv5. The proposed YOLO-P4 model contains a module for detecting tiny objects and a new output prediction branch. A weighted bi-directional feature pyramid network (BiFPN) is then introduced into YOLO-P4, yielding an improved model named YOLO-BiP4 that enhances the YOLO-P4 feature input branches. The proposed models were tested on the TinyPerson dataset, demonstrating that YOLO-BiP4 outperforms the original model in detecting tiny objects. The model satisfies real-time detection needs while achieving the highest accuracy among existing one-stage object detectors.
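
To illustrate the weighted bi-directional fusion idea, here is a minimal PyTorch sketch of BiFPN-style fast normalized fusion of same-resolution feature maps; the module name, channel count, and single fused level are illustrative assumptions, not the authors' YOLO-BiP4 code.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion of same-resolution feature maps (BiFPN-style).

    Minimal illustration only; the actual YOLO-BiP4 wiring is more involved.
    """
    def __init__(self, num_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # learnable fusion weights
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.eps = eps

    def forward(self, features):
        w = torch.relu(self.weights)          # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)          # normalize so they sum to roughly one
        fused = sum(wi * f for wi, f in zip(w, features))
        return self.conv(fused)

# usage: fuse two 64-channel maps of identical spatial size
fuse = WeightedFusion(num_inputs=2, channels=64)
p4_td = torch.randn(1, 64, 40, 40)
p4_in = torch.randn(1, 64, 40, 40)
out = fuse([p4_td, p4_in])
```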

Yaxuan Hu, Yuehong Dai, Zhongxiang Wang
Multi-modal Fusion Network for Rumor Detection with Texts and Images

Currently, more and more individuals publish texts and images on social media to express their views. Inevitably, social media platforms have also become a medium for large numbers of rumors. There are a few studies on multi-modal rumor detection, but most of them simplify the fusion strategy for texts and images and ignore the rich knowledge behind images. To address these issues, this paper proposes a Multi-Modal Model with Texts and Images (M³TI) for rumor detection. Specifically, its Granularity-fusion Module (GM) learns the multi-modal representation of a tweet according to the relevance of images and texts rather than a simple concatenation fusion strategy, while its Knowledge-aware Module (KM) retrieves image knowledge through an advanced recognition method to complement the semantic representation of the image. Experimental results on two datasets (English PHEME and Chinese WeiBo) show that our M³TI model is more effective than several state-of-the-art baselines.

Boqun Li, Zhong Qian, Peifeng Li, Qiaoming Zhu
PF-VTON: Toward High-Quality Parser-Free Virtual Try-On Network

Image-based virtual try-on, which aims to transfer target clothes onto a person, has attracted increasing attention. However, existing methods rely heavily on accurate parsing results, and it remains a big challenge to generate highly realistic try-on images without a human parser. To address this issue, we propose a new Parser-Free Virtual Try-On Network (PF-VTON) that synthesizes high-quality try-on images without relying on a human parser. Compared to prior art, we introduce two key innovations. First, a new twice geometric matching module warps the pixels of the target clothes and the features of the preliminarily warped clothes to obtain the final warped clothes with realistic texture and robust alignment. Second, we design a new U-Transformer, which is highly effective for generating highly realistic images in try-on synthesis. Extensive experiments show that our system outperforms state-of-the-art methods both qualitatively and quantitatively.

Yuan Chang, Tao Peng, Ruhan He, Xinrong Hu, Junping Liu, Zili Zhang, Minghua Jiang
MF-GAN: Multi-conditional Fusion Generative Adversarial Network for Text-to-Image Synthesis

The performance of text-to-image synthesis has been significantly boosted by the development of generative adversarial network (GAN) techniques. Current GAN-based methods for text-to-image generation mainly adopt multiple generator-discriminator pairs to explore coarse- and fine-grained textual content (e.g., words and sentences), yet they only consider the semantic consistency between the text-image pair. One drawback of such a multi-stream structure is that it results in many heavyweight models; the single-stream counterpart, in turn, suffers from insufficient use of the text. To alleviate these problems, we propose a Multi-conditional Fusion GAN (MF-GAN) that reaps the benefits of both the multi-stream and single-stream methods. MF-GAN is a single-stream model but exploits both coarse- and fine-grained textual information through a conditional residual block and a dual attention block. More specifically, the sentence and word features are repeatedly fed into different model stages for textual information enhancement. Furthermore, we introduce a triplet loss to close the visual gap between the synthesized image and its positive image and to enlarge the gap to its negative image. To thoroughly verify our method, we conduct extensive experiments on the CUB and COCO benchmark datasets. Experimental results show that the proposed MF-GAN outperforms state-of-the-art methods.
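
The triplet objective described in the abstract (pull the synthesized image toward its positive image, push it away from its negative) can be written roughly as below; the cosine distance, margin value, and feature shapes are assumptions for illustration, not MF-GAN's exact formulation.

```python
import torch
import torch.nn.functional as F

def triplet_loss(synth_feat, pos_feat, neg_feat, margin: float = 0.2):
    """Hinge-style triplet loss on image embeddings.

    Pulls the synthesized image toward its positive (ground-truth) image and
    pushes it away from a negative image. The margin is an illustrative choice.
    """
    d_pos = 1.0 - F.cosine_similarity(synth_feat, pos_feat)   # distance to positive
    d_neg = 1.0 - F.cosine_similarity(synth_feat, neg_feat)   # distance to negative
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# usage with a batch of 8 feature vectors of dimension 256
loss = triplet_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
```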

Yuyan Yang, Xin Ni, Yanbin Hao, Chenyu Liu, Wenshan Wang, Yifeng Liu, Haiyong Xie

Applications 1

Frontmatter
Learning to Classify Weather Conditions from Single Images Without Labels

Weather classification from single images plays an important role in many outdoor computer vision applications, yet it has not been thoroughly studied. Although existing methods have achieved great success under the supervision of weather labels, they are hardly applicable to real-world settings because they rely on extensive human-annotated data. In this paper, we make the first attempt to treat weather classification as an unsupervised task, i.e., classifying weather conditions from single images without labels. Specifically, we propose a two-step unsupervised approach, in which weather feature learning and weather clustering are decoupled, to automatically group images into weather clusters. In weather feature learning, we employ a self-supervised task to learn semantically meaningful weather features, and introduce online triplet mining to make the features invariant to image transformations and more discriminative. In weather clustering, a learnable clustering method is designed by mining the nearest neighbors as a prior and enforcing consistent predictions for each image and its nearest neighbors. Experimental results on two public benchmark datasets indicate that our approach achieves promising performance.

Kezhen Xie, Lei Huang, Wenfeng Zhang, Qibing Qin, Zhiqiang Wei
Learning Image Representation via Attribute-Aware Attention Networks for Fashion Classification

Attribute descriptions enrich the characteristics of fashion products and play an essential role in fashion image research. We propose a fashion classification model (M2Fashion) based on multi-modal data (text + image). It uses intra-modal and inter-modal data correlation to locate relevant image regions under the guidance of attributes and an attention mechanism. Compared with traditional single-modal feature representation, learning embeddings from multi-modal features better reflects fine-grained image features. We adopt a multi-task learning framework that combines category classification and attribute prediction. Extensive experimental results on the public DeepFashion dataset show the superiority of the proposed M2Fashion over state-of-the-art methods: it achieves a +1.3% top-3 accuracy improvement in category classification and +5.6%/+3.7% top-3 recall improvements in the attribute prediction of part/shape, respectively. A supplementary attribute-specific image retrieval experiment on the DARN dataset also demonstrates the effectiveness of M2Fashion.

Yongquan Wan, Cairong Yan, Bofeng Zhang, Guobing Zou
Toward Detail-Oriented Image-Based Virtual Try-On with Arbitrary Poses

Image-based virtual try-on with arbitrary poses has attracted much attention recently. The goal is to synthesize an image of a reference person wearing target clothes in a target pose. However, it remains a challenge for existing methods to preserve clothing details and person identity while generating fine-grained try-on images. To resolve these issues, we present a new detail-oriented virtual try-on network with arbitrary poses (DO-VTON) consisting of three major modules: first, a semantic prediction module adopts a two-stage strategy to gradually predict a semantic map of the reference person; second, a spatial alignment module warps the target clothes and non-target details to align with the target pose; third, a try-on synthesis module generates the final try-on images. Moreover, to generate high-quality images, we introduce a new multi-scale dilated convolution U-Net to enlarge the receptive field and capture context information. Extensive experiments on two well-known benchmark datasets demonstrate that our system achieves state-of-the-art virtual try-on performance both qualitatively and quantitatively.

Yuan Chang, Tao Peng, Ruhan He, Xinrong Hu, Junping Liu, Zili Zhang, Minghua Jiang
Parallel DBSCAN-Martingale Estimation of the Number of Concepts for Automatic Satellite Image Clustering

The need to organise big streams of Earth Observation (EO) data requires efficient clustering of image patches, derived from satellite imagery, into groups. Since the different concepts of the satellite image patches are not known a priori, DBSCAN-Martingale can be applied to estimate the number of desired clusters. In this paper, we provide a parallel version of the DBSCAN-Martingale algorithm and a framework for clustering EO data in an unsupervised way. The approach is evaluated on a benchmark dataset of Sentinel-2 images with ground-truth annotation and is also implemented on High Performance Computing (HPC) infrastructure to demonstrate its scalability. Finally, a cost-benefit analysis is conducted to find the optimal selection of reserved nodes for running the proposed algorithm, with respect to execution time and cost.
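
A much-simplified sketch of the underlying idea, not the authors' DBSCAN-Martingale or their HPC implementation: DBSCAN is run for several sampled density thresholds in parallel, and the resulting cluster counts are combined into a rough estimate of the number of concepts.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

def clusters_at_eps(X, eps, min_samples=5):
    """Number of clusters (excluding noise) found by DBSCAN for one eps value."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return len(set(labels) - {-1})

# toy stand-in for satellite image-patch features
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)

# run DBSCAN for many sampled eps values in parallel; the Martingale algorithm
# accumulates newly appearing clusters over increasing eps, while this rough
# illustration simply takes the most frequent cluster count as the estimate
eps_values = np.sort(np.random.default_rng(0).uniform(0.2, 1.5, size=16))
counts = Parallel(n_jobs=4)(delayed(clusters_at_eps)(X, e) for e in eps_values)
estimated_k = int(np.bincount(counts).argmax())
print("estimated number of concepts:", estimated_k)
```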

Ilias Gialampoukidis, Stelios Andreadis, Nick Pantelidis, Sameed Hayat, Li Zhong, Marios Bakratsas, Dennis Hoppe, Stefanos Vrochidis, Ioannis Kompatsiaris

Multimedia Applications - Perspectives, Tools and Applications (Special Session) and Brave New Ideas

Frontmatter
AI for the Media Industry: Application Potential and Automation Levels

Tools based on artificial intelligence (AI) are increasingly used in the media industry, addressing a potentially wide range of application areas. Based on a survey involving media professionals and technology providers, we present a taxonomy of application areas of AI in the media industry, including an assessment of the maturity of AI technology for the respective application. As many of these applications require human oversight, either due to insufficient maturity of technology or the need for editorial control, we also propose a classification of automation levels for AI in the media domain, with examples for different stages of the media value chain. Both of these aspects are strongly linked to the role of human users and their interaction with AI technologies. The results suggest that human-AI collaboration in media applications is still an unsolved research question.

Werner Bailer, Georg Thallinger, Verena Krawarik, Katharina Schell, Victoria Ertelthalner
Rating-Aware Self-Organizing Maps

Self-organizing maps (SOM) are one of the prominent paradigms for 2D data visualization. By aiming to preserve the topological relations of high-dimensional data, they provide a sufficiently organized view of objects and thus improve users' ability to explore the displayed information. SOMs have also been extensively utilized to visualize the results of multimedia information retrieval systems. However, for this task, the SOM lacks the ability to adapt to the relevance scores induced by the underlying retrieval algorithm. Therefore, although the exploration capability is enhanced, the capability to exploit the (best) results is severely limited. To cope with this problem, we propose a rating-aware modification of the SOM algorithm that jointly optimizes for preserving both the topological ordering and the relevance-based ordering of results.

Ladislav Peška, Jakub Lokoč
Color the Word: Leveraging Web Images for Machine Translation of Untranslatable Words

Automatic translation allows people around the globe to communicate with one another. However, state-of-the-art machine translation is still unable to capture fine-grained meaning. This paper introduces the idea of using Web image selections in text-to-text translation, specifically for lacunae, words that have no translation in another language. We asked professional human translators to rank Google Translate translations of lacunae in German and Dutch. We then compared that ranking with a ranking based on color histograms of Web images of the words. We found that there is viable potential in using images to address lacunae in machine translation. We publicly release a dataset and our code for others to explore this potential. Finally, we provide an outlook on research directions that would allow this idea to be used in practice.
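
A small sketch of the histogram side of this idea, with hypothetical file paths: each word is represented by the average color histogram of its Web images, and candidate translations could then be ranked by histogram distance. This is our own illustration, not the authors' released code.

```python
import numpy as np
from PIL import Image

def color_histogram(path, bins=8):
    """Normalized joint RGB histogram of one image."""
    img = np.asarray(Image.open(path).convert("RGB"))
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def word_histogram(image_paths):
    """Average histogram over the Web images retrieved for one word."""
    return np.mean([color_histogram(p) for p in image_paths], axis=0)

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (smaller = more similar)."""
    return float(np.abs(h1 - h2).sum())

# hypothetical usage: rank candidate translations of a source word by distance
# source_hist = word_histogram(["gezellig_1.jpg", "gezellig_2.jpg"])
# candidates = {"cozy": [...], "sociable": [...]}
# ranking = sorted(candidates,
#                  key=lambda w: histogram_distance(source_hist,
#                                                   word_histogram(candidates[w])))
```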

Yana van de Sande, Martha Larson

Activities and Events

Frontmatter
MGMP: Multimodal Graph Message Propagation Network for Event Detection

Multimodal event detection plays a pivotal role in social media analysis, yet it remains challenging due to the large differences between images and texts, noisy contexts, and the intricate correspondence between modalities. To address these issues, we introduce the multimodal graph message propagation network (MGMP), a layer-wise approach that aggregates multi-view context and integrates images and texts simultaneously. In particular, MGMP constructs visual and textual graphs and employs a graph neural network (GNN) with element-wise attention to propagate context while avoiding the transfer of negative knowledge; multimodal similarity propagation (MSP) then propagates complementarity to fuse images and texts. We evaluate MGMP on two public datasets, CrisisMMD and SED2014. Extensive experiments demonstrate the effectiveness and superiority of our method.

Jiankai Li, Yunhong Wang, Weixin Li
Pose-Enhanced Relation Feature for Action Recognition in Still Images

Due to the lack of motion information, action recognition in still images is considered a challenging task. Previous works focused on contextual information in the image, including human pose and surrounding objects, but they rarely consider the relation between the local pose and the entire human body, so poses related to the action are not fully utilized. In this paper, we propose a solution for action recognition in still images that makes complete and effective use of pose information. A carefully designed multi-key-point calculation method generates pose regions that explicitly include possible actions. The extensible Pose-Enhanced Relation Module extracts the implicit relation between pose and human body and outputs the Pose-Enhanced Relation Feature, which has powerful representation capabilities. Surrounding-object information is also applied to strengthen the solution. Experiments show that the proposed solution exceeds the state-of-the-art performance on two commonly used datasets, PASCAL VOC 2012 Action and Stanford 40 Actions. Visualization shows that the proposed solution enables the network to pay more attention to the pose regions related to the action.

Jiewen Wang, Shuang Liang
Prostate Segmentation of Ultrasound Images Based on Interpretable-Guided Mathematical Model

Ultrasound prostate segmentation is challenging due to the low contrast of transrectal ultrasound (TRUS) images and the presence of imaging artifacts such as speckle and shadow regions. In this work, we propose an improved principal-curve-based and differential-evolution-based ultrasound prostate segmentation method (H-SegMod) built on an interpretable-guided mathematical model. Compared with existing studies, H-SegMod has three main merits and contributions: (1) it exploits the characteristic of the principal curve of automatically approaching the center of the dataset; (2) when acquiring the data sequences, it uses the principal-curve-based constraint closed polygonal segment model, with different initialization, normalization, and vertex filtering methods; (3) we propose a mathematical map model (realized by a differential-evolution-based neural network) that describes the smooth prostate contour represented by the output of the neural network (i.e., optimized vertices) so that it matches the ground-truth contour. Compared with the traditional differential evolution method, we add different mutation steps and loop constraint conditions. Both quantitative and qualitative evaluations on a clinical prostate dataset show that our method achieves better segmentation than many state-of-the-art methods.

Tao Peng, Caiyin Tang, Jing Wang
Spatiotemporal Perturbation Based Dynamic Consistency for Semi-supervised Temporal Action Detection

Temporal action detection usually relies on huge tagging costs to achieve significant performance. Semi-supervised learning, where only a small amount of the training data is annotated, can help reduce the labeling burden. However, existing action detection models inevitably learn inductive bias from the limited labeled data, which hinders the effective use of unlabeled data in semi-supervised learning. To this end, we propose a generic end-to-end framework for Semi-Supervised Temporal Action Detection (SS-TAD). Specifically, the framework is based on a teacher-student structure that leverages the consistency between unlabeled data and their augmentations. To achieve this, we propose a dynamic consistency loss that employs an attention mechanism to alleviate the prediction bias of the model, so it can make full use of the unlabeled data. Besides, we design a concise yet effective spatiotemporal feature perturbation module to learn robust action representations. Experiments on THUMOS14 and ActivityNet v1.2 demonstrate that our method significantly outperforms state-of-the-art semi-supervised methods and is even comparable to fully supervised methods.

Lin Wang, Yan Song, Rui Yan, Xiangbo Shu

Multimedia Datasets for Repeatable Experimentation (Special Session)

Frontmatter
A Task Category Space for User-Centric Comparative Multimedia Search Evaluations

In the last decade, user-centric video search competitions have facilitated the evolution of interactive video search systems. So far, these competitions focused on a small number of search task categories, with few attempts to change task category configurations. Based on our extensive experience with interactive video search contests, we have analyzed the spectrum of possible task categories and propose a list of individual axes that define a large space of possible task categories. Using this concept of category space, new user-centric video search competitions can be designed to benchmark video search systems from different perspectives. We further analyse the three task categories considered so far at the Video Browser Showdown and discuss possible (but sometimes challenging) shifts within the task category space.

Jakub Lokoč, Werner Bailer, Kai Uwe Barthel, Cathal Gurrin, Silvan Heller, Björn Þór Jónsson, Ladislav Peška, Luca Rossetto, Klaus Schoeffmann, Lucia Vadicamo, Stefanos Vrochidis, Jiaxin Wu
GPR1200: A Benchmark for General-Purpose Content-Based Image Retrieval

Even though it has been shown extensively that retrieval-specific training of deep neural networks is beneficial for nearest-neighbor image search quality, most of these models are trained and tested in the domain of landmark images. However, some applications use images from various other domains and therefore need a network with good generalization properties - a general-purpose CBIR model. To the best of our knowledge, no testing protocol has so far been introduced to benchmark models with respect to general image retrieval quality. After analyzing popular image retrieval test sets, we decided to manually curate GPR1200, an easy-to-use and accessible yet challenging benchmark dataset with a broad range of image categories. This benchmark is subsequently used to evaluate various pretrained models of different architectures on their generalization qualities. We show that large-scale pretraining significantly improves retrieval performance and present experiments on how to further increase these properties by appropriate fine-tuning. With these promising results, we hope to increase interest in the research topic of general-purpose CBIR.

Konstantin Schall, Kai Uwe Barthel, Nico Hezel, Klaus Jung
LLQA - Lifelog Question Answering Dataset

Recollecting details from lifelog data involves a higher level of granularity and reasoning than a conventional lifelog retrieval task. Investigating Question Answering (QA) on lifelog data could help with human memory recollection, as well as improve traditional lifelog retrieval systems. However, there is not yet a standardised benchmark dataset for lifelog-based QA. In order to provide a first dataset and baseline benchmark for QA on lifelog data, we present a novel dataset, LLQA, an augmented 85-day lifelog collection that includes over 15,000 multiple-choice questions. We also provide different baselines for the evaluation of future works. The results show that lifelog QA is a challenging task that requires further exploration. The dataset is publicly available at https://github.com/allie-tran/LLQA.

Ly-Duyen Tran, Thanh Cong Ho, Lan Anh Pham, Binh Nguyen, Cathal Gurrin, Liting Zhou

Learning

Frontmatter
Category-Sensitive Incremental Learning for Image-Based 3D Shape Reconstruction

Recovering the three-dimensional shape of an object from a two-dimensional image is an important research topic in computer vision. Traditional methods use stereo vision or inter-image matching to obtain geometric information about the object, but they require more than one image as input and are therefore more demanding. Recently, CNN-based approaches have enabled reconstruction from only a single image. However, they rely on the limited categories of objects in large-scale datasets, which restricts their scope of application. In this paper, we propose an incremental 3D reconstruction method: when new categories of interest are labeled and provided, we can fine-tune the network to meet the new needs while retaining old knowledge. To achieve this, we introduce category-wise and instance-wise contrastive losses and an energy-based classification loss. They help the network distinguish between different categories, especially when faced with new ones, and preserve the uniqueness and variability of the predictions generated for different instances. Extensive experiments demonstrate the soundness and feasibility of our approach. We hope our work will attract further research.

Yijie Zhong, Zhengxing Sun, Shoutong Luo, Yunhan Sun, Wei Zhang
AdaConfigure: Reinforcement Learning-Based Adaptive Configuration for Video Analytics Services

The configuration in video analytics defines parameters such as frame rate, image resolution, and model selection for the video analytics pipeline, and thus determines inference accuracy and resource consumption. Traditional solutions select a configuration either statically (i.e., the same configuration is used all the time) or by periodic brute-force search (i.e., periodically trying different configurations and selecting the one with the best performance), and thus suffer either low inference accuracy or high computation cost to find a proper configuration in time. To this end, we propose AdaConfigure, a video analytics configuration adaptation framework that dynamically selects the video configuration without resource-consuming exploration. First, we design a reinforcement learning-based framework in which an agent adaptively chooses the configuration according to the spatial and temporal features of the current video stream. In particular, we use a video segmentation strategy to capture the characteristics of the video stream at much-reduced computation cost: profiling uses only 0.2–2% of the computation resources compared to a full video. Second, we design a reward function that considers both inference accuracy and computation resource consumption so that the chosen configuration achieves a good accuracy-resource trade-off. Our evaluation experiments on an object detection task show that our approach outperforms the baseline: it achieves 10–35% higher accuracy with a similar amount of computation resources, or achieves similar accuracy with only 10–50% of the computation resources.
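
The accuracy/resource trade-off in the reward can be sketched as a single function; the linear form, budget, and trade-off weight below are assumptions for illustration rather than AdaConfigure's exact reward.

```python
def configuration_reward(accuracy: float, gpu_seconds: float,
                         budget: float = 1.0, trade_off: float = 0.5) -> float:
    """Reward an RL agent for a chosen video-analytics configuration.

    Higher inference accuracy is rewarded; computation above the budget is
    penalized. The linear form and trade_off weight are illustrative only.
    """
    resource_penalty = max(0.0, gpu_seconds - budget) / budget
    return accuracy - trade_off * resource_penalty

# e.g. a cheap low-resolution configuration vs. an expensive high-accuracy one
print(configuration_reward(accuracy=0.78, gpu_seconds=0.4))   # under budget
print(configuration_reward(accuracy=0.90, gpu_seconds=2.0))   # over budget, penalized
```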

Zhaoliang He, Yuan Wang, Chen Tang, Zhi Wang, Wenwu Zhu, Chenyang Guo, Zhibo Chen
Mining Minority-Class Examples with Uncertainty Estimates

In the real world, the frequency of occurrence of objects is naturally skewed, forming long-tail class distributions, which results in poor performance on the statistically rare classes. A promising solution is to mine tail-class examples to balance the training dataset. However, mining tail-class examples is a very challenging task. For instance, most otherwise successful uncertainty-based mining approaches struggle due to the distortion of class probabilities resulting from the skewness in the data. In this work, we propose an effective yet simple approach to overcome these challenges. Our framework enhances the subdued tail-class activations and, thereafter, uses a one-class, data-centric approach to effectively identify tail-class examples. We carry out an exhaustive evaluation of our framework on three datasets spanning two computer vision tasks. Substantial improvements in minority-class mining and in the fine-tuned model's task performance strongly corroborate the value of our method.

Gursimran Singh, Lingyang Chu, Lanjun Wang, Jian Pei, Qi Tian, Yong Zhang
Conditional Context-Aware Feature Alignment for Domain Adaptive Detection Transformer

Detection transformers have recently gained increasing attention due to their competitive performance and end-to-end pipeline. However, they suffer a significant performance drop when the test and training data are drawn from different distributions. Existing domain adaptive detection transformer methods adopt feature distribution alignment to alleviate the domain gaps. While effective, they ignore the class semantics and the rich context preserved in the attention mechanism during adaptation, which leads to inferior performance. To tackle these problems, we propose Conditional Context-aware Feature Alignment (CCFA) for domain adaptive detection transformers. Specifically, a context-aware feature alignment module maps the high-dimensional context into a low-dimensional space so that the rich context can be utilized for distribution alignment without optimization difficulty. Moreover, a conditional distribution alignment module aligns features of the same object class from different domains, which better preserves discriminability during adaptation. Experiments on three common benchmarks demonstrate CCFA's superiority over state-of-the-art methods.

Siyuan Chen

Multimedia for Medical Applications (Special Session)

Frontmatter
Human Activity Recognition with IMU and Vital Signs Feature Fusion

Combining data from different sources into an integrated view is a recent trend that takes advantage of the evolution of the Internet of Things (IoT) over the last years. The fusion of different modalities has applications in various fields, including healthcare and security systems. Human activity recognition (HAR) is among the most common applications of a healthcare or eldercare system. Wearable inertial measurement unit (IMU) sensors, such as accelerometers and gyroscopes, are often utilized for HAR. In this paper, we investigate the performance of wearable IMU sensors together with vital signs sensors for HAR. A massive feature extraction, including both time- and frequency-domain features and transitional features for the vital signs, was performed along with a feature selection method. Classification algorithms and different early and late fusion methods were applied to a public dataset. Experimental results revealed that both IMU and vital signs achieve reasonable HAR accuracy and F1-score across all classes. Feature selection significantly reduced the number of IMU and vital signs features while also improving classification accuracy. The early and late fusion methods performed better than each modality alone, reaching an accuracy of up to 95.32%.
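
A minimal sketch of decision-level (late) fusion, assuming one feature matrix per modality for the same activity windows: one classifier is trained per modality and their class probabilities are averaged. The classifier choice and equal weights are illustrative, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def late_fusion_predict(X_imu, X_vitals, y, X_imu_test, X_vitals_test):
    """Train one classifier per modality and average their class probabilities.

    A simple illustration of decision-level fusion; the fusion weights could
    also be learned instead of the plain average used here.
    """
    clf_imu = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_imu, y)
    clf_vit = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_vitals, y)
    proba = 0.5 * clf_imu.predict_proba(X_imu_test) + 0.5 * clf_vit.predict_proba(X_vitals_test)
    return proba.argmax(axis=1)

# toy data: 200 windows, 40 IMU features, 12 vital-sign features, 5 activities
rng = np.random.default_rng(0)
X_imu, X_vit = rng.normal(size=(200, 40)), rng.normal(size=(200, 12))
y = rng.integers(0, 5, size=200)
preds = late_fusion_predict(X_imu[:150], X_vit[:150], y[:150], X_imu[150:], X_vit[150:])
```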

Vasileios-Rafail Xefteris, Athina Tsanousa, Thanassis Mavropoulos, Georgios Meditskos, Stefanos Vrochidis, Ioannis Kompatsiaris
On Assisting Diagnoses of Pareidolia by Emulating Patient Behavior

The pareidolia phenomenon is a discriminating characteristic of psychiatric disorders, expressed through visual illusions seen by patients. Typically, it is diagnosed through the noise pareidolia test, which is time-consuming for both patients and experts. In this research, we propose a novel computer-assisted method to identify the pareidolia phenomenon. The idea is to emulate patient behavior with face detection models so that they respond to noise pareidolia tests in the same way as patients. Unlike most medical image analysis tasks, for psychiatric disorders the ground truth varies from patient to patient, which makes this challenging. For a set of training patients, we fine-tune reference models to respond to noise pareidolia test items in the same way as each individual patient. A new test patient is then identified by comparing their behavior to the reference models using a distance function in a trained embedding space. The experiments demonstrate the effectiveness of the proposed method. Furthermore, we show that our method can improve the efficiency of the clinical noise pareidolia test by reducing the number of necessary test images while reaching a comparably high accuracy.

Zhaohui Zhu, Marc A. Kastner, Shin’ichi Satoh
Using Explainable AI to Identify Differences Between Clinical and Experimental Pain Detection Models Based on Facial Expressions

Most currently available pain datasets use two types of pain stimuli: pain in people with clinically diagnosed conditions (e.g. surgery) performing tasks that cause them pain (we call this clinical pain), and pain caused by external stimuli such as heat or electricity (we call this experimental pain). In high-risk domains like healthcare, understanding the decisions and limitations of various types of pain recognition models is pivotal for the acceptance of the technology. In this paper, we present a process based on Explainable Artificial Intelligence techniques to investigate the differences in the learned representations of models trained on experimental pain (BioVid heat pain dataset) and clinical pain (UNBC shoulder pain dataset). To this end, we first train two convolutional neural networks - one for each dataset - to automatically discern between pain and no pain. Next, we perform a cross-dataset evaluation, i.e., we evaluate the performance of the heat pain model on images from the shoulder pain dataset and vice versa. Then, we use Layer-wise Relevance Propagation to analyze which parts of the images in our test sets were relevant for each pain model. Based on this analysis, we rely on visual inspection by a human observer to generate hypotheses about learned concepts that distinguish the two models. We then test those hypotheses quantitatively using concept embedding analysis methods. Through this process, we identify (1) a concept that the clinical pain model relies on more strongly and (2) a concept that the experimental pain model pays more attention to. Finally, we discuss how both of these concepts are involved in known pain patterns and can be attributed to behavioral differences found in the datasets.

Pooja Prajod, Tobias Huber, Elisabeth André

Applications 2

Frontmatter
Double Granularity Relation Network with Self-criticism for Occluded Person Re-identification

Occluded person re-identification is still a challenge. Most existing methods capture visible human parts based on external cues, such as human pose and semantic masks. In this paper, we propose a double granularity relation network with self-criticism to locate visible human parts. We learn the region-wise relation between part and whole and the pixel-wise relation between pixel and whole. These relations find non-occluded human body parts and exclude noisy information. To guide the relation learning, we introduce two relation critic losses, which score the parts and maximize performance by imposing higher weights on large parts and lower weights on small parts. We design a double-branch model based on the proposed critic loss and evaluate it on popular benchmarks. The experimental results show the superiority of our method, which achieves mAP of 51.0% and 75.4% on Occluded-DukeMTMC and P-DukeMTMC-reID, respectively. Our code is available at DRNC.

Xuena Ren, Dongming Zhang, Xiuguo Bao, Lei Shi
A Complementary Fusion Strategy for RGB-D Face Recognition

RGB-D face recognition (FR) with low-quality depth maps has recently come to play an important role in biometric identification. The intrinsic geometric properties and shape cues reflected by depth information significantly improve FR robustness to light and pose variations. However, existing multi-modal fusion methods mostly lack the ability to learn complementary features and to establish correlated relationships between different facial features. In this paper, we propose a Complementary Multi-Modal Fusion Transformer (CMMF-Trans) network that complements the fusion while preserving modality-specific properties. In addition, the proposed novel tokenization and self-attention modules enable the Transformer to capture long-range dependencies that supplement the local representations of face areas. We test our model on two public datasets, Lock3DFace and IIIT-D, which contain challenging variations in pose, occlusion, expression, and illumination. Our strategy achieves state-of-the-art performance on both. Another meaningful contribution of our work is a new, challenging RGB-D FR dataset that covers more difficult scenarios, such as mask occlusion and backlight shadow.

Haoyuan Zheng, Weihang Wang, Fei Wen, Peilin Liu
Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection

RGB-D object detection is a fundamental yet challenging task due to the inherent differences between RGB and depth information. In this paper, we propose a Multi-scale Cross-modal Transformer Network (MCTNet) consisting of two well-designed components: a Multi-modal Feature Pyramid module (MFP) and a Cross-Modal Transformer (CMTrans). Specifically, we introduce the MFP to enrich high-level semantic features with geometric information and enhance low-level geometric clues with semantic features, which is shown to facilitate further cross-modal feature fusion. Furthermore, we develop the CMTrans to effectively exploit long-range attention between the enhanced RGB and depth features, enabling the network to focus on regions of interest. Extensive experiments show that MCTNet surpasses state-of-the-art detectors by 1.6% mAP on SUN RGB-D and 1.0% mAP on NYU Depth v2, demonstrating the effectiveness of the proposed method.

Zhibin Xiao, Pengwei Xie, Guijin Wang
Joint Re-Detection and Re-Identification for Multi-Object Tracking

Within the tracking-by-detection framework, multi-object tracking (MOT) has always been plagued by missed detections. To address this problem, existing methods usually first predict new positions for the trajectories to provide more candidate bounding boxes (BBoxes) and then use non-maximum suppression (NMS) to eliminate redundant BBoxes. However, when two BBoxes belonging to different objects have a significant intersection over union (IoU) due to occlusion, NMS mistakenly filters out the one with the lower confidence score, and these methods ignore the missed detections caused by NMS. We propose a joint re-detection and re-identification tracker (JDI) for MOT, consisting of two components: trajectory re-detection and NMS with re-identification (ReID). Specifically, trajectory re-detection predicts the new position of a trajectory based on feature matching, which is more reliable than a motion model (MM). Furthermore, we propose to embed ReID features into NMS and take the similarity of the ReID features as an additional necessary condition for deciding whether two BBoxes belong to the same object. Based on the overlap computed by IoU and the similarity of ReID features, accurate filtering can be achieved through double-checking. We demonstrate the effectiveness of our tracking components with ablative experiments and surpass state-of-the-art methods on the three tracking benchmarks MOT16, MOT17, and MOT20.
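
The "NMS with ReID" idea, suppressing a box only when both the overlap and the appearance similarity are high, can be sketched as a greedy procedure like the one below; the thresholds and feature dimensionality are illustrative, and this is not the authors' implementation.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def reid_nms(boxes, scores, feats, iou_thr=0.5, sim_thr=0.7):
    """Greedy NMS that also checks cosine similarity of ReID features.

    A box is suppressed only if it overlaps a kept box strongly AND looks like
    the same identity; overlapping boxes of different identities both survive.
    """
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-9)
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        suppressed = False
        for j in keep:
            if iou(boxes[i], boxes[j]) > iou_thr and float(feats[i] @ feats[j]) > sim_thr:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return keep

# two heavily overlapping boxes with dissimilar ReID features are both kept
boxes = np.array([[0, 0, 100, 200], [10, 0, 110, 200]], dtype=float)
kept = reid_nms(boxes, np.array([0.9, 0.8]), np.random.randn(2, 128))
```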

Jian He, Xian Zhong, Jingling Yuan, Ming Tan, Shilei Zhao, Luo Zhong

Multimedia Analytics for Contextual Human Understanding (Special Session)

Frontmatter
An Investigation into Keystroke Dynamics and Heart Rate Variability as Indicators of Stress

Lifelogging has become a prominent research topic in recent years. Wearable sensors like Fitbits and smart watches are now increasingly popular for recording one's activities. Some researchers are also exploring keystroke dynamics for lifelogging. Keystroke dynamics refers to the process of measuring and assessing a person's typing rhythm on digital devices. A digital footprint is created when a user interacts with devices like keyboards, mobile phones or touch screen panels, and the timing of the keystrokes is unique to each individual, though likely to be affected by factors such as fatigue, distraction or emotional stress. In this work, we explore the relationship between keystroke dynamics, as measured by the timing of the top-10 most frequently occurring bigrams in English, and the emotional state and stress of an individual, as measured by heart rate variability (HRV). We collected keystroke data using the Loggerman application while HRV was simultaneously gathered. With this data we performed an analysis to determine the relationship between variations in keystroke dynamics and variations in HRV. Our conclusion is that we need a more detailed representation of keystroke timing than the top-10 bigrams, probably personalised to each user.

Srijith Unni, Sushma Suryanarayana Gowda, Alan F. Smeaton
Fall Detection Using Multimodal Data

In recent years, the occurrence of falls has increased and has had detrimental effects on older adults. Therefore, various machine learning approaches and datasets have been introduced to construct efficient fall detection algorithms for the community. This paper studies the fall detection problem based on a large public dataset, namely the UP-Fall Detection Dataset, which was collected from a dozen volunteers using different sensors and two cameras. We propose several techniques to obtain valuable features from these sensors and cameras and then construct suitable models for the main problem. The experimental results show that our proposed methods can surpass the state-of-the-art methods on this dataset in terms of accuracy, precision, recall, and F1-score.

Thao V. Ha, Hoang Nguyen, Son T. Huynh, Trung T. Nguyen, Binh T. Nguyen
Prediction of Blood Glucose Using Contextual LifeLog Data

In this paper, we describe a novel approach to predicting human blood glucose levels by analysing rich biometric and contextual data from a pioneering lifelog dataset. Numerous prediction models (RF, SVM, XGBoost and Elastic-Net) along with different combinations of input attributes are compared. As a further contribution, an efficient stacking ensemble over multiple combinations of prediction models was also implemented. It was found that XGBoost outperformed the three other models and that the stacking ensemble method further improved performance.
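
A minimal sketch of such a stacking ensemble with scikit-learn; the synthetic data, base models, and hyperparameters are placeholders (GradientBoostingRegressor stands in for XGBoost to avoid an extra dependency), not the paper's setup.

```python
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              StackingRegressor)
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.svm import SVR
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

# stand-in data; in the paper the features come from lifelog/biometric attributes
X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=0)

# base models are combined by a meta-learner trained on their out-of-fold predictions
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("svr", SVR(C=10.0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),  # xgboost.XGBRegressor could be swapped in
        ("enet", ElasticNet(alpha=0.1)),
    ],
    final_estimator=Ridge(),
    cv=5,
)
print(cross_val_score(stack, X, y, cv=5, scoring="neg_root_mean_squared_error").mean())
```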

Tenzin Palbar, Manoj Kesavulu, Cathal Gurrin, Renaat Verbruggen
Multimodal Embedding for Lifelog Retrieval

Nowadays, research on lifelog retrieval is attracting increasing attention, with a focus on applying machine learning, especially for the data annotation/enrichment that is necessary to facilitate effective retrieval. In this paper, we propose two annotation approaches that apply state-of-the-art text/visual and joint embedding technologies to lifelog query-text retrieval tasks. Both approaches are evaluated on the commonly used NTCIR13-lifelog dataset, and the results demonstrate that the embedding techniques improve retrieval accuracy over conventional text matching methods.

Liting Zhou, Cathal Gurrin

Applications 3

Frontmatter
A Multiple Positives Enhanced NCE Loss for Image-Text Retrieval

Image-Text Retrieval (ITR) enables users to retrieve relevant content across modalities and has attracted considerable attention. Existing approaches typically use contrastive loss functions to conduct contrastive learning in a common embedding space, pulling semantically related pairs closer while pushing unrelated pairs apart. However, we argue that this behaviour is too strict: these approaches neglect the inherent misalignments arising from potentially semantically related samples. For example, more than one positive sample commonly exists in the current batch for a given query, yet previous methods push them apart even when they are semantically related, which leads to a sub-optimal and contradictory optimization direction and thereby decreases retrieval performance. In this paper, a Multiple Positives Enhanced Noise Contrastive Estimation learning objective is proposed to alleviate this diversion noise by leveraging and jointly optimizing multiple positive pairs for each sample in a mini-batch. We demonstrate the effectiveness of our approach on the MS-COCO and Flickr30K datasets for image-to-text and text-to-image retrieval.
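
One common way to let an InfoNCE-style objective handle several positives per query is to average the log-probabilities over all positives instead of treating the extra positives as negatives; the sketch below illustrates that general idea and is not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def multi_positive_nce(img_emb, txt_emb, pos_mask, temperature: float = 0.07):
    """InfoNCE-style loss that treats all semantically related texts as positives.

    pos_mask[i, j] = 1.0 if text j is a valid positive for image i. For each
    image, the log-probability is averaged over all its positives instead of
    pushing the extra positives away as hard negatives.
    """
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature          # (B_img, B_txt) similarities
    log_prob = F.log_softmax(logits, dim=1)
    pos_log_prob = (log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -pos_log_prob.mean()

# usage: a batch of 4 images and 4 captions where items 0 and 1 share semantics
emb_i, emb_t = torch.randn(4, 256), torch.randn(4, 256)
mask = torch.eye(4, dtype=torch.bool)
mask[0, 1] = mask[1, 0] = True
loss = multi_positive_nce(emb_i, emb_t, mask.float())
```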

Yi Li, Dehao Wu, Yuesheng Zhu
SAM: Self Attention Mechanism for Scene Text Recognition Based on Swin Transformer

Scene text recognition, which detects and recognizes text in images, has attracted extensive research interest. Attention-based methods for scene text recognition have achieved competitive performance. In these methods, the attention mechanism is usually combined with RNN structures as a module to predict the results. However, RNN attention-based methods can be hard to train to convergence due to vanishing/exploding gradients, and RNNs cannot be computed in parallel. To remedy this issue, we propose a Swin Transformer-based encoder-decoder mechanism, which relies entirely on the self attention mechanism (SAM) and can be computed in parallel. SAM is an efficient text recognizer formed by only two components: 1) an encoder based on the Swin Transformer that extracts the visual information of the input image, and 2) a Transformer-based decoder that obtains the final results by applying self attention to the output of the encoder. Considering that the scale of scene text varies greatly across images, we apply the Swin Transformer to compute visual features with shifted windows, which allows self attention computation to cross window boundaries while limiting it to non-overlapping local windows. Our method improves the accuracy over previous methods on ICDAR2003, ICDAR2013, SVT, SVT-P, CUTE and ICDAR2015 by 0.9%, 3.2%, 0.8%, 1.3%, 1.7% and 1.1%, respectively. In particular, our method achieves the fastest prediction time of 0.02 s per image.

Xiang Shuai, Xiao Wang, Wei Wang, Xin Yuan, Xin Xu
JVCSR: Video Compressive Sensing Reconstruction with Joint In-Loop Reference Enhancement and Out-Loop Super-Resolution

Taking advantage of spatial and temporal correlations, deep learning-based video compressive sensing reconstruction (VCSR) technologies have tremendously improved reconstructed video quality. Existing VCSR works mainly focus on improving deep learning-based motion compensation without optimizing local and global information, leaving much room for further improvement. This paper proposes a video compressive sensing reconstruction method with joint in-loop reference enhancement and out-loop super-resolution (JVCSR), focusing on removing reconstruction artifacts and increasing the resolution simultaneously. In the in-loop part, the enhanced frame is utilized as a reference to improve the recovery performance of the current frame. Furthermore, we are the first to propose out-loop super-resolution for VCSR to obtain high-quality images at low bitrates. As a result, JVCSR obtains an average improvement of 1.37 dB PSNR compared with state-of-the-art compressive sensing methods at the same bitrate.

Jian Yang, Chi Do-Kim Pham, Jinjia Zhou
Point Cloud Upsampling via a Coarse-to-Fine Network

Point clouds captured by 3D scanning are usually sparse and noisy, and reconstructing a high-resolution 3D model of an object is a challenging task in computer vision. Recent point cloud upsampling approaches aim to generate a dense point set that achieves both distribution uniformity and proximity-to-surface directly via an end-to-end network. Although dense reconstruction from low to high resolution can be realized with these techniques, the dense outputs lack abundant details. In this work, we propose a coarse-to-fine network, PUGL-Net, for point cloud reconstruction that first predicts a coarse high-resolution point cloud via a global dense reconstruction module and then adds detail by aggregating local point features. On the one hand, a transformer-based mechanism in the global dense reconstruction module aggregates residual learning in a self-attention scheme for effective global feature extraction. On the other hand, a local refinement module learns per-point coordinate offsets, further refining the coarse points by aggregating KNN features. Extensive quantitative and qualitative evaluation on a synthetic dataset shows that the proposed coarse-to-fine architecture generates point clouds that are accurate, uniform and dense, outperforming most existing state-of-the-art point cloud reconstruction works.

Yingrui Wang, Suyu Wang, Longhua Sun

Image Analytics

Frontmatter
Arbitrary Style Transfer with Adaptive Channel Network

Arbitrary style transfer aims to obtain a brand-new stylized image by adding arbitrary artistic style elements to the original content image. It is difficult for recent arbitrary style transfer algorithms to recover enough content information while maintaining good stylization characteristics; the balance between style information and content information is the main difficulty. Moreover, these algorithms tend to generate fuzzy blocks, color spots and other defects in the image. In this paper, we propose an arbitrary style transfer algorithm based on an adaptive channel network (AdaCNet), which can flexibly select specific channels for style conversion to generate stylized images. In our algorithm, we introduce a content reconstruction loss to maintain local structure invariance, and a new style consistency loss that improves the stylization effect and style generalization ability. Experimental results show that, compared with other advanced methods, our algorithm maintains the balance between style information and content information, eliminates defects such as blurry blocks, and also performs well on style generalization and high-resolution image transfer.

Yuzhuo Wang, Yanlin Geng
Fast Single Image Dehazing Using Morphological Reconstruction and Saturation Compensation

Despite their effective dehazing performance, single image dehazing methods based on the dark channel prior (DCP) still suffer from slightly dark dehazing results and oversaturated sky regions. An improved single image dehazing method, which combines image enhancement techniques with the DCP model, is proposed to overcome this deficiency. First, we analyze that the darker results are mainly caused by air-light overestimation due to bright ambient light and white objects. The air-light estimation is therefore modified by combining morphological reconstruction with the DCP. Next, we derive that appropriately increasing the saturation component can compensate for transmission underestimation, which further alleviates the oversaturation. Finally, the image dehazed with the modified air-light and transmission is refined by a linear intensity transformation to improve contrast. Extensive experiments validate the proposed method, which is on par with and even outperforms state-of-the-art methods in subjective and objective evaluations.
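
A rough sketch of dark-channel computation and air-light estimation in which an opening-by-reconstruction suppresses small bright regions before the brightest pixels are selected; this is only a stand-in for the paper's morphological-reconstruction step, with an illustrative patch size and percentile.

```python
import numpy as np
from scipy.ndimage import minimum_filter, grey_erosion
from skimage.morphology import reconstruction

def dark_channel(img, patch=15):
    """Dark channel prior: per-pixel minimum over RGB, then a local min filter."""
    return minimum_filter(img.min(axis=2), size=patch)

def estimate_airlight(img, patch=15):
    """Estimate air-light from the brightest 0.1% of dark-channel pixels,
    after an opening-by-reconstruction that suppresses small bright objects
    (a rough stand-in for the paper's morphological-reconstruction step)."""
    dark = dark_channel(img, patch)
    seed = grey_erosion(dark, size=patch)
    opened = reconstruction(seed, dark, method='dilation')   # removes small bright peaks
    n = max(1, int(opened.size * 0.001))
    idx = np.unravel_index(np.argsort(opened, axis=None)[-n:], opened.shape)
    return img[idx].mean(axis=0)            # one air-light value per RGB channel

# usage on a float image in [0, 1] of shape (H, W, 3)
img = np.random.rand(240, 320, 3)
A = estimate_airlight(img)
```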

Shuang Zheng, Liang Wang
One-Stage Image Inpainting with Hybrid Attention

Recently, attention-related image inpainting methods have achieved remarkable performance by reconstructing damaged regions from contextual information. However, due to the time-consuming two-stage coarse-to-fine architecture and the single-layer attention manner, they are often limited in generating reasonable and finely detailed results for irregularly damaged images. In this paper, we propose a novel one-stage image inpainting method with a Hybrid Attention Module (HAM). Specifically, the proposed HAM contains two submodules: the Pixel-Wise Spatial Attention Module (PWSAM) and the Multi-Scale Channel Attention Module (MSCAM). Benefiting from these, the reconstructed image features in the spatial dimension can be further optimized in the channel dimension to make inpainting results more visually realistic. Qualitative and quantitative experiments on three public datasets show that our proposed method outperforms state-of-the-art methods.

Lulu Zhao, Ling Shen, Richang Hong
Real-Time FPGA Design for OMP Targeting 8K Image Reconstruction

During the past decade, implementing reconstruction algorithms on hardware has been at the center of attention in the field of real-time reconstruction for Compressed Sensing (CS). Orthogonal Matching Pursuit (OMP) is the most widely used reconstruction algorithm in hardware implementations because it obtains good-quality reconstruction results at a reasonable time cost. OMP consists of the Dot Product (DP) and the Least Square Problem (LSP). These two parts involve numerous division calculations and considerable vector-based multiplications, which limit the implementation of real-time reconstruction on hardware. In CS theory, besides the reconstruction algorithm, the choice of sensing matrix affects the quality of reconstruction; it also influences reconstruction efficiency by affecting the hardware architecture. Thus, designing a real-time hardware architecture for OMP needs to take three factors into consideration: the choice of sensing matrix and the implementations of DP and LSP. In this paper, a sensing matrix that is sparse and consists mainly of zero vectors is adopted to optimize the OMP reconstruction and break the bottleneck of reconstruction efficiency. Based on the features of the chosen matrix, the DP and LSP are implemented by simple shift, add and compare procedures. This work is implemented on a Xilinx Virtex UltraScale+ FPGA device. To reconstruct a digital signal of length 1024 at a 0.25 sampling rate, the proposed method costs 0.818 μs while the state-of-the-art costs 238 μs; this work thus speeds up the state-of-the-art method by 290 times. It takes 0.026 s to reconstruct an 8K gray image, achieving 30 FPS real-time reconstruction.
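
For reference, a plain software version of OMP showing the dot-product (DP) atom selection and least-squares (LSP) steps that the paper accelerates in hardware; the toy signal sizes mirror the 1024-length example, but nothing here reflects the FPGA design itself.

```python
import numpy as np

def omp(Phi, y, sparsity):
    """Orthogonal Matching Pursuit: a plain software reference of the algorithm
    accelerated on FPGA in the paper (dot-product atom selection + least squares).
    """
    residual = y.copy()
    support = []
    x = np.zeros(Phi.shape[1])
    for _ in range(sparsity):
        # DP step: pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(Phi.T @ residual)))
        if k not in support:
            support.append(k)
        # LSP step: re-fit all selected atoms by least squares
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x[support] = coef
    return x

# toy example: recover a 5-sparse signal of length 1024 from 256 measurements
rng = np.random.default_rng(0)
Phi = rng.standard_normal((256, 1024)) / np.sqrt(256)
x_true = np.zeros(1024)
x_true[rng.choice(1024, 5, replace=False)] = rng.standard_normal(5)
x_hat = omp(Phi, Phi @ x_true, sparsity=5)
```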

Jiayao Xu, Chen Fu, Zhiqiang Zhang, Jinjia Zhou

Speech and Music

Frontmatter
Time-Frequency Attention for Speech Emotion Recognition with Squeeze-and-Excitation Blocks

In the field of Human-Computer Interaction (HCI), Speech Emotion Recognition (SER) is not only a fundamental step towards intelligent interaction but also plays an important role in smart environments, e.g., elderly home monitoring. Most deep learning-based SER systems invariably focus on handling high-level emotion-relevant features, which means the low-level feature differences between the time and frequency dimensions are rarely analyzed, leading to unsatisfactory accuracy in speech emotion recognition. In this paper, we propose Time-Frequency Attention (TFA) to mine significant low-level emotion features from the time domain and the frequency domain. To make full use of the global information after the feature fusion conducted by the TFA, we utilize Squeeze-and-Excitation (SE) blocks to compare emotion features from different channels. Experiments are conducted on a benchmark database, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus. The results indicate that the proposed model outperforms state-of-the-art methods with absolute increases of 1.7% and 3.2% in average class accuracy over four emotion classes and in weighted accuracy, respectively.
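
A standard Squeeze-and-Excitation block, as commonly implemented in PyTorch, is sketched below for readers unfamiliar with the mechanism; the channel count, reduction ratio, and input shape are illustrative, and the paper's TFA module is not shown.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global-average-pool the feature map, pass the
    channel descriptor through a small bottleneck MLP, and rescale the channels.
    This is the standard SE block; the paper combines it with its TFA module.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (batch, channels, time, freq)
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))         # squeeze -> (batch, channels)
        return x * w.view(b, c, 1, 1)           # excite: per-channel rescaling

se = SEBlock(64)
out = se(torch.randn(8, 64, 100, 40))           # e.g. spectrogram feature maps
```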

Ke Liu, Chen Wang, Jiayue Chen, Jun Feng
Speech Intelligibility Enhancement By Non-Parallel Speech Style Conversion Using CWT and iMetricGAN Based CycleGAN

Speech intelligibility enhancement is a perceptual enhancement technique for clean speech reproduced in noisy environments. Many studies enhance speech intelligibility through speaking style conversion (SSC), which relies solely on the Lombard effect and therefore does not work well under strong noise interference. These studies also model the conversion of the fundamental frequency (F0) with a straightforward linear transform and map only a few dimensions of Mel-cepstral coefficients (MCEPs). As F0 and MCEPs are critical aspects of hierarchical intonation, we believe that adequate modeling of these features is essential. In this paper, we use the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales that describe speech at different time resolutions for effective F0 conversion, and we represent MCEPs with 20 dimensions instead of the baseline's 10 for MCEP conversion. We utilize an iMetricGAN network to optimize the speech intelligibility metrics under strong noise. Experimental results show that the proposed Non-Parallel Speech Style Conversion using CWT and iMetricGAN based CycleGAN (NS-CiC) method outperforms the baselines, significantly increasing speech intelligibility in strong-noise environments in both objective and subjective evaluations.

Jing Xiao, Jiaqi Liu, Dengshi Li, Lanxin Zhao, Qianrui Wang
A-Muze-Net: Music Generation by Composing the Harmony Based on the Generated Melody

We present a method for the generation of Midi files of piano music. The method models the right and left hands using two networks, where the left hand is conditioned on the right hand. This way, the melody is generated before the harmony. The Midi is represented in a way that is invariant to the musical scale, and the melody is represented, for the purpose of conditioning the harmony, by the content of each bar, viewed as a chord. Finally, notes are added randomly, based on this chord representation, in order to enrich the generated audio. Our experiments show a significant improvement over the state of the art for training on such datasets, and demonstrate the contribution of each of the novel components.

Or Goren, Eliya Nachmani, Lior Wolf
Melody Generation from Lyrics Using Three Branch Conditional LSTM-GAN

With the availability of paired lyrics-melody datasets and advances in artificial intelligence techniques, research on melody generation conditioned on lyrics has become possible. In this work, we propose a novel architecture for melody generation, a Three Branch Conditional (TBC) LSTM-GAN conditioned on lyrics, composed of an LSTM-based generator and discriminator. The generative model consists of three identical and independent lyrics-conditioned LSTM-based sub-networks, each responsible for generating one attribute of a melody. For discrete-valued sequence generation, we leverage the Gumbel-Softmax technique to train the GAN. Through extensive experiments, we show that our proposed model generates tuneful and plausible melodies from the given lyrics and outperforms the current state-of-the-art models both quantitatively and qualitatively.
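
The Gumbel-Softmax trick mentioned above can be illustrated with PyTorch's built-in relaxation; the vocabulary size and temperature below are arbitrary examples, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def sample_notes(logits, temperature: float = 1.0, hard: bool = True):
    """Differentiable sampling of discrete melody attributes via Gumbel-Softmax.

    `hard=True` returns one-hot samples in the forward pass while gradients
    flow through the soft relaxation, which is what lets a GAN generator of
    discrete sequences be trained end to end.
    """
    return F.gumbel_softmax(logits, tau=temperature, hard=hard)

# e.g. one time step of a generated melody: batch of 16, vocabulary of 128 pitches
logits = torch.randn(16, 128, requires_grad=True)
one_hot_pitch = sample_notes(logits)
one_hot_pitch.sum().backward()                  # gradients reach the generator logits
```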

Abhishek Srivastava, Wei Duan, Rajiv Ratn Shah, Jianming Wu, Suhua Tang, Wei Li, Yi Yu

Multimodal Analytics

Frontmatter
Bi-attention Modal Separation Network for Multimodal Video Fusion

With the increasing popularity of video sharing websites such as YouTube and Facebook, multimodal video understanding has received increasing attention from the scientific community. Video is usually composed of multimodal signals, such as video, text, image and audio. The main approach to this task is to develop powerful multimodal fusion techniques, which transform data from multiple single-modal representations into a compact multimodal representation. An effective multimodal fusion method should account for two key characteristics: consistency and difference. Previous studies mainly focused on applying different interaction methods to modality fusion, such as late fusion, early fusion and attention fusion, but ignored modal independence in the fusion process. In this paper, we introduce a fusion approach called the bi-attention modal separation fusion network (BAMS), which extracts and integrates key information from various modalities and performs both fusion and separation on modality representations. We conduct thorough ablation studies, and our experiments on the MOSI and MOSEI datasets demonstrate significant gains over state-of-the-art models.

Pengfei Du, Yali Gao, Xiaoyong Li
Combining Knowledge and Multi-modal Fusion for Meme Classification

Internet memes are widespread on social media platforms such as Twitter and Facebook. Recently, meme classification has been an active research topic, especially meme sentiment classification and meme offensive classification. Internet memes contain multi-modal information, with the meme text embedded in the meme image. Existing methods classify memes by simply concatenating global visual and textual features to generate a multi-modal representation. However, these approaches ignore the noise introduced by global visual features and the potential common information in the multi-modal meme representation. In this paper, we propose a model for meme classification named MeBERT. Our method enhances the semantic representation of the meme by introducing conceptual information through external Knowledge Bases (KBs). To reduce noise, a concept-image attention module is designed to extract concept-sensitive visual representations. In addition, a deep convolution tensor fusion module is built to effectively integrate multi-modal information. To verify the effectiveness of the model on meme sentiment classification and meme offensive classification, we designed experiments on the Memotion and MultiOFF datasets. The experimental results show that MeBERT achieves better performance than state-of-the-art techniques for meme classification.

Qi Zhong, Qian Wang, Ji Liu
Non-Uniform Attention Network for Multi-modal Sentiment Analysis

Remarkable success has been achieved in the multi-modal sentiment analysis community thanks to the existence of annotated multi-modal datasets. However, the three different modalities involved (text, sound, and vision) create significant barriers to better feature fusion. In this paper, we introduce NUAN, a non-uniform attention network for multi-modal feature fusion. NUAN is designed with an attention mechanism that considers the three modalities simultaneously, but not uniformly: the text is treated as a determinate representation, and by leveraging the acoustic and visual representations, we inject the effective information into a solid representation, named the tripartite interaction representation. A novel non-uniform attention module (NUAM) is inserted between adjacent time steps of an LSTM (Long Short-Term Memory) network and processes information recurrently. The final outputs of the LSTM and the NUAM are concatenated into a vector, which is fed into a linear embedding layer to output the sentiment analysis result. Experimental analysis on two databases demonstrates the effectiveness of the proposed method.

Binqiang Wang, Gang Dong, Yaqian Zhao, Rengang Li, Qichun Cao, Yinyin Chao
Multimodal Unsupervised Image-to-Image Translation Without Independent Style Encoder

Multi-modal image-to-image translation frameworks often suffer from complex model structures and low training efficiency. In addition, we find that although these methods can maintain the structural information of the source image well, they cannot transfer the style of the reference image well. To solve these problems, we propose a novel framework called the Multimodal-No-Independent-Style-Encoder Generative Adversarial Network (MNISE-GAN), which simplifies the overall network structure by reusing the front part of the discriminator as the style encoder, achieving multi-modal image translation more effectively. At the same time, the discriminator directly uses the style code to classify real and synthetic samples, which enhances its classification ability and improves training efficiency. To strengthen style transfer, we propose a multi-scale style module embedded in the generator and an Adaptive Layer-Instance-Group Normalization (AdaLIGN) that further strengthens the generator's control over texture. Extensive quantitative and qualitative experiments on four popular image translation benchmarks demonstrate that our method is superior to state-of-the-art methods.

Yanbei Sun, Yao Lu, Haowei Lu, Qingjie Zhao, Shunzhou Wang
Backmatter
Metadata
Title
MultiMedia Modeling
Editors
Björn Þór Jónsson
Cathal Gurrin
Minh-Triet Tran
Duc-Tien Dang-Nguyen
Anita Min-Chun Hu
Binh Huynh Thi Thanh
Benoit Huet
Copyright Year
2022
Electronic ISBN
978-3-030-98358-1
Print ISBN
978-3-030-98357-4
DOI
https://doi.org/10.1007/978-3-030-98358-1
