
2023 | Book

MultiMedia Modeling

29th International Conference, MMM 2023, Bergen, Norway, January 9–12, 2023, Proceedings, Part II

Editors: Duc-Tien Dang-Nguyen, Cathal Gurrin, Martha Larson, Alan F. Smeaton, Stevan Rudinac, Minh-Son Dao, Christoph Trattner, Phoebe Chen

Publisher: Springer Nature Switzerland

Book Series: Lecture Notes in Computer Science


About this book

The two-volume set LNCS 13833 and LNCS 13834 constitutes the proceedings of the 29th International Conference on MultiMedia Modeling, MMM 2023, which took place in Bergen, Norway, during January 9-12, 2023.
The 86 papers presented in these proceedings were carefully reviewed and selected from a total of 267 submissions. They focus on topics related to multimedia content analysis; multimedia signal processing and communications; and multimedia applications and services.

Table of Contents

Frontmatter

Multimedia Processing and Applications

Frontmatter
Transparent Object Detection with Simulation Heatmap Guidance and Context Spatial Attention

Texture scarcity makes transparent object localization a challenging task in the computer vision community. This paper addresses the task from two aspects. (i) Additional guidance cues: we propose a Simulation Heatmap Guidance (SHG) to improve the localization ability of the model. Concretely, the target’s extreme points and inference centroids are used to generate simulation heatmaps that offer additional position guidance, yielding high recall even in extreme cases. (ii) Enhanced attention: we propose a Context Spatial Attention (CSA) module combined with a unique backbone to build dependencies between feature points and to boost multi-scale attention fusion. CSA is a lightweight module and brings a clear perceptual gain. Experiments show that our method achieves more accurate detection of cluttered transparent objects in various scenes and background settings, outperforming existing methods.

Shuo Chen, Di Li, Bobo Ju, Linhua Jiang, Dongfang Zhao
Deep3DSketch+: Rapid 3D Modeling from Single Free-Hand Sketches

The rapid development of AR/VR brings tremendous demand for 3D content. While the widely used Computer-Aided Design (CAD) approach requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of human-computer interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content reflecting creators’ ideas. Precise drawing from multiple views or strategic step-by-step drawing is often required to tackle the challenge but is not friendly to novice users. In this work, we introduce a novel end-to-end approach, Deep3DSketch+, which performs 3D modeling from only a single free-hand sketch without requiring multiple sketches or view information. Specifically, we introduce a lightweight generation network for efficient real-time inference and a structure-aware adversarial training approach with a Stroke Enhancement Module (SEM) that captures structural information to facilitate learning of realistic and fine-detailed shape structures for high-fidelity performance. Extensive experiments demonstrate the effectiveness of our approach, with state-of-the-art (SOTA) performance on both synthetic and real datasets.

Tianrun Chen, Chenglong Fu, Ying Zang, Lanyun Zhu, Jia Zhang, Papa Mao, Lingyun Sun
Manga Text Detection with Manga-Specific Data Augmentation and Its Applications on Emotion Analysis

We specifically target detecting text in atypical font styles and against cluttered backgrounds in Japanese comics (manga). To enable the detection model to detect atypical text, we augment the training data with the proposed manga-specific data augmentation. A generative adversarial network is developed to generate atypical text regions, which are then blended into manga pages to greatly increase the volume and diversity of training data. We verify the importance of manga-specific data augmentation. Furthermore, with the help of manga text detection, we fuse global visual features and local text features to enable more accurate emotion analysis.

Yi-Ting Yang, Wei-Ta Chu
SPEM: Self-adaptive Pooling Enhanced Attention Module for Image Recognition

Recently, many effective attention modules have been proposed to boost model performance by exploiting the internal information of convolutional neural networks in computer vision. In general, previous works overlook the design of the pooling strategy of the attention mechanism, taking global average pooling for granted, which hinders further improvement of the attention mechanism's performance. However, we empirically find and verify that a simple linear combination of global max-pooling and global min-pooling can produce pooling strategies that match or exceed the performance of global average pooling. Based on this empirical observation, we propose a simple-yet-effective attention module, SPEM, which adopts a self-adaptive pooling strategy based on global max-pooling and global min-pooling together with a lightweight module for producing the attention map. The effectiveness of SPEM is demonstrated by extensive experiments on widely used benchmark datasets and popular attention networks.

Shanshan Zhong, Wushao Wen, Jinghui Qin
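
As an illustration of the pooling idea described in the SPEM abstract above, the following sketch mixes global max-pooling and global min-pooling with a learnable weight and feeds the result to a small bottleneck that produces a channel-attention map. This is a minimal, hypothetical PyTorch sketch, not the authors' released module: the class name, the single mixing parameter, and the SE-style bottleneck are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MaxMinPoolAttention(nn.Module):
    """Illustrative channel attention: a learnable mix of global max- and
    min-pooling feeds a lightweight bottleneck producing per-channel weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # self-adaptive mixing weight
        self.bottleneck = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        gmax = torch.amax(x, dim=(2, 3))   # global max-pooling, (B, C)
        gmin = torch.amin(x, dim=(2, 3))   # global min-pooling, (B, C)
        pooled = self.alpha * gmax + (1.0 - self.alpha) * gmin
        attn = self.bottleneck(pooled).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return x * attn

# quick shape check on a dummy feature map
feat = torch.randn(2, 64, 32, 32)
print(MaxMinPoolAttention(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```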
Less Is More: Similarity Models for Content-Based Video Retrieval

The concept of object-to-object similarity plays a crucial role in interactive content-based video retrieval tools. Similarity (or distance) models are core components of several retrieval concepts, e.g., Query by Example or relevance feedback. In these scenarios, the common approach is to apply a feature extractor that transforms the object into a vector of features, i.e., positions it in an induced latent space. The similarity is then based on some distance metric in this space. Historically, feature extractors were mostly based on color histograms or hand-crafted descriptors such as SIFT, but nowadays state-of-the-art tools mostly rely on deep learning (DL) approaches. However, so far there has been no systematic study of how suitable individual feature extractors are in the video retrieval domain, or, in other words, to what extent human-perceived and model-based similarities are concordant. To fill this gap, we conducted a user study with over 4000 similarity judgements comparing over 20 variants of feature extractors. The results corroborate the dominance of deep learning approaches, but surprisingly favor smaller and simpler DL models over larger ones.

Patrik Veselý, Ladislav Peška
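
For readers unfamiliar with the retrieval setup described above, the sketch below shows the generic distance-based step: feature vectors from any extractor are L2-normalised and database items are ranked by cosine similarity to the query. This is a minimal illustration of Query-by-Example scoring in an induced latent space, not code from the study; the feature dimensionality and random vectors are placeholders.

```python
import numpy as np

def cosine_similarity_matrix(queries: np.ndarray, database: np.ndarray) -> np.ndarray:
    """Cosine similarity between L2-normalised query and database vectors."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = database / np.linalg.norm(database, axis=1, keepdims=True)
    return q @ d.T

# toy example: 3 query vectors against 5 database items, 512-D features
rng = np.random.default_rng(0)
sims = cosine_similarity_matrix(rng.normal(size=(3, 512)), rng.normal(size=(5, 512)))
ranking = np.argsort(-sims, axis=1)   # most similar database items first
print(ranking[0])
```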
Edge Assisted Asymmetric Convolution Network for MR Image Super-Resolution

High-resolution magnetic resonance (MR) imaging is beneficial for accurate disease diagnosis and subsequent analysis. Currently, single image super-resolution (SR) is an effective and less costly technique for improving the spatial resolution of MR images. Structural information in MR images is crucial during clinical diagnosis, but it is often ignored by existing deep learning MR image SR techniques. Consequently, we propose an edge assisted feature extraction block (EAFEB), which can efficiently extract content and edge features from low-resolution (LR) images, allowing the network to focus on both content and geometric structure. To fully utilize the features extracted by EAFEB, an asymmetric convolutional group (ACG) is proposed, which balances structural feature preservation and content feature extraction. Moreover, we design a novel contextual spatial attention (CSA) method to help the network focus on critical information. Experimental results on various MR image sequences, including T1, T2, and PD, show that our Edge Assisted Asymmetric Convolution Network (EAACN) achieves superior results relative to recent leading SR models.

Wanliang Wang, Fangsen Xing, Jiacheng Chen, Hangyao Tu
An Occlusion Model for Spectral Analysis of Light Field Signal

Occlusion is a common phenomenon in real scenes and seriously affects the application of light field rendering (LFR) technology. We propose an occlusion model of the scene surface that approximates the surface as a set of concave and convex parabolas to solve the light field (LF) reconstruction problem. The first step in this model is to determine the occlusion function. After obtaining the occlusion function, we can perform the plenoptic spectral analysis, through which the plenoptic spectrum reveals the occlusion characteristics. Finally, these characteristics can be used to determine the minimal sampling rate, and a new reconstruction filter can be applied to calibrate the aliased spectrum and achieve high-quality view synthesis. This extends previous work on LF reconstruction that considers the reconstruction filter. Experimental evaluation demonstrates that our occlusion model effectively addresses the occlusion problem while improving the rendering quality of the light field.

Weiyan Chen, Changjian Zhu, Shan Zhang, Sen Xiang
Context-Guided Multi-view Stereo with Depth Back-Projection

Depth map based multi-view stereo (MVS) takes images captured from multiple views of the same scene as input, estimates the depth in each view, and generates 3D reconstructions of the objects in the scene. Though most matching-based MVS methods take features of the input images into account, few of them make the best of the underlying global information in images, so they may suffer in difficult image regions, such as object boundaries, low-texture areas, and reflective surfaces. Human beings perceive these cases with the help of global awareness, that is to say, the context of the objects we observe. Similarly, we propose Context-guided Multi-view Stereo (ContextMVS), a coarse-to-fine pyramidal MVS network, which explicitly utilizes the context guidance in asymmetrical features to integrate global information into the 3D cost volume for feature matching. Also, with a low computational overhead, we adopt a depth back-projection refined up-sampling module to improve the non-parametric depth up-sampling between pyramid levels. Experimental results indicate that our method outperforms classical learning-based methods by a large margin on the public benchmarks DTU and Tanks and Temples, demonstrating the effectiveness of our method.

Tianxing Feng, Zhe Zhang, Kaiqiang Xiong, Ronggang Wang
RLSCNet: A Residual Line-Shaped Convolutional Network for Vanishing Point Detection

The convolutional neural network (CNN) is an effective model for vanishing point (VP) detection, but its success heavily relies on a massive amount of training data to ensure high accuracy. Without sufficient and balanced training data, the obtained CNN-based VP detection models easily overfit and generalize poorly. By acknowledging that a VP in the image is the intersection of projections of multiple parallel lines in the scene and treating this knowledge as a geometric prior, we propose a prior-guided residual line-shaped convolutional network for VP detection to reduce the dependence of the CNN on training data. In the proposed end-to-end approach, the probabilities of VPs in the image are computed through an edge extraction subnetwork and a VP prediction subnetwork, which explicitly establishes the geometric relationships among edges, lines, and vanishing points by stacking differentiable residual line-shaped convolutional modules. Our extensive experiments on various datasets show that the proposed VP detection network improves accuracy and outperforms previous methods in terms of both inference speed and generalization performance.

Wei Wang, Peng Lu, Xujun Peng, Wang Yin, Zhaoran Zhao
Energy Transfer Contrast Network for Unsupervised Domain Adaption

The main goal of unsupervised domain adaptation is to improve classification performance on unlabeled data in target domains. Many methods try to reduce the domain gap by treating multiple domains as one to enhance the generalization of a model. However, aligning domains as a whole does not account for instance-level alignment, which might lead to sub-optimal results. Currently, many researchers utilize meta-learning and instance segmentation approaches to tackle this problem, but these can only further optimize the domain-invariant features learned by the model rather than achieve instance-level alignment. In this paper, we interpret unsupervised domain adaptation from a new perspective, which exploits the energy difference between the source and target domains to reduce the performance drop caused by the domain gap. At the same time, we improve and exploit the contrastive learning loss, which can push the target domain away from the decision boundary. The experimental results on different benchmarks against a range of state-of-the-art approaches justify the performance and effectiveness of the proposed method.

Jiajun Ouyang, Qingxuan Lv, Shu Zhang, Junyu Dong
Recombining Vision Transformer Architecture for Fine-Grained Visual Categorization

Fine-grained visual categorization (FGVC) is a challenging task in the image analysis field that requires comprehensive discriminative feature extraction and representation. To get around this problem, previous works focus on designing complex modules, the so-called necks and heads, over simple backbones, which brings a huge computational burden. In this paper, we offer a new insight: the Vision Transformer itself is an all-in-one FGVC framework that consists of a basic Backbone for feature extraction, a Neck for further feature enhancement, and a Head for selecting discriminative features. We delve into the feature extraction and representation pattern of ViT for FGVC and empirically show that simply recombining the original ViT structure to leverage multi-level semantic representation, without introducing any other parameters, can achieve higher performance. With this insight, we propose RecViT, a simple recombination and modification of the original ViT, which can capture multi-level semantic features and facilitate fine-grained recognition. In RecViT, the deep layers of the original ViT serve as the Head, a few middle layers as the Neck, and the shallow layers as the Backbone. In addition, we adopt an optional Feature Processing Module to enhance discriminative feature representation at each semantic level and align them for final recognition. With the above simple modifications, RecViT obtains significant improvements in accuracy on the FGVC benchmarks CUB-200-2011, Stanford Cars, and Stanford Dogs.

Xuran Deng, Chuanbin Liu, Zhiying Lu
A Length-Sensitive Language-Bound Recognition Network for Multilingual Text Recognition

Due to the widespread use of English, considerable attention has been paid to scene text recognition with English as the target language, rather than multilingual scene text recognition. However, with the continuous advancement of global integration, it is increasingly necessary to recognize multilingual text. In this paper, a Length-sensitive Language-bound Recognition Network (LLRN) is proposed for multilingual text recognition. LLRN follows the traditional encoder-decoder structure. We improve the encoder and the decoder respectively to better adapt them to multilingual text recognition. On the one hand, we propose a Length-sensitive Encoder (LE) to encode features of different scales for long-text and short-text images respectively. On the other hand, we present a Language-bound Decoder (LD). LD leverages language prior information to constrain the original output of the decoder and further refine the recognition results. Moreover, to solve the problem of multilingual data imbalance, we propose a Language-balanced Data Augmentation (LDA) approach. Experiments show that our method outperforms English-oriented mainstream models and achieves state-of-the-art results on the MLT-2019 multilingual recognition benchmark.

Ming Gao, Shilian Wu, Zengfu Wang
Lightweight Multi-level Information Fusion Network for Facial Expression Recognition

The increasing capability of networks for facial expression recognition under disturbing factors is often accompanied by a large computational burden, which limits practical applications. In this paper, we propose a lightweight multi-level information fusion network with a distillation loss, which is more lightweight than other methods without losing accuracy. The multi-level information fusion block uses fewer parameters to focus on information from multiple levels with greater detail awareness, and the channel attention used in this block allows the network to concentrate on sensitive information when processing facial images with disturbing factors. In addition, the distillation loss makes the network less susceptible to errors of the teacher network. The proposed method has the fewest parameters (0.98 million) and GFLOPs (0.142) compared with state-of-the-art methods, while achieving 88.95%, 64.77%, 60.63%, and 62.28% on the RAF-DB, AffectNet-7, AffectNet-8, and SFEW datasets, respectively. Extensive experimental results show the effectiveness of the method. The code is available at https://github.com/Zzy9797/MLIFNet .

Yuan Zhang, Xiang Tian, Ziyang Zhang, Xiangmin Xu
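
The abstract above relies on a distillation loss; the snippet below sketches the standard Hinton-style knowledge distillation objective (a softened KL term plus a hard cross-entropy term) as a reference point. It is a generic formulation, not the paper's exact loss; the temperature, weighting factor, and seven-class setup are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Generic knowledge distillation: KL between softened teacher and student
    distributions, combined with the usual hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

s = torch.randn(8, 7)               # e.g. 7 expression classes
t = torch.randn(8, 7)               # teacher logits for the same batch
y = torch.randint(0, 7, (8,))
print(distillation_loss(s, t, y).item())
```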
Practical Analyses of How Common Social Media Platforms and Photo Storage Services Handle Uploaded Images

This study delves deeply into the changes made to digital images uploaded to three of the major social media platforms and image storage services in today’s society: Facebook, Flickr, and Google Photos. In addition to providing up-to-date data on the ever-changing landscape of social media networks’ digital fingerprints, a deep analysis of the networks’ filename conventions has resulted in two new approaches: (i) estimating the true upload date of Flickr photos, regardless of whether the dates have been changed by the user and regardless of whether the image is available to the public or has been deleted from the platform; and (ii) revealing the photo ID of a photo uploaded to Facebook based solely on the file name of the photo.

Duc-Tien Dang-Nguyen, Vegard Velle Sjøen, Dinh-Hai Le, Thien-Phu Dao, Anh-Duy Tran, Minh-Triet Tran
CCF-Net: A Cascade Center-Based Framework Towards Efficient Human Parts Detection

Human parts detection has made remarkable progress due to the development of deep convolutional networks. However, many SOTA detection methods incur high computational costs and are still difficult to deploy on edge devices with limited computing resources. In this paper, we propose a lightweight Cascade Center-based Framework, called CCF-Net, for human parts detection. Firstly, a Gaussian-Induced penalty strategy is designed to ensure that the network can handle objects of various scales. Then, we use a Cascade Attention Module to capture relations between different feature maps, which refines intermediate features. With our novel cross-dataset training strategy, our framework fully explores datasets with incomplete annotations and achieves better performance. Furthermore, Center-based Knowledge Distillation is proposed to enable student models to learn better representations without additional cost. Experiments show that our method achieves a new SOTA performance on the Human-Parts and COCO Human Parts benchmarks (the datasets used in this paper were downloaded and experimented on by Kai Ye from Shenzhen University).

Kai Ye, Haoqin Ji, Yuan Li, Lei Wang, Peng Liu, Linlin Shen
Low-Light Image Enhancement Under Non-uniform Dark

The low visibility of low-light images due to lack of exposure poses a significant challenge for vision tasks such as image fusion, detection, and segmentation in low-light conditions. Real-world situations such as backlighting and shadow occlusion mostly involve non-uniform low light; existing enhancement methods tend to brighten both low-light and normal-light regions, whereas we would rather enhance dark regions and suppress overexposed regions. To address this problem, we propose a new non-uniform dark visual network (NDVN) that uses an attention mechanism to enhance regions with different levels of illumination separately. Since deep learning is strongly data-driven, we carefully construct a non-uniform dark synthetic dataset (UDL) that is larger and more diverse than existing datasets and, more importantly, contains more non-uniform lighting states. We use the manually annotated luminance domain mask (LD-mask) in the dataset to drive the network to distinguish between low-light and extremely dark regions in the image. Guided by the LD-mask and the attention mechanism, the NDVN adaptively illuminates different light regions while enhancing the color and contrast of the image. More importantly, we introduce a new region loss function to constrain the network, resulting in better-quality enhancement results. Extensive experiments show that our proposed network outperforms other state-of-the-art methods both qualitatively and quantitatively.

Yuhang Li, Feifan Cai, Yifei Tu, Youdong Ding
A Proposal-Improved Few-Shot Embedding Model with Contrastive Learning

Few-shot learning is increasingly popular in image classification. The key is to learn significant features from source classes to match support and query pairs. In this paper, we redesign the contrastive learning scheme in a few-shot manner with selected proposal boxes generated by a Navigator network. The main contributions of this paper are: (i) We analyze the limitations of hard sample generation in current few-shot learning methods with contrastive learning and find that additional noise is introduced in contrastive loss construction. (ii) We propose a novel embedding model with contrastive learning, named infoPB, which improves hard samples with proposal boxes to improve Noise Contrastive Estimation. (iii) We demonstrate that infoPB is effective in few-shot image classification and benefits from the Navigator network through an ablation study. (iv) The performance of our method is evaluated thoroughly on typical few-shot image classification tasks. It achieves new state-of-the-art performance compared with outstanding competitors’ best results on miniImageNet in the 5-way 5-shot setting and on tieredImageNet in the 5-way 1-shot and 5-way 5-shot settings.

Fucai Gong, Yuchen Xie, Le Jiang, Keming Chen, Yunxin Liu, Xiaozhou Ye, Ye Ouyang
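
Since the abstract above builds on Noise Contrastive Estimation, the following sketch shows a generic InfoNCE-style contrastive loss in which each anchor embedding is matched against its positive within a batch. It is a standard textbook formulation given for context only; it does not include the proposal-box sampling or the Navigator network from the paper, and the embedding sizes and temperature are arbitrary choices.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors, positives, temperature=0.07):
    """Generic InfoNCE loss: each anchor is pulled toward its own positive
    and pushed away from the other samples in the batch."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(a.size(0))       # the i-th positive matches the i-th anchor
    return F.cross_entropy(logits, targets)

anchor = torch.randn(16, 128)
positive = anchor + 0.1 * torch.randn(16, 128)   # e.g. features of an augmented view
print(info_nce_loss(anchor, positive).item())
```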
Weighted Multi-view Clustering Based on Internal Evaluation

As real-world data are often represented by multiple sets of features in different views, it is desirable to improve on ordinary single-view clustering by making use of the consensus and complementarity among different views. For this purpose, weighted multi-view clustering combines multiple individual views into one combined view, which is used to generate the final clustering result. In this paper we present a simple yet effective weighted multi-view clustering algorithm based on internal evaluation of clustering results. Observing that an internal evaluation criterion can be used to estimate the quality of clustering results, we propose to weight the different views so as to maximize the clustering quality in the combined view. We first introduce an implementation of the Dunn index and a heuristic method to determine the scale parameter in spectral clustering. Then an adaptive weight initialization and updating method is proposed to improve the clustering results iteratively. Finally we perform spectral clustering in the combined view to generate the clustering result. In experiments with several publicly available image and text datasets, our algorithm compares favorably or comparably with some other algorithms.

Haoqi Xu, Jian Hou, Huaqiang Yuan
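
The abstract above uses an internal evaluation criterion, the Dunn index, to weight views; below is a textbook implementation of that index for reference, computing the smallest between-cluster distance divided by the largest within-cluster diameter. This is a generic sketch assuming Euclidean distances; the paper introduces its own implementation and weighting scheme, which are not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X: np.ndarray, labels: np.ndarray) -> float:
    """Textbook Dunn index: min between-cluster distance / max within-cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    min_between = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters)) for j in range(i + 1, len(clusters))
    )
    max_within = max(cdist(c, c).max() for c in clusters)
    return min_between / max_within

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(dunn_index(X, labels))   # well-separated clusters give a large value
```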
BENet: Boundary Enhance Network for Salient Object Detection

Although deep convolutional networks have achieved good results in the field of salient object detection, most of these methods cannot work well near object boundaries. This results in poor boundary quality in network predictions, accompanied by many blurred contours and hollow objects. To solve this problem, this paper proposes a Boundary Enhance Network (BENet) for salient object detection, which makes the network pay more attention to salient edge features by fusing auxiliary boundary information of objects. We adopt a Progressive Feature Extraction Module (PFEM) to obtain multi-scale edge and object features of salient objects. In response to the semantic gap problem in feature fusion, we propose an Adaptive Edge Fusion Module (AEFM) to allow the network to adaptively and complementarily fuse edge features and salient object features. The Self Refinement (SR) module further repairs and enhances edge features. Moreover, in order to make the network pay more attention to the boundary, we design an edge enhance loss function, which uses additional boundary maps to guide the network to learn rich boundary features at the pixel level. Experimental results show that our proposed method outperforms state-of-the-art methods on five benchmark datasets.

Zhiqi Yan, Shuang Liang
PEFNet: Positional Embedding Feature for Polyp Segmentation

With the development of biomedical computing, the segmentation task is integral in helping doctors correctly identify the position of polyps or other ailments. However, precise polyp segmentation is challenging because polyps of the same type vary in size, color, and texture, and previous methods cannot fully transfer information from encoder to decoder due to the lack of details and knowledge from previous layers. To deal with this problem, we propose PEFNet, a novel model using a modified UNet with a new Positional Embedding Feature block in the merging stage, which achieves higher accuracy and generalization in polyp segmentation. The PEF block utilizes the positional information, the concatenated features, and the extracted features to enrich the gained knowledge and improve the model's comprehension ability. With EfficientNetV2-L as the backbone, we obtain an IoU score of 0.8201 and a Dice coefficient of 0.8802 on the Kvasir-SEG dataset. With PEFNet, we also took second place in the task Medico: Transparency in Medical Image Segmentation at MediaEval 2021, which is clear proof of the effectiveness of our models.

Trong-Hieu Nguyen-Mau, Quoc-Huy Trinh, Nhat-Tan Bui, Phuoc-Thao Vo Thi, Minh-Van Nguyen, Xuan-Nam Cao, Minh-Triet Tran, Hai-Dang Nguyen
MCOM-Live: A Multi-Codec Optimization Model at the Edge for Live Streaming

HTTP Adaptive Streaming (HAS) is the predominant technique for delivering video content across the Internet, and demand for its applications keeps increasing. As videos evolve to deliver more immersive experiences, for instance through higher resolutions and framerates, highly efficient video compression schemes are required to ease the burden on the delivery process. While AVC/H.264 still represents the most widely adopted codec, we are experiencing an increase in the usage of new-generation codecs (HEVC/H.265, VP9, AV1, VVC/H.266, etc.). Compared to AVC/H.264, these codecs can either achieve the same quality at a reduced bitrate or improve the quality while targeting the same bitrate. In this paper, we propose a Mixed-Binary Linear Programming (MBLP) model called Multi-Codec Optimization Model at the edge for Live streaming (MCOM-Live) to jointly optimize (i) the overall streaming costs and (ii) the visual quality of the content played out by the end-users by efficiently enabling multi-codec content delivery. Given a video content encoded with multiple codecs according to a fixed bitrate ladder, the model chooses among three available policies, i.e., fetch, transcode, or skip, the best option to handle the representations. We compare the proposed model with traditional approaches used in the industry. The experimental results show that our proposed method can reduce the additional latency by up to 23% and the streaming costs by up to 78%, besides improving the visual quality of the delivered segments by up to 0.5 dB in terms of PSNR.

Daniele Lorenzi, Farzad Tashtarian, Hadi Amirpour, Christian Timmerer, Hermann Hellwagner
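
To make the fetch/transcode/skip policy selection above concrete, the toy sketch below formulates a drastically simplified binary program with PuLP: exactly one policy per representation, minimizing delivery cost minus a quality reward. The bitrate ladder, costs, quality values, and trade-off weight are invented placeholders, and the formulation omits the latency, bandwidth, and transcoding-capacity constraints of the actual MCOM-Live model.

```python
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum, value

# toy ladder: representation -> (fetch cost, transcode cost, quality fetched, quality transcoded)
reps = {
    "1080p": (8.0, 5.0, 1.00, 0.90),
    "720p":  (5.0, 3.0, 0.70, 0.65),
    "480p":  (3.0, 2.0, 0.40, 0.38),
}
policies = ["fetch", "transcode", "skip"]
lam = 6.0  # invented weight trading off cost against served quality

prob = LpProblem("toy_multi_codec_edge", LpMinimize)
x = {(r, p): LpVariable(f"x_{r}_{p}", cat=LpBinary) for r in reps for p in policies}

for r in reps:  # exactly one policy per representation
    prob += lpSum(x[r, p] for p in policies) == 1

cost = lpSum(reps[r][0] * x[r, "fetch"] + reps[r][1] * x[r, "transcode"] for r in reps)
quality = lpSum(reps[r][2] * x[r, "fetch"] + reps[r][3] * x[r, "transcode"] for r in reps)
prob += cost - lam * quality  # objective: cheap delivery, high served quality

prob.solve()
for (r, p), var in x.items():
    if value(var) == 1:
        print(r, "->", p)
```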
LAE-Net: Light and Efficient Network for Compressed Video Action Recognition

Action recognition is a crucial task in computer vision and video analysis. The two-stream network and 3D ConvNets are representative works. Although both have achieved outstanding performance, optical flow and 3D convolution require huge computational effort, without taking into account the need for real-time applications. Recent work extracts motion vectors and residuals directly from compressed video to replace optical flow. However, due to the noisy and inaccurate representation of motion, model accuracy decreases significantly when motion vectors are used as input. Besides, current works focus only on improving accuracy or reducing computational cost, without exploring the trade-off between them. In this paper, we propose a light and efficient multi-stream framework, including a motion temporal fusion module (MTFM) and a double compressed knowledge distillation module (DCKD). MTFM improves the network’s ability to extract complete motion information and compensates to some extent for the inaccurate description of motion by motion vectors in compressed video. DCKD allows the student network to gain more knowledge from the teacher with fewer parameters and input frames. Experimental results on two public benchmarks (UCF-101 and HMDB-51) outperform the state of the art in the compressed domain.

Jinxin Guo, Jiaqiang Zhang, Xiaojing Zhang, Ming Ma
DARTS-PAP: Differentiable Neural Architecture Search by Polarization of Instance Complexity Weighted Architecture Parameters

Neural architecture search has attracted much attention because it can automatically find architectures with high performance. In recent years, differentiable architecture search has emerged as one of the main techniques for automatic network design. However, related methods suffer from performance collapse due to excessive skip-connect operations and the discretization gap between search and evaluation. To relieve performance collapse, we propose a polarization regularizer on instance-complexity weighted architecture parameters that pushes the probability of the most important operation in each edge to 1 and the probabilities of the other operations to 0. The polarization regularizer effectively removes the discretization gap between the search and evaluation procedures, and instance-complexity aware learning of the architecture parameters gives higher weights to hard inputs, further improving network performance. As in existing methods, the search process is conducted in a differentiable way. Extensive experiments on a variety of search spaces and datasets show that our method can polarize the architecture parameters well and greatly reduce the number of skip-connect operations, which contributes to the performance of the searched networks.

Yunhong Li, Shuai Li, Zhenhua Yu
Pseudo-label Diversity Exploitation for Few-Shot Object Detection

The Few-Shot Object Detection (FSOD) task is widely used in various data-scarce scenarios, aiming to extend an object detector with a few novel-class samples. Current mainstream FSOD models improve accuracy by mining novel-class instances in the training set and fine-tuning the detector with the mined pseudo set. Substantial progress has been made using pseudo-label approaches, but the impact of pseudo-label diversity on FSOD tasks has not been explored. In our work, to fully utilize the pseudo-label set and explore its diversity, we propose a new framework mainly comprising a Novel Instance Bank (NIB) and Correlation-Guided Loss Correction (CGLC). The dynamically updated NIB stores novel-class instances to increase the diversity of novel instances in each batch. Moreover, to better exploit pseudo-label diversity, CGLC adaptively employs the k-shot samples to guide correct and incorrect pseudo-labels to pull away from each other. Experimental results on the MS-COCO dataset demonstrate the effectiveness of our method, which does not require any additional training samples or parameters. Our code is available at: https://github.com/lotuser1/PDE .

Song Chen, Chong Wang, Weijie Liu, Zhengjie Ye, Jiacheng Deng
HSS: A Hierarchical Semantic Similarity Hard Negative Sampling Method for Dense Retrievers

Dense Retriever (DR) for open-domain textual question answering (OpenQA), which aims to retrieve passages from large data sources like Wikipedia or Google, has gained wide attention in recent years. Although DR models continuously refresh state-of-the-art performance, their improvement relies on negative sampling during the training process. Existing sampling strategies mainly focus on developing complex algorithms and ignore the abundant semantic features of datasets. We discover that there are obvious changes in semantic similarity and present a three-level hierarchy of semantic similarity: same topic, same class, other class, whose rationality is further demonstrated by an ablation study. Based on this, we propose a hard negative sampling strategy named Hierarchical Semantic Similarity (HSS). Our HSS model performs negative sampling at the semantic levels of topic and class, and experimental results on four datasets show that it achieves comparable or better retrieval performance compared with existing competitive baselines. The code is available at https://github.com/redirecttttt/HSS.

Xinjia Xie, Feng Liu, Shun Gai, Zhen Huang, Minghao Hu, Ankun Wang
Realtime Sitting Posture Recognition on Embedded Device

It is difficult to maintain a standard sitting posture for long periods of time, and a non-standard sitting posture can damage human health. Therefore, it is important to detect sitting posture in real time and remind users to adjust to a healthy sitting posture. Deep learning-based sitting posture recognition methods currently achieve good recognition accuracy, but the models cannot combine high accuracy and speed on embedded platforms, which makes them difficult to apply in edge intelligence. To overcome this challenge, we propose a fast sitting posture recognition method based on OpenPose, using a ShuffleNetV2 network to replace the original backbone for extracting the underlying features and using a cosine information distance to find redundant filters for pruning and optimizing the model, yielding a lightweight pruned model for more efficient real-time interaction. At the same time, the sitting posture recognition method is improved by fusing joint distance and angle features with the skeletal joint features, improving accuracy while maintaining recognition speed. The optimized model not only runs at 8 fps on the Jetson Nano embedded device but also achieves a recognition accuracy of 94.73%. Experimental results show that the improved model meets the requirements of real-time sitting posture detection on embedded devices.

Jingsen Fang, Shoudong Shi, Yi Fang, Zheng Huo
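
The abstract above fuses joint distance and angle features computed from skeletal keypoints; the small sketch below shows how such features are typically derived from 2D pose keypoints. The keypoint coordinates and the choice of joints are made up for illustration and are not the paper's feature set.

```python
import numpy as np

def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle (in degrees) at joint b formed by the segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))

# toy 2D keypoints (e.g. as produced by a pose estimator such as OpenPose)
neck = np.array([0.00, 1.60])
hip = np.array([0.05, 1.00])
knee = np.array([0.30, 0.55])

torso_length = np.linalg.norm(neck - hip)   # a joint-distance feature
hip_angle = joint_angle(neck, hip, knee)    # a joint-angle feature at the hip
print(torso_length, hip_angle)
```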
Comparison of Deep Learning Techniques for Video-Based Automatic Recognition of Greek Folk Dances

Folk dances constitute an important part of the Intangible Cultural Heritage (ICH) of each place. Nowadays, there is a great amount of video related to folk dances. An automatic dance recognition algorithm can ease the management of this content and support the promotion of folk dances to the younger generations. Automatic dance recognition is still an open research area that belongs to the more general field of human activity recognition. Our work focuses on exploring existing deep neural network architectures for the automatic recognition of Greek folk dances depicted in standard videos, as well as experimenting with different input representations. For our experiments, we collected YouTube videos of Greek folk dances from north-eastern Greece. Specifically, we validated three different deep neural network architectures using raw RGB and grayscale video frames, optical flow, as well as “visualised” multi-person 2D poses. In this paper, we describe our experiments and present the results and findings of the conducted research.

Georgios Loupas, Theodora Pistola, Sotiris Diplaris, Konstantinos Ioannidis, Stefanos Vrochidis, Ioannis Kompatsiaris
Dynamic Feature Selection for Structural Image Content Recognition

Structural image content recognition (SICR) aims to transcribe a two-dimensional structural image (e.g., a mathematical expression, chemical formula, or music score) into a token sequence. Existing methods are mainly encoder-decoder based and overlook the importance of feature selection and spatial relation extraction in the feature map. In this paper, we propose DEAL (short for Dynamic fEAture seLection) for SICR, which contains a dynamic feature selector and a spatial relation extractor as two cornerstone modules. Specifically, we propose a novel loss function and a random exploration strategy to dynamically select useful image cells for target sequence generation. Further, we consider the positional and surrounding information of cells in the feature map to extract spatial relations. We conduct extensive experiments to evaluate the performance of DEAL. Experimental results show that DEAL significantly outperforms other state-of-the-art methods.

Yingnan Fu, Shu Zheng, Wenyuan Cai, Ming Gao, Cheqing Jin, Aoying Zhou
Dynamic-Static Cross Attentional Feature Fusion Method for Speech Emotion Recognition

Dynamic-static fusion features play an important role in speech emotion recognition (SER). However, dynamic and static features are generally fused by simple addition or serial concatenation, which may lose some of the underlying emotional information. To address this issue, we propose a dynamic-static cross attentional feature fusion method (SD-CAFF) with a cross attentional feature fusion mechanism (Cross AFF) to extract superior deep dynamic-static fusion features. Specifically, the Cross AFF fuses, in parallel, the deep features from the CNN/LSTM feature extraction module, which extracts deep static and deep dynamic features from acoustic features (MFCC, Delta, and Delta-delta). In addition to the SD-CAFF framework, we also employ multi-task learning in the training process to further improve the accuracy of emotion recognition. The experimental results on IEMOCAP demonstrate that the WA and UA of SD-CAFF are 75.78% and 74.89%, respectively, outperforming the current SOTAs. Furthermore, SD-CAFF achieves competitive performance (WA: 56.77%; UA: 56.30%) in cross-corpus experiments on MSP-IMPROV.

Ke Dong, Hao Peng, Jie Che
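
As a reference for the acoustic inputs named above (MFCC, Delta, and Delta-delta), the snippet below computes them with librosa and stacks them into a three-channel feature tensor. The file path, sampling rate, and number of coefficients are placeholder assumptions; the paper's exact front-end configuration may differ.

```python
import numpy as np
import librosa

# load an utterance (the path is a placeholder) and compute the acoustic features
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)   # "static" features
delta = librosa.feature.delta(mfcc, order=1)         # dynamic: first derivative
delta2 = librosa.feature.delta(mfcc, order=2)        # dynamic: second derivative
features = np.stack([mfcc, delta, delta2], axis=0)   # shape: (3, 40, frames)
print(features.shape)
```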
Research on Multi-task Semantic Segmentation Based on Attention and Feature Fusion Method

Recently, single-task learning on semantic segmentation tasks has achieved good results. However, when multiple tasks are handled simultaneously, single-task learning requires an independent network structure for each task, with no interaction between tasks. This paper proposes a multi-task learning method based on feature fusion and an attention mechanism, which can simultaneously handle multiple related tasks (semantic segmentation, surface normal estimation, etc.). Our model includes a feature extraction module to extract semantic information at different scales, a feature fusion module to refine the extracted features, and an attention mechanism that processes information from the fusion modules to learn task-specific information. The proposed network architecture is trained in an end-to-end manner and simultaneously improves the performance of multiple tasks. Experiments are carried out on two well-known semantic segmentation datasets, and the accuracy of the proposed model is verified.

Aimei Dong, Sidi Liu
Space-Time Video Super-Resolution 3D Transformer

Space-time video super-resolution aims to generate a high-resolution (HR), high-frame-rate (HFR) video from a low-frame-rate (LFR), low-resolution (LR) video. Simply combining a video frame interpolation (VFI) network and a video super-resolution (VSR) network to solve this problem cannot bring satisfying performance and also requires a heavy computational burden. In this paper, we investigate a one-stage network to jointly up-sample video in both time and space. In our framework, a 3D pyramid structure with channel attention is proposed to fuse input frames and generate intermediate features. The features are fed into a 3D Transformer network to model global relationships between features. Our proposed network, 3DTFSR, can efficiently process videos without explicit motion compensation. Extensive experiments on benchmark datasets demonstrate that the proposed method achieves better quantitative and qualitative performance compared to two-stage networks.

Minyan Zheng, Jianping Luo
Graph-Based Data Association in Multiple Object Tracking: A Survey

In Multiple Object Tracking (MOT), data association is a key component of the tracking-by-detection paradigm and endeavors to link a set of discrete object observations across a video sequence, yielding possible trajectories. Our intention is to provide a classification of numerous graph-based works according to the way they measure object dependencies and their footprint on the graph structure they construct. In particular, methods are organized into Measurement-to-Measurement (MtM), Measurement-to-Track (MtT), and Track-to-Track (TtT). At the same time, we include recent Deep Learning (DL) implementations among traditional approaches to present the latest trends and developments in the field and offer a performance comparison. In doing so, this work serves as a foundation for future research by providing newcomers with information about the graph-based bibliography of MOT.

Despoina Touska, Konstantinos Gkountakos, Theodora Tsikrika, Konstantinos Ioannidis, Stefanos Vrochidis, Ioannis Kompatsiaris
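
As background for the Measurement-to-Track category discussed in this survey, a classic baseline is a bipartite assignment between existing tracks and new detections, solvable with the Hungarian algorithm. The sketch below uses SciPy's linear_sum_assignment on an invented cost matrix; it illustrates the association step only and is not taken from any specific surveyed method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# toy cost matrix: rows = existing tracks, columns = new detections,
# entries = association cost (e.g. 1 - IoU or an appearance distance)
cost = np.array([
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.6, 0.3],
])
rows, cols = linear_sum_assignment(cost)   # optimal one-to-one assignment
for t, d in zip(rows, cols):
    print(f"track {t} -> detection {d} (cost {cost[t, d]:.2f})")
```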
Multi-view Adaptive Bone Activation from Chest X-Ray with Conditional Adversarial Nets

Activating bone from a chest X-ray (CXR) is significant for disease diagnosis and health equity in under-developed areas, while the complex overlap of anatomical structures in CXR constantly challenges bone activation performance and adaptability. Besides, due to high data collection and annotation costs, no large-scale labeled datasets are available. As a result, existing methods commonly use single-view CXR with annotations to activate bone. To address these challenges, in this paper, we propose an adaptive bone activation framework. This framework leverages Dual-Energy Subtraction (DES) images to form multi-view image pairs with the CXR and uses contrastive learning theory to construct training samples. In particular, we first devise a Siamese/Triplet architecture supervisor; correspondingly, we establish a cGAN-styled activator based on the learned skeletal information to generate the bone image from the CXR. To our knowledge, the proposed method is the first multi-view bone activation framework obtained without manual annotation, and it has more robust adaptability. The mean Relative Mean Absolute Error ($$\overline{RMAE}$$) and the Fréchet Inception Distance (FID) are 3.45% and 1.12, respectively, which proves that the results activated by our method retain more skeletal details with few feature-distribution changes. The visualized results show that our method can activate bone images from a single CXR while ignoring overlapping areas. Bone activation is drastically improved compared to the original images.

Chaoqun Niu, Yuan Li, Jian Wang, Jizhe Zhou, Tu Xiong, Dong Yu, Huili Guo, Lin Zhang, Weibo Liang, Jiancheng Lv
Multimodal Reconstruct and Align Net for Missing Modality Problem in Sentiment Analysis

Multimodal Sentiment Analysis (MSA) aims at recognizing emotion categories from textual, visual, and acoustic cues. However, in real-life scenarios, one or two modalities may be missing for various reasons, and when the text modality is missing, obvious deterioration is observed, since the text modality contains much more semantic information than the vision and audio modalities. To this end, we propose the Multimodal Reconstruct and Align Net (MRAN) to tackle the missing modality problem and especially to relieve the decline caused by the text modality’s absence. We first propose the Multimodal Embedding and Missing Index Embedding to guide the reconstruction of missing modalities’ features. Then, visual and acoustic features are projected into the textual feature space, and all three modalities’ features are learned to be close to the word embedding of their corresponding emotion category, aligning visual and acoustic features with textual features. In this text-centered way, the vision and audio modalities benefit from the more informative text modality, which improves the robustness of the network under different modality missing conditions, especially when the text modality is missing. Experimental results on two multimodal benchmarks, IEMOCAP and CMU-MOSEI, show that our method outperforms baseline methods, gaining superior results under different kinds of modality missing conditions.

Wei Luo, Mengying Xu, Hanjiang Lai
Lightweight Image Hashing Based on Knowledge Distillation and Optimal Transport for Face Retrieval

This paper proposes a lightweight image hashing method based on knowledge distillation and optimal transport for face retrieval. A key contribution is the attention-based triplet knowledge distillation, whose loss function includes an attention loss, a Kullback-Leibler (KL) loss, and an identity loss. It can significantly reduce network size with almost no decrease in retrieval performance. Another contribution is the hash quantization based on optimal transport. It partitions the face feature space by calculating class centers and conducts binary quantization based on the optimal transport. This improves face retrieval performance at short bit lengths. In addition, an alternating training strategy is designed for tuning the network parameters of our lightweight hashing. Extensive experiments on two face datasets are carried out to test the performance of the proposed lightweight hashing. Retrieval comparisons illustrate that the proposed lightweight hashing outperforms some well-known hashing methods.

Ping Feng, Hanyun Zhang, Yingying Sun, Zhenjun Tang
CMFG: Cross-Model Fine-Grained Feature Interaction for Text-Video Retrieval

As a fundamental task in the multimodal domain, text-to-video retrieval has received great attention in recent years. Most current research focuses on the interaction between cross-modal coarse-grained features, and the feature granularity of retrieval models has not been fully explored. Therefore, we introduce video-internal region information into cross-modal retrieval and propose a cross-model fine-grained feature retrieval framework. Videos are represented as video-frame-region triple features, texts are represented as sentence-word dual features, and the cross-similarity between visual features and text features is computed through token-wise interaction. This effectively extracts the detailed information in the video, guides the model to attend to the informative video regions and the keywords in the sentence, and reduces the adverse effects of redundant words and interfering frames. On the most popular retrieval dataset, MSRVTT, the framework achieves state-of-the-art results (51.1@1). These experimental results demonstrate the superiority of fine-grained feature interaction.

Shengwei Zhao, Yuying Liu, Shaoyi Du, Zhiqiang Tian, Ting Qu, Linhai Xu
Transferable Adversarial Attack on 3D Object Tracking in Point Cloud

3D point cloud tracking has recently witnessed considerable progress with deep learning. Such progress, however, mainly focuses on improving tracking accuracy. The risk of a tracker being attacked, especially considering that deep neural networks are vulnerable to adversarial perturbations, is often neglected and rarely explored. In order to draw attention to this potential risk and facilitate the study of robustness in point cloud tracking, we introduce a novel transferable attack network (TAN) to deceive 3D point cloud tracking. Specifically, TAN consists of a 3D adversarial generator, which is trained with a carefully designed multi-fold drift (MFD) loss. The MFD loss considers three common grounds, including classification, intermediate feature, and angle drifts, across different 3D point cloud tracking frameworks for perturbation generation, leading to high transferability of TAN for attack. In our extensive experiments, we demonstrate that the proposed TAN is able not only to drastically degrade the victim 3D point cloud tracker, i.e., P2B [21], but also to effectively deceive other unseen state-of-the-art approaches such as BAT [33] and M$$^{2}$$Track [34], posing a new threat to 3D point cloud tracking. Code will be available at https://github.com/Xiaoqiong-Liu/TAN .

Xiaoqiong Liu, Yuewei Lin, Qing Yang, Heng Fan
A Spectrum Dependent Depth Layered Model for Optimization Rendering Quality of Light Field

Light field rendering technology is an important tool that applies a set of multi-view images to render realistic novel views and experiences through simple interpolation. However, the rendered novel views often have various distortions or low quality due to the complexity of the scene, e.g., occlusion and non-Lambertian effects. The distortion of novel views appears in the spectrum of the light field signal as periodic aliasing. In this paper, we propose a spectrum dependent depth layered (SDDL) model to eliminate the spectrum aliasing of light field signals and thereby improve the rendering quality of novel views. The SDDL model takes advantage of the fact that the spectral structure of the light field signal is limited only by the minimum and maximum scene depths. By increasing the number of depth layers between the minimum and maximum depth, the sampling interval between cameras is reduced and the spacing between adjacent spectral replicas becomes larger. Thus, the aliasing of novel views becomes smaller and can be largely eliminated. Experimental results show that our method can improve the rendering quality of the light field.

Xiangqi Gan, Changjian Zhu, Mengqin Bai, Ying Wei, Weiyan Chen
Transformer-Based Cross-Modal Recipe Embeddings with Large Batch Training

Cross-modal recipe retrieval aims to exploit the relationships between recipe images and texts and accomplish mutual retrieval, which is easy for humans but arduous to formulate. Although many previous works have endeavored to solve this problem, most do not efficiently exploit the cross-modal information among recipe data. In this paper, we present a frustratingly straightforward cross-modal recipe retrieval framework, Transformer-based Network for Large Batch Training (TNLBT), which achieves high performance on both recipe retrieval and image generation tasks and is designed to efficiently exploit the rich cross-modal information. In our proposed framework, Transformer-based encoders are applied for both image and text encoding for cross-modal embedding learning. We also adopt several loss functions, such as a self-supervised learning loss on recipe text, to encourage the model to further promote cross-modal embedding learning. Since contrastive learning can benefit from a larger batch size according to the recent literature on self-supervised learning, we adopt a large batch size during training and validate its effectiveness. The experimental results show that TNLBT significantly outperforms the current state-of-the-art frameworks in both cross-modal recipe retrieval and image generation tasks on the benchmark Recipe1M by a huge margin. We also find that CLIP-ViT performs better than ViT-B as the image encoder backbone. This is the first work to confirm the effectiveness of large batch training on cross-modal recipe embedding learning.

Jing Yang, Junwen Chen, Keiji Yanai
Self-supervised Multi-object Tracking with Cycle-Consistency

Multi-object tracking is a challenging video task that requires both locating the objects in the frames and associating the objects among the frames, which usually utilizes the tracking-by-detection paradigm. Supervised multi-object tracking methods have made stunning progress recently; however, the expensive annotation costs for bounding boxes and track ID labels limit the robustness and generalization ability of these models. In this paper, we learn a novel multi-object tracker using only unlabeled videos by designing a self-supervisory learning signal for an association model. Specifically, inspired by the cycle-consistency used in video correspondence learning, we propose to track the objects forwards and backwards, i.e., each detection in the first frame is supposed to be matched with itself after the forward-backward tracking. We utilize this cycle-consistency as the self-supervisory learning signal for our proposed multi-object tracker. Experiments conducted on the MOT17 dataset show that our model is effective in extracting discriminative association features, and our tracker achieves competitive performance compared to other trackers using the same pre-generated detections, including UNS20 [1], Tracktor++ [2], FAMNet [8], and CenterTrack [31].

Yuanhang Yin, Yang Hua, Tao Song, Ruhui Ma, Haibing Guan
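
To illustrate the forward-backward cycle-consistency signal described above, the sketch below soft-matches detection embeddings from frame t to frame t+1 and back, and penalises round trips that do not return to the original detection. It is a schematic loss written for this summary, not the authors' implementation; the embedding size, temperature, and equal detection counts in both frames are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(feat_t: torch.Tensor, feat_t1: torch.Tensor, tau: float = 0.1):
    """Forward-backward matching: detections in frame t are soft-matched to
    frame t+1 and back; the round trip should land on the identity."""
    a = F.normalize(feat_t, dim=1)     # (N, D) detection embeddings, frame t
    b = F.normalize(feat_t1, dim=1)    # (M, D) detection embeddings, frame t+1
    fwd = F.softmax(a @ b.t() / tau, dim=1)   # (N, M) soft matches t -> t+1
    bwd = F.softmax(b @ a.t() / tau, dim=1)   # (M, N) soft matches t+1 -> t
    roundtrip = fwd @ bwd                     # (N, N), ideally close to identity
    targets = torch.arange(a.size(0))
    return F.nll_loss(torch.log(roundtrip + 1e-8), targets)

x_t = torch.randn(6, 128)
x_t1 = x_t + 0.05 * torch.randn(6, 128)   # slightly perturbed "next frame" embeddings
print(cycle_consistency_loss(x_t, x_t1).item())
```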
Video-Based Precipitation Intensity Recognition Using Dual-Dimension and Dual-Scale Spatiotemporal Convolutional Neural Network

This paper proposes the dual-dimension and dual-scale spatiotemporal convolutional neural network, namely DDS-CNN, which consists of two modules, the global spatiotemporal module (GSM) and the local spatiotemporal module (LSM), for precipitation intensity recognition. The GSM uses 3D LSTM operations to study the influence of the relationship between sampling points on precipitation. The LSM takes 4D convolution operations and forms the convolution branches with various convolution kernels to learn the rain pattern of different precipitation. We evaluate the performance of DDS-CNN using the self-collected dataset, IMLab-RAIN-2018, and compare it with the state-of-the-art 3D models. DDS-CNN has the highest overall accuracy and achieves 98.63%. Moreover, we execute the ablation experiments to prove the effectiveness of the proposed modules.

Chih-Wei Lin, Zhongsheng Chen, Xiuping Huang, Suhui Yang
Low-Light Image Enhancement Based on U-Net and Haar Wavelet Pooling

The inevitable environmental and technical limitations of image capture mean that many images are taken in inadequate and unbalanced lighting conditions. Low-light image enhancement is popular for improving the visual quality of images, and low-light images often require advanced techniques to improve the perception of information for a human viewer. One of the main objectives when increasing the lighting is to retain patterns, texture, and style with minimal deviation from the original image. To this end, we propose a low-light image enhancement method with Haar wavelet-based pooling to preserve texture regions and increase their quality. The presented framework is based on the U-Net architecture to retain spatial information, with a multi-layer feature aggregation (MFA) method that obtains details from the low-level layers during the stylization processing. The encoder is based on dense blocks, while the decoder, the reverse of the encoder, extracts features that reconstruct the image. Experimental results show that combining the U-Net architecture with dense blocks and the wavelet-based pooling mechanism is an efficient approach to low-light image enhancement. Qualitative and quantitative evaluation demonstrates that the proposed framework reaches state-of-the-art accuracy with fewer resources than LeGAN.

Elissavet Batziou, Konstantinos Ioannidis, Ioannis Patras, Stefanos Vrochidis, Ioannis Kompatsiaris
Audio-Visual Sensor Fusion Framework Using Person Attributes Robust to Missing Visual Modality for Person Recognition

Audio-visual person recognition is the problem of recognizing an individual person class, defined by the training data, from multimodal audio-visual data. It has many applications in security, surveillance, biometrics, etc. Deep learning-based audio-visual person recognition methods report state-of-the-art recognition accuracy. However, existing audio-visual frameworks require the presence of both modalities and are therefore limited by the problem of missing modalities, where one or more modalities may be absent. In this paper, we formulate an audio-visual person recognition framework in which we define and address the missing visual modality problem. The proposed framework enhances the robustness of audio-visual person recognition even when the visual modality is missing, using audio-based person attributes and a multi-head attention transformer-based network, termed the CNN Transformer Network (CTNet). The audio-based person attributes, such as age, gender, and race, are predicted from the audio data using a deep learning model, termed the Speech-to-Attribute Network (S2A network). The attributes predicted from the audio data, which is assumed to be always available, provide additional cues for the person recognition framework. The predicted attributes, the audio data, and the image data, which may be missing, are given as input to the CTNet, which contains a multi-head attention branch. This branch addresses the problem of the missing visual modality by assigning attention weights to the audio features, the visual features, and the audio-based attributes. The proposed framework is validated on the public CREMA-D dataset through a comparative analysis and an ablation study. The results show that the proposed framework enhances the robustness of person recognition even when the visible-camera input is missing.
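The following sketch illustrates the general idea of attention-based fusion that tolerates a missing visual input; the dimensions, the learned placeholder token, and the mean pooling are illustrative assumptions, not the actual CTNet.

    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        # Toy fusion head: three modality tokens (audio, attributes, visual) are
        # re-weighted by multi-head attention; a learned placeholder stands in
        # when the visual feature is missing.
        def __init__(self, dim=256, heads=4, num_classes=91):
            super().__init__()
            self.missing_visual = nn.Parameter(torch.zeros(1, dim))
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cls = nn.Linear(dim, num_classes)

        def forward(self, audio_feat, attr_feat, visual_feat=None):
            # audio_feat, attr_feat, visual_feat: (B, dim); visual_feat may be None.
            b = audio_feat.size(0)
            if visual_feat is None:
                visual_feat = self.missing_visual.expand(b, -1)
            tokens = torch.stack([audio_feat, attr_feat, visual_feat], dim=1)  # (B, 3, dim)
            fused, _ = self.attn(tokens, tokens, tokens)   # attention re-weights the modalities
            return self.cls(fused.mean(dim=1))             # pooled representation -> person logits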

Vijay John, Yasutomo Kawanishi
Rumor Detection on Social Media by Using Global-Local Relations Encoding Network

With the rapid development of the Internet, social media has become the main platform for users to obtain news and share their opinions. While social media makes people's lives more convenient, it also provides favourable conditions for publishing and spreading rumors. Since manual detection takes a lot of time, it is crucial to use intelligent methods for rumor detection. Recent rumor detection methods mostly use the meta-paths of post propagation to construct isomorphic graphs and search for clues in the propagation structure. However, these methods neither fully exploit the global and local relations in the propagation graph nor consider the correlations between different types of nodes. In this paper, we propose a Global-Local Relations Encoding Network (GLREN), which encodes node relations in a heterogeneous graph from global and local perspectives. First, we explore the semantic similarity between all source posts and comment posts to generate global and local semantic representations. Then, we model user credibility levels and interaction relations to explore the potential relationship between users and misinformation. Finally, we introduce a root enhancement strategy to strengthen the influence of source posts and publisher information. The experimental results show that our model outperforms the state-of-the-art methods in accuracy by 3.0% and 6.0% on Twitter15 and Twitter16, respectively.

Xinxin Zhang, Shanliang Pan, Chengwu Qian, Jiadong Yuan
Unsupervised Encoder-Decoder Model for Anomaly Prediction Task

For the anomaly detection task in video sequences, CNN-based methods learn to describe normal situations without abnormal samples at training time by reconstructing the input frame or predicting the future frame, and then use the reconstruction error to indicate abnormal situations at testing time. Transformers, since being introduced to the vision field, have achieved results comparable to CNNs on many tasks and have also been applied to anomaly detection. In this work, we present an unsupervised learning method based on the Vision Transformer. The model has an encoder-decoder structure, and a memory module is used to extract and enhance the local pattern features of the video sequence. We detect anomalies on various datasets and visually compare distinct scenes within them. The experimental results suggest that the model performs effectively on the anomaly detection task.
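As a point of reference, prediction-error-based anomaly scoring in this line of work often boils down to something like the sketch below (a generic formulation, not necessarily the exact score used here).

    import torch

    def anomaly_score(pred_frame, true_frame):
        # Larger prediction error -> lower PSNR -> higher anomaly score.
        # Frames are assumed to be scaled to [0, 1].
        mse = torch.mean((pred_frame - true_frame) ** 2)
        psnr = 10 * torch.log10(1.0 / (mse + 1e-8))
        return -psnr   # typically normalised over a whole video before thresholding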

Jinmeng Wu, Pengcheng Shu, Hanyu Hong, Xingxun Li, Lei Ma, Yaozong Zhang, Ying Zhu, Lei Wang
CTDA: Contrastive Temporal Domain Adaptation for Action Segmentation

In video action segmentation scenarios, intelligent models require sufficient training data. However, the significant expense of human annotation for action segmentation makes this prohibitively expensive, and only a very limited number of training videos may be available. Furthermore, large spatio-temporal variations exist between training and test data. Therefore, it is critical to learn effective representations from few training videos and to efficiently utilize unlabeled test videos. To this end, we present a new Contrastive Temporal Domain Adaptation (CTDA) framework for action segmentation. Specifically, in the self-supervised learning module, two auxiliary tasks are defined for binary and sequential domain prediction, and they are addressed by combining domain adaptation and contrastive learning. Furthermore, a multi-stage architecture is devised to obtain the final action segmentation results. Thorough experimental evaluation shows that the CTDA framework achieves the highest action segmentation performance.

Hongfeng Han, Zhiwu Lu, Ji-Rong Wen
Multi-scale and Multi-stage Deraining Network with Fourier Space Loss

The goal of rain streak removal is to recover the rain-free background scene of an image degraded by rain streaks. Most current deep convolutional neural network methods achieve impressive performance. However, these methods still cannot capture the discriminative features needed to distinguish rain streaks from important image content. To solve this problem, we propose a multi-scale and multi-stage deraining network trained in an end-to-end manner. Specifically, we design a multi-scale rain streak extraction module to capture complex rain streak features across different scales through a multi-scale selective kernel attention mechanism. In addition, multi-stage learning is used to extract deeper feature representations of rain streaks and to fuse background information from different stages. Furthermore, we introduce a Fourier space loss function to reduce the loss of high-frequency information in the background image and improve the quality of the deraining results. Extensive experiments demonstrate that our network performs favorably against state-of-the-art deraining methods.
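One plausible form of such a Fourier space loss is sketched below; the amplitude-only penalty is an assumption for illustration, since the paper's exact definition is not given here.

    import torch

    def fourier_space_loss(derained, ground_truth):
        # Penalise amplitude differences of the 2D FFT so the network is
        # explicitly pushed to keep high-frequency background detail.
        fft_pred = torch.fft.fft2(derained, dim=(-2, -1))
        fft_gt = torch.fft.fft2(ground_truth, dim=(-2, -1))
        return torch.mean(torch.abs(torch.abs(fft_pred) - torch.abs(fft_gt)))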

Zhaoyong Yan, Liyan Ma, Xiangfeng Luo, Yan Sun
DHP: A Joint Video Download and Dynamic Bitrate Adaptation Algorithm for Short Video Streaming

With the development of multimedia technology and the upgrading of mobile terminal equipment, short video platforms and applications are becoming increasingly popular. Compared with traditional long videos, short video users tend to swipe away from the currently viewed video more frequently. Unviewed preloaded video chunks cause a large amount of bandwidth waste and do not contribute to improving the user QoE. Since bandwidth savings conflict with user QoE improvements, it is very challenging to satisfy both. To solve this problem, this paper proposes DHP, a joint video download and dynamic bitrate adaptation algorithm for short video streaming. DHP makes the chunk download decision based on a maximum buffer model and the retention rate, and makes the dynamic bitrate adaptation decision according to past bandwidth and the current buffer size. Experimental results show that DHP can reduce bandwidth waste by up to 66.74% and improve the QoE by up to 42.5% compared to existing solutions under various network conditions.
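The decision logic can be pictured with the toy Python sketch below; the thresholds, bitrate ladder, and retention model are hypothetical placeholders rather than DHP's actual policy.

    def should_download_next_chunk(buffer_s, max_buffer_s, retention_prob):
        # Preload less aggressively for videos the user is likely to swipe away from:
        # the target buffer shrinks with the retention probability of the video.
        return buffer_s < retention_prob * max_buffer_s

    def pick_bitrate(throughput_history_kbps, buffer_s, ladder_kbps=(750, 1200, 1850, 2850)):
        # Rate pick from past bandwidth and current buffer: use the harmonic mean
        # of recent (non-empty, non-zero) throughput samples, and only climb the
        # ladder when the buffer is healthy.
        n = len(throughput_history_kbps)
        harmonic = n / sum(1.0 / t for t in throughput_history_kbps)
        safety = 0.9 if buffer_s > 5 else 0.6   # be conservative when the buffer is low
        affordable = [b for b in ladder_kbps if b <= safety * harmonic]
        return affordable[-1] if affordable else ladder_kbps[0]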

Wenhua Gao, Lanju Zhang, Hao Yang, Yuan Zhang, Jinyao Yan, Tao Lin
Generating New Paintings by Semantic Guidance

To facilitate the human painting process, numerous research efforts have been devoted to teaching machines how to “paint like a human”, which is a challenging problem. Recent stroke-based rendering algorithms generate non-photorealistic imagery using a number of strokes to mimic a target image. However, previous methods can only draw the content of a single target image onto the canvas, which limits their generative ability. We propose a novel painting approach that teaches machines to paint with multiple target images and thereby generate new paintings. We take the order of human painting into account and propose a combined stroke rendering method that can merge the content of multiple images into the same painting. We use semantic segmentation to obtain semantic information from multiple images and add the semantic information from different images to the same painting process. Finally, our model can generate new paintings whose content comes from different images, guided by this semantic information. Experimental results demonstrate that our model can effectively generate new paintings and thus assist human creativity.

Ting Pan, Fei Wang, Junzhou Xie, Weifeng Liu
A Multi-Stream Fusion Network for Image Splicing Localization

In this paper, we address the problem of image splicing localization with a multi-stream network architecture that processes the raw RGB image in parallel with other handcrafted forensic signals. Unlike previous methods that either use only the RGB images or stack several signals in a channel-wise manner, we propose an encoder-decoder architecture that consists of multiple encoder streams. Each stream is fed with either the tampered image or handcrafted signals and processes them separately to capture relevant information from each one independently. Finally, the extracted features from the multiple streams are fused in the bottleneck of the architecture and propagated to the decoder network that generates the output localization map. We experiment with two handcrafted algorithms, i.e., DCT and Splicebuster. Our proposed approach is benchmarked on three public forensics datasets, demonstrating competitive performance against several competing methods and achieving state-of-the-art results, e.g., 0.898 AUC on CASIA.
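A toy version of such a multi-stream encoder-decoder is sketched below; the layer sizes, the single handcrafted-signal stream, and fusion by simple concatenation are simplifying assumptions for illustration, not the paper's architecture.

    import torch
    import torch.nn as nn

    def conv_block(cin, cout):
        return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                             nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    class MultiStreamSplicingNet(nn.Module):
        # One encoder stream for the RGB image, one for a handcrafted forensic
        # signal (e.g. a DCT-based map), fused at the bottleneck and decoded
        # into a per-pixel localization map.
        def __init__(self):
            super().__init__()
            self.rgb_enc = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
            self.sig_enc = nn.Sequential(conv_block(1, 32), conv_block(32, 64))
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 1, 1))   # per-pixel splicing logit

        def forward(self, rgb, signal):
            fused = torch.cat([self.rgb_enc(rgb), self.sig_enc(signal)], dim=1)
            return self.decoder(fused)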

Maria Siopi, Giorgos Kordopatis-Zilos, Polychronis Charitidis, Ioannis Kompatsiaris, Symeon Papadopoulos
Fusion of Multiple Classifiers Using Self Supervised Learning for Satellite Image Change Detection

Deep learning methods are widely used in the domain of change detection in remote sensing images. While datasets of that kind are abundant, annotated images specific to the task at hand are still scarce. Neural networks trained with self-supervised learning aim to harness large volumes of unlabeled high-resolution satellite images to help find better solutions for the change detection problem. In this paper we experiment with this approach by presenting four different change detection methodologies and propose a fusion method that, under specific parameters, can provide better results. We evaluate our results using two openly available datasets of Sentinel-2 satellite images, S2MTCP and OSCD, and investigate the impact of two different Sentinel-2 band combinations on our final predictions. Finally, we summarize the benefits of this approach and propose future areas of interest that could further enhance the change detection task’s outcomes.

Alexandros Oikonomidis, Maria Pegia, Anastasia Moumtzidou, Ilias Gialampoukidis, Stefanos Vrochidis, Ioannis Kompatsiaris
Improving the Robustness to Variations of Objects and Instructions with a Neuro-Symbolic Approach for Interactive Instruction Following

An interactive instruction following task has been proposed as a benchmark for learning to map natural language instructions and first-person vision into sequences of actions for interacting with objects in 3D environments. We found that an existing end-to-end neural model for this task tends to fail to interact with objects that have unseen attributes and to follow varied instructions. We assume that this problem is caused by the high sensitivity of neural feature extraction to small changes in the vision and language inputs. To mitigate this problem, we propose a neuro-symbolic approach that utilizes high-level symbolic features, which are robust to small changes in the raw inputs, as intermediate representations. We verify the effectiveness of our model with the subtask evaluation on the ALFRED benchmark. Our experiments show that our approach significantly outperforms the end-to-end neural model by 9, 46, and 74 points in success rate on the ToggleObject, PickupObject, and SliceObject subtasks in unseen environments, respectively.

Kazutoshi Shinoda, Yuki Takezawa, Masahiro Suzuki, Yusuke Iwasawa, Yutaka Matsuo
Interpretable Driver Fatigue Estimation Based on Hierarchical Symptom Representations

Traffic accidents caused by driver fatigue lead to millions of deaths and enormous financial losses every year. Current end-to-end methods for driver fatigue detection are not capable of distinguishing detailed fatigue symptoms or interpretably inferring the fatigue state. In this paper, we propose an interpretable driver fatigue detection method with hierarchical fatigue symptom representations. In pursuit of a more general and interpretable approach, we detect detailed fatigue symptoms before inferring the driver state. First, we propose a hierarchical method that accurately classifies abnormal behaviors into detailed fatigue symptoms. Moreover, to fuse the fatigue symptom detection results accurately and efficiently, we propose an effective and interpretable fatigue estimation method based on maximum a posteriori estimation and experience-based constraints. Finally, we evaluate the proposed method on a driver fatigue detection benchmark dataset, and the experimental results confirm the feasibility and effectiveness of the proposed method.

Jiaqin Lin, Shaoyi Du, Yuying Liu, Zhiqiang Tian, Ting Qu, Nanning Zheng
VAISL: Visual-Aware Identification of Semantic Locations in Lifelog

Organising and preprocessing are crucial steps for performing analysis on lifelogs. This paper presents a method for preprocessing, enriching, and segmenting lifelogs based on GPS trajectories and images captured from wearable cameras. The proposed method consists of four components: data cleaning, stop/trip point classification, post-processing, and event characterisation. The novelty of this paper lies in the incorporation of a visual module (using a pretrained CLIP model) to improve outlier detection, correct classification errors, and identify each event’s movement mode or location name. This visual component is capable of addressing imprecise boundaries in GPS trajectories and the partition of clusters due to data drift. The results are encouraging, which further emphasises the importance of visual analytics for organising lifelog data.
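The visual component can be approximated with off-the-shelf CLIP zero-shot scoring, as in the sketch below; the prompts, model variant, and file path are illustrative assumptions, not the exact setup of VAISL.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Hypothetical prompts for candidate movement modes / locations.
    prompts = ["a photo taken while walking", "a photo taken inside a car",
               "a photo taken on public transport", "a photo taken indoors at a desk"]
    text = clip.tokenize(prompts).to(device)
    image = preprocess(Image.open("lifelog_frame.jpg")).unsqueeze(0).to(device)

    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1)
    print(dict(zip(prompts, probs.squeeze(0).tolist())))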

Ly-Duyen Tran, Dongyun Nie, Liting Zhou, Binh Nguyen, Cathal Gurrin
Multi-scale Gaussian Difference Preprocessing and Dual Stream CNN-Transformer Hybrid Network for Skin Lesion Segmentation

Skin lesion segmentation from dermoscopic images has been a long-standing and challenging problem, and it is important for improving the analysis of skin cancer. Due to the large variation of melanin in the lesion area, the large number of hairs covering the lesion area, and unclear lesion boundaries, most previous works struggle to accurately segment the lesion area. In this paper, we propose a Multi-Scale Gaussian Difference Preprocessing and Dual Stream CNN-Transformer Hybrid Network for skin lesion segmentation, which can accurately segment a high-fidelity lesion area from a dermoscopic image. Specifically, we design three sets of Gaussian difference convolution kernels: one to significantly enhance the lesion area and its edge information, one to conservatively enhance the lesion area and its edge information, and one to remove noise features such as hair. Through this multi-scale Gaussian difference enhancement, the model can easily extract and represent the enhanced lesion and lesion-edge information while reducing noise. Secondly, we adopt a dual-stream network to extract features from the Gaussian difference image and the original image separately and fuse them in the feature space to accurately align the feature information. Thirdly, we apply a hybrid architecture of convolutional neural networks (CNN) and vision transformers (ViT) to better exploit local and global information. Finally, we use a coordinate attention mechanism and a self-attention mechanism to enhance the sensitivity to the necessary features. Extensive experimental results on the ISIC 2016, PH2, and ISIC 2018 datasets demonstrate that our approach achieves compelling performance in skin lesion segmentation.
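The core Gaussian-difference idea can be illustrated with a few lines of OpenCV; the sigma values below are placeholders, not the three tuned kernel sets described in the paper.

    import cv2
    import numpy as np

    def difference_of_gaussians(gray, sigma_fine=1.0, sigma_coarse=2.0):
        # Subtracting a coarse blur from a fine blur acts as a band-pass filter
        # that highlights lesion boundaries while suppressing detail at other
        # scales (e.g. hair-like high-frequency noise).
        fine = cv2.GaussianBlur(gray, (0, 0), sigma_fine)
        coarse = cv2.GaussianBlur(gray, (0, 0), sigma_coarse)
        dog = fine.astype(np.float32) - coarse.astype(np.float32)
        return cv2.normalize(dog, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)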

Xin Zhao, Zhihang Ren
AutoRF: Auto Learning Receptive Fields with Spatial Pooling

The search space is crucial in neural architecture search (NAS) and can determine the upper limit of performance. Most methods focus on depth and width when designing the search space, ignoring the receptive field. With a larger receptive field, a model is able to aggregate hierarchical information and strengthen its representational power. However, expanding the receptive field directly with large convolution kernels incurs high computational complexity. We instead enlarge the receptive field by introducing pooling operations with little overhead. In this paper, we propose Auto Learning Receptive Fields (AutoRF), a pooling-based auto-learning approach for receptive field search and, to our knowledge, the first attempt at automatic attention module design with respect to the adaptive receptive field. Our proposed search space theoretically encompasses typical multi-scale receptive field integration modules. Detailed experiments demonstrate the generalization ability of AutoRF and show that it outperforms various hand-crafted methods as well as NAS-based ones.

Peijie Dong, Xin Niu, Zimian Wei, Hengyue Pan, Dongsheng Li, Zhen Huang
In-Air Handwritten Chinese Text Recognition with Attention Convolutional Recurrent Network

In-air handwriting is a new and more natural way of human-computer interaction with broad application prospects. Existing online handwritten Chinese text recognition models either convert the trajectory data into image-like representations and use a two-dimensional convolutional neural network (2DCNN) for feature extraction, or directly process the trajectory sequence with Long Short-Term Memory (LSTM). However, when using a 2DCNN, much information is lost in the conversion into images, and LSTM networks are prone to gradient problems. We therefore propose an attention convolutional recurrent network (ACRN) for in-air handwritten Chinese text, which introduces a one-dimensional convolutional neural network (1DCNN) with dilated convolutions to extract features directly from the trajectory data. The ACRN then uses LSTM combined with a multi-head attention mechanism to focus on key characters in the handwritten Chinese text, mines multi-level dependencies, and feeds the output to a softmax layer for classification. Finally, the ACRN uses the Connectionist Temporal Classification (CTC) objective function, which requires no input-output alignment, to decode the results. We conduct experiments on the CASIA-OLHWDB2.0-2.2 dataset and the in-air handwritten Chinese text dataset IAHCT-UCAS2018. Experimental results demonstrate that, compared with previous methods, our method obtains a more compact model with higher recognition accuracy.
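For readers unfamiliar with alignment-free training, the snippet below shows how a CTC objective is typically wired up in PyTorch; the sequence length, batch size, and vocabulary size are placeholders, not the ACRN's actual configuration.

    import torch
    import torch.nn as nn

    num_classes = 3000                                  # character vocabulary + 1 blank symbol
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    # (T, batch, classes) log-probabilities as produced by the recognition network.
    log_probs = torch.randn(150, 8, num_classes).log_softmax(2)
    targets = torch.randint(1, num_classes, (8, 20))    # ground-truth character indices
    input_lengths = torch.full((8,), 150, dtype=torch.long)
    target_lengths = torch.full((8,), 20, dtype=torch.long)

    loss = ctc(log_probs, targets, input_lengths, target_lengths)  # no input-output alignment required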

Zhihong Wu, Xiwen Qu, Jun Huang, Xuangou Wu

BNI: Brave New Ideas

Frontmatter
Multimedia Datasets: Challenges and Future Possibilities

Public multimedia datasets can enhance knowledge discovery and model development, as more researchers have the opportunity to contribute to exploring them. However, as these datasets become larger and more multimodal, efficient storage and sharing, in addition to analysis, can become a challenge. Furthermore, there are inherent privacy risks when publishing any data containing sensitive information about the participants, especially when different data sources are combined, leading to unforeseen discoveries. Proposed solutions include standard methods for anonymization and newer approaches that use generative models to produce fake data that can be used in place of real data. However, there are many open questions regarding whether these generative models retain information about the data used to train them and whether this information could be retrieved, making them not as privacy-preserving as one may think. This paper reviews some important milestones that the research community has reached so far in addressing key challenges in multimedia data analysis. In addition, we discuss the long-term and short-term challenges associated with publishing open multimedia datasets, including questions regarding efficient sharing, data modeling, and ensuring that the data is appropriately anonymized.

Thu Nguyen, Andrea M. Storås, Vajira Thambawita, Steven A. Hicks, Pål Halvorsen, Michael A. Riegler
The Importance of Image Interpretation: Patterns of Semantic Misclassification in Real-World Adversarial Images

Adversarial images are created with the intention of causing an image classifier to produce a misclassification. In this paper, we propose that adversarial images should be evaluated based on semantic mismatch, rather than label mismatch, as used in current work. In other words, we propose that an image of a “mug” would be considered adversarial if classified as “turnip”, but not as “cup”, as current systems would assume. Our novel idea of taking semantic misclassification into account in the evaluation of adversarial images offers two benefits. First, it is a more realistic conceptualization of what makes an image adversarial, which is important in order to fully understand the implications of adversarial images for security and privacy. Second, it makes it possible to evaluate the transferability of adversarial images to a real-world classifier, without requiring the classifier’s label set to have been available during the creation of the images. The paper carries out an evaluation of a transfer attack on a real-world image classifier that is made possible by our semantic misclassification approach. The attack reveals patterns in the semantics of adversarial misclassifications that could not be investigated using conventional label mismatch.
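One simple way to operationalise semantic mismatch is via WordNet similarity between the true and predicted labels, as sketched below; the similarity measure and threshold are illustrative choices, not necessarily those used in the paper.

    from nltk.corpus import wordnet as wn   # requires a prior nltk.download('wordnet')

    def semantically_mismatched(true_label, predicted_label, threshold=0.4):
        # Treat a prediction as adversarial only if it is semantically far from
        # the true label, e.g. "mug" -> "turnip" but not "mug" -> "cup".
        a = wn.synsets(true_label)
        b = wn.synsets(predicted_label)
        if not a or not b:
            return True                     # unknown words: fall back to plain label mismatch
        similarity = max((s.wup_similarity(t) or 0.0) for s in a for t in b)
        return similarity < threshold

    print(semantically_mismatched("mug", "cup"))     # likely False: same semantic neighbourhood
    print(semantically_mismatched("mug", "turnip"))  # likely True: genuine semantic mismatch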

Zhengyu Zhao, Nga Dang, Martha Larson

Research2Biz

Frontmatter
Students Take Charge of Climate Communication

It is an arduous task to communicate the gravity and complexity of climate change in an engaging and fact-based manner. One might even call this a wicked problem: a problem that is difficult to define and does not have one specific solution. Climate communication is a complex societal challenge that requires processes of dialogue and argumentation between a variety of stakeholders. In this paper we present a pedagogical approach in which thirty-one undergraduate Media and Interaction Design students collaborated with media companies to explore innovative ways to communicate climate change and make it more engaging for citizens. The students conducted multi-method evaluations of existing journalistic work with citizens from a variety of demographics, and then conceptualized and developed innovative prototypes for communicating climate change causes, impacts, and future potentials. This project demonstrates the potential of innovation pedagogy to establish transdisciplinary collaboration between students and industry partners when dealing with the wicked problem of climate change communication. We explain how the pedagogical method works, describe the results of the collaboration, and discuss the outcome. While the approach has the potential to improve climate change communication, it also creates tension among students due to its normative, problem-oriented nature.

Fredrik Håland Jensen, Oda Elise Nordberg, Andy Opel, Lars Nyre

Demo

Frontmatter
Social Relation Graph Generation on Untrimmed Video

For a more intuitive understanding of videos, we demonstrate SRGG-UnVi, a social relation graph generation system for untrimmed videos. Given a video, the system can combine existing knowledge to build a dynamic relation graph and a static multi-relation graph. SRGG-UnVi integrates various multimodal technologies, including Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), face recognition and clustering, multimodal video relation extraction, etc. The system consists of three modules: (1) the video processing engine takes advantage of parallelization to efficiently provide multimodal information to the other modules; (2) the relation recognition module utilizes the multimodal information to extract the relationships between characters in each scene; (3) the graph generation module generates the social relation graph for users.

Yibo Hu, Chenghao Yan, Chenyu Cao, Haorui Wang, Bin Wu
Improving Parent-Child Co-play in a Roblox Game

Co-play of digital games between parents and their children is a fruitful but underutilized parental mediation strategy. Previous research on this topic has resulted in various design recommendations meant to support and encourage co-play. However, most of these recommendations have yet to be applied and systematically validated within co-play focused games. Building on such design recommendations, our demo paper bridges this research gap by advancing the co-play experience of an existing Roblox game, Funomena’s Magic Beanstalk. In our study, we started from a subset of potential design recommendations to redesign two of Magic Beanstalk’s mini-games. The two mini-games, redesigned in-house, were then evaluated by parent-child dyads in a qualitative study comparing the co-play experience of the original and the redesigned games. This initial evaluation demonstrates that designing games according to established design recommendations has the potential to improve co-play experiences.

Jonathan Geffen
Taylor – Impersonation of AI for Audiovisual Content Documentation and Search

While AI-based audiovisual analysis tools have undoubtedly made huge progress, integrating them into media production and archiving workflows is still challenging, as the provided annotations may not match needs in terms of type, granularity, and accuracy of metadata, and do not align well with existing workflows. We propose a system for annotation and search in media archive applications, using a range of AI-based analysis methods. In order to facilitate the communication of explanations and collect relevance feedback, an impersonation of the system’s intelligence, named Taylor, is included as an element of the user interface.

Victor Adriel de Jesus Oliveira, Gernot Rottermanner, Magdalena Boucher, Stefanie Größbacher, Peter Judmaier, Werner Bailer, Georg Thallinger, Thomas Kurz, Jakob Frank, Christoph Bauer, Gabriele Fröschl, Michael Batlogg
Virtual Try-On Considering Temporal Consistency for Videoconferencing

Virtual fitting, in which a person’s image is altered to show them wearing an arbitrary garment, is expected to be applied to shopping sites and videoconferencing. For real-time virtual fitting, image-based methods using knowledge distillation can generate high-quality fitting images from only an image of the garment and the person, without requiring additional data such as pose information. However, few studies perform fast and stable virtual fitting from arbitrary clothing images on real person images in settings such as videoconferencing, where temporal consistency matters. The purpose of this demo is therefore to perform robust virtual fitting with temporal consistency for videoconferencing. First, we built a virtual fitting system and verified how well an existing fast image-based fitting method works on webcam video. The results showed that, because existing methods neither adapt the dataset nor consider temporal consistency, they are unstable on videoconferencing-like input. We therefore propose to train a model on a dataset adjusted to resemble videoconferencing and to add a temporal consistency loss. Qualitative evaluation of the proposed model confirms that it exhibits less flicker than the baseline. Figure 1 shows an example of our try-on system running on Zoom.
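In its simplest form, the added temporal consistency term can look like the sketch below; this plain frame-difference penalty is an assumption for illustration, and the demo's actual loss may additionally compensate for motion.

    import torch

    def temporal_consistency_loss(output_t, output_t1):
        # Penalise frame-to-frame changes in the generated try-on output so the
        # clothing does not flicker between consecutive webcam frames.
        return torch.mean(torch.abs(output_t1 - output_t))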

Daiki Shimizu, Keiji Yanai
Backmatter
Metadata
Title: MultiMedia Modeling
Editors: Duc-Tien Dang-Nguyen, Cathal Gurrin, Martha Larson, Alan F. Smeaton, Stevan Rudinac, Minh-Son Dao, Christoph Trattner, Phoebe Chen
Copyright Year: 2023
Electronic ISBN: 978-3-031-27818-1
Print ISBN: 978-3-031-27817-4
DOI: https://doi.org/10.1007/978-3-031-27818-1
