MultiMedia Modeling
27th International Conference, MMM 2021, Prague, Czech Republic, June 22–24, 2021, Proceedings, Part II
- 2021
- Book
- Edited by
- Jakub Lokoč
- Prof. Tomáš Skopal
- Prof. Dr. Klaus Schoeffmann
- Vasileios Mezaris
- Dr. Xirong Li
- Dr. Stefanos Vrochidis
- Dr. Ioannis Patras
- Book series
- Lecture Notes in Computer Science
- Publisher
- Springer International Publishing
About this book
The two-volume set LNCS 12572 and 12573 constitutes the thoroughly refereed proceedings of the 27th International Conference on MultiMedia Modeling, MMM 2021, held in Prague, Czech Republic, in June 2021.
Of the 211 submitted regular papers, 40 papers were selected for oral presentation and 33 for poster presentation; 16 special session papers were accepted as well as 2 papers for a demo presentation and 17 papers for participation at the Video Browser Showdown 2021. The papers cover topics such as: multimedia indexing; multimedia mining; multimedia abstraction and summarization; multimedia annotation, tagging and recommendation; multimodal analysis for retrieval applications; semantic analysis of multimedia and contextual data; multimedia fusion methods; multimedia hyperlinking; media content browsing and retrieval tools; media representation and algorithms; audio, image, video processing, coding and compression; multimedia sensors and interaction modes; multimedia privacy, security and content protection; multimedia standards and related issues; advances in multimedia networking and streaming; multimedia databases, content delivery and transport; wireless and mobile multimedia networking; multi-camera and multi-view systems; augmented and virtual reality, virtual environments; real-time and interactive multimedia applications; mobile multimedia applications; multimedia web applications; multimedia authoring and personalization; interactive multimedia and interfaces; sensor networks; social and educational multimedia applications; and emerging trends.
Table of contents
- Frontmatter
MSCANet: Adaptive Multi-scale Context Aggregation Network for Congested Crowd Counting
Yani Zhang, Huailin Zhao, Fangbo Zhou, Qing Zhang, Yanjiao Shi, Lanjun Liang
Abstract: Crowd counting has achieved significant progress with deep convolutional neural networks. However, most existing methods do not fully utilize spatial context information, and it is difficult for them to count congested crowds accurately. To this end, we propose a novel Adaptive Multi-scale Context Aggregation Network (MSCANet), in which a Multi-scale Context Aggregation module (MSCA) is designed to adaptively extract and aggregate contextual information from different scales of the crowd. More specifically, for each input, we first extract multi-scale context features via atrous convolution layers. Then, the multi-scale context features are progressively aggregated via channel attention to enrich the crowd representations at different scales. Finally, a \(1\times 1\) convolution layer is applied to regress the crowd density. We perform extensive experiments on three public datasets: ShanghaiTech Part_A, UCF_CC_50 and UCF-QNRF, and the experimental results demonstrate the superiority of our method compared to current state-of-the-art methods.
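The attention-based aggregation step described above can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: the softmax-over-channel-means weighting and the function names are assumptions standing in for the paper's channel attention.

```python
import math

def channel_attention_weights(feature_maps):
    """Squeeze each per-scale feature map by global average pooling,
    then normalize the squeezed values with a softmax (assumed form)."""
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]  # per-channel mean
    m = max(squeezed)                                      # stabilize exp
    exps = [math.exp(s - m) for s in squeezed]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_scales(scale_features):
    """Aggregate flattened per-scale feature maps as a weighted sum,
    with weights given by the channel attention above."""
    w = channel_attention_weights(scale_features)
    n = len(scale_features[0])
    return [sum(w[k] * scale_features[k][i] for k in range(len(w)))
            for i in range(n)]
```

In the real network the per-scale features would come from atrous convolutions with different dilation rates, and the weighted aggregate would feed the final \(1\times 1\) regression layer.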
Tropical Cyclones Tracking Based on Satellite Cloud Images: Database and Comprehensive Study
Cheng Huang, Sixian Chan, Cong Bai, Weilong Ding, Jinglin Zhang
Abstract: The tropical cyclone is a type of disastrous weather that causes serious damage to human communities. It is necessary to forecast tropical cyclones efficiently and accurately to reduce the losses they cause. With the development of computer vision and satellite technology, high-quality meteorological data can be obtained, and advanced technologies have been proposed in the visual tracking domain. This makes it possible to develop algorithms for automatic tropical cyclone tracking, which plays a critical role in tropical cyclone forecasting. In this paper, we present a novel database for Tropical Cyclone Tracking based on Satellite Cloud Images, called TCTSCI. To the best of our knowledge, TCTSCI is the first satellite cloud image database for tropical cyclone tracking. It consists of 28 video sequences and a total of 3,432 frames with \(6001\times 6001\) pixels. It includes tropical cyclones of five different intensities occurring in 2019. Each frame is scientifically inspected and labeled with authoritative tropical cyclone data. Besides, to encourage and facilitate research on multimodal methods for tropical cyclone tracking, TCTSCI provides not only visual bounding box annotations but also multimodal meteorological data of tropical cyclones. We evaluate 11 state-of-the-art and widely used trackers using OPE and EAO metrics and analyze the challenges these trackers face on TCTSCI.
Image Registration Improved by Generative Adversarial Networks
Shiyan Jiang, Ci Wang, Chang Huang
Abstract: The performance of most image registration methods decreases if the quality of the image to be registered is poor, especially when it is contaminated with heavy distortions such as noise, blur, and uneven degradation. To solve this problem, a generative adversarial network (GAN)-based approach with specially designed loss functions is proposed to improve image quality for better registration. Specifically, given the paired images, the generator network enhances the distorted image and the discriminator network compares the enhanced image with the ideal image. To efficiently discriminate the enhanced image, the loss function is designed to combine a perceptual loss and an adversarial loss, where the former measures image similarity and the latter pushes the enhanced solution toward the natural image manifold. After enhancement, image features are more accurate and the registrations between feature point pairs are more consistent.
Deep 3D Modeling of Human Bodies from Freehand Sketching
Kaizhi Yang, Jintao Lu, Siyu Hu, Xuejin Chen
Abstract: Creating high-quality 3D human body models by freehand sketching is challenging because of the sparsity and ambiguity of hand-drawn strokes. In this paper, we present a sketch-based modeling system for human bodies using deep neural networks. Considering the large variety of human body shapes and poses, we adopt the widely used parametric representation, SMPL, to produce high-quality models that are compatible with many further applications, such as telepresence, game production, and so on. However, precisely mapping hand-drawn sketches to the SMPL parameters is non-trivial due to the non-linearity and dependency between articulated body parts. In order to resolve the huge ambiguity in mapping sketches onto the manifold of human bodies, we introduce the skeleton as an intermediate representation. Our skeleton-aware modeling network first interprets sparse joints from coarse sketches and then predicts the SMPL parameters based on joint-wise features. This skeleton-aware intermediate representation effectively reduces the ambiguity and complexity between the two high-dimensional spaces. Based on our lightweight interpretation network, our system supports interactive creation and editing of 3D human body models by freehand sketching.
Two-Stage Real-Time Multi-object Tracking with Candidate Selection
Fan Wang, Lei Luo, En Zhu
Abstract: In recent years, multi-object tracking has usually been treated as a data association problem based on detection results, also known as tracking-by-detection. Such methods are often difficult to adapt to the requirements of time-critical video analysis applications, which consider detection and tracking together. In this paper, we propose to accomplish object detection and appearance embedding via a two-stage network. On the one hand, we accelerate the network inference process by sharing a set of low-level features and introducing a Position-Sensitive RoI pooling layer to better estimate the classification probability. On the other hand, to handle unreliable detection results produced by the two-stage network, we select candidates from the outputs of both detection and tracking based on a novel scoring function that considers classification probability and tracking confidence together. In this way, we can achieve an effective trade-off between multi-object tracking accuracy and speed. Moreover, we conduct cascade data association based on the selected candidates to form object trajectories. Extensive experiments show that each component of the tracking framework is effective and our real-time tracker can achieve state-of-the-art performance.
Tell as You Imagine: Sentence Imageability-Aware Image Captioning
Kazuki Umemura, Marc A. Kastner, Ichiro Ide, Yasutomo Kawanishi, Takatsugu Hirayama, Keisuke Doman, Daisuke Deguchi, Hiroshi Murase
Abstract: Image captioning as a multimedia task is advancing in terms of performance in generating captions for general purposes. However, it remains difficult to tailor generated captions to different applications. In this paper, we propose a sentence imageability-aware image captioning method to generate captions tailored to various applications. Sentence imageability describes how easily the caption can be mentally imagined. This concept is applied to the captioning model to obtain a better understanding of the perception of a generated caption. First, we extend an existing image caption dataset by augmenting the diversity of its captions. Then, a sentence imageability score for each augmented caption is calculated. A modified image captioning model is trained using this extended dataset to generate captions tailored to a specified imageability score. Experiments showed promising results in generating imageability-aware captions. In particular, results from a subjective experiment showed that the perception of the generated captions correlates with the specified score.
Deep Face Swapping via Cross-Identity Adversarial Training
Shuhui Yang, Han Xue, Jun Ling, Li Song, Rong Xie
Abstract: Generative Adversarial Networks (GANs) have shown promising improvements in face synthesis and image manipulation. However, it remains difficult to swap the faces in videos with a specific target. The most well-known face swapping method, Deepfakes, focuses on reconstructing the face image with an auto-encoder while paying less attention to the identity gap between the source and target faces, which causes the swapped face to look like both the source face and the target face. In this work, we propose to incorporate a cross-identity adversarial training mechanism for highly photo-realistic face swapping. Specifically, we introduce a corresponding discriminator that tries to faithfully distinguish the swapped faces, reconstructed faces and real faces during the training process. In addition, an attention mechanism is applied to make our network robust to variations in illumination. Comprehensive experiments are conducted to demonstrate the superiority of our method over baseline models in both quantitative and qualitative fashion.
Res2-Unet: An Enhanced Network for Generalized Nuclear Segmentation in Pathological Images
Shuai Zhao, Xuanya Li, Zhineng Chen, Chang Liu, Changgen Peng
Abstract: The morphology of nuclei in a pathological image plays an essential role in helping pathologists derive high-quality diagnoses. Recently, deep learning techniques have pushed this field forward significantly in generalization ability, i.e., segmenting nuclei from different patients and organs with the same CNN model. However, it remains challenging to design an effective network that segments nuclei accurately, due to their diverse color and morphological appearances, nuclei touching or overlapping, etc. In this paper, we propose a novel network named Res2-Unet to relieve this problem. Res2-Unet inherits the contracting-expansive structure of U-Net. It is featured by employing advanced network modules such as the residual and squeeze-and-excitation (SE) modules to enhance the segmentation capability. The residual module is utilized in both the contracting and expansive paths for comprehensive feature extraction and fusion, respectively, while the SE module enables selective feature propagation between the two paths. We evaluate Res2-Unet on two public nuclei segmentation benchmarks. The experiments show that by equipping the modules individually and jointly, performance gains are consistently observed compared to the baseline and several existing methods.
Automatic Diagnosis of Glaucoma on Color Fundus Images Using Adaptive Mask Deep Network
Gang Yang, Fan Li, Dayong Ding, Jun Wu, Jie Xu
Abstract: Glaucoma, a disease characterized by a progressive and irreversible defect of the visual field, requires a lifelong course of treatment once it is confirmed, which highlights the importance of early glaucoma detection. Due to the diversity of glaucoma diagnostic indicators and the diagnostic uncertainty of ophthalmologists, deep learning has been applied to glaucoma diagnosis by automatically extracting characteristics from color fundus images, and has recently achieved great performance. In this paper, we propose a novel adaptive mask deep network for effective glaucoma diagnosis on retinal fundus images, which fully utilizes the prior knowledge of ophthalmologists on glaucoma diagnosis to synthesize attention masks of color fundus images that locate a reasonable region of interest. Based on the synthesized masks, our method can pay careful attention to the effective visual representation of glaucoma. Experiments on several public and private fundus datasets illustrate that our method focuses on the significant area for glaucoma diagnosis and simultaneously achieves great performance in both academic environments and practical medical applications, providing a useful contribution to improving the automatic diagnosis of glaucoma.
Initialize with Mask: For More Efficient Federated Learning
Zirui Zhu, Lifeng Sun
Abstract: Federated Learning (FL) is a machine learning framework proposed to utilize the large amount of private data on edge nodes in a distributed system. Data at different edge nodes often shows strong heterogeneity, which makes the convergence of federated learning slow and means the trained model does not perform well at the edge. In this paper, we propose Federated Mask (FedMask) to address this problem. FedMask uses the Fisher Information Matrix (FIM) as a mask when initializing the local model with the global model, to retain the parameters most important for the local task in the local model. Meanwhile, FedMask uses a Maximum Mean Discrepancy (MMD) constraint to avoid instability in the training process. In addition, we propose a new general evaluation method for FL. Experiments on the MNIST dataset show that our method outperforms the baseline method. When the edge data is heterogeneous, the convergence speed of our method is 55% faster than that of the baseline method, and the performance is improved by 2%.
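The masked initialization idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: it treats the FIM as a per-parameter diagonal importance score, and `keep_ratio` is a hypothetical hyperparameter controlling how many locally important parameters survive the global overwrite.

```python
def fedmask_init(global_params, local_params, fim, keep_ratio=0.3):
    """Initialize a local model from the global model, but keep the
    parameters that the (diagonal) Fisher Information ranks as most
    important for the local task. All arguments are flat lists."""
    n_keep = max(1, int(len(fim) * keep_ratio))
    # indices of the parameters with the largest Fisher information
    important = sorted(range(len(fim)),
                       key=lambda i: fim[i], reverse=True)[:n_keep]
    keep = set(important)
    # masked copy: locally important entries stay local, rest come
    # from the freshly received global model
    return [local_params[i] if i in keep else global_params[i]
            for i in range(len(global_params))]
```

The MMD constraint mentioned in the abstract would act separately, as a regularizer during local training, and is not shown here.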
Unsupervised Gaze: Exploration of Geometric Constraints for 3D Gaze Estimation
Yawen Lu, Yuxing Wang, Yuan Xin, Di Wu, Guoyu Lu
Abstract: Eye gaze estimation can provide critical evidence of people's attention, which has extensive applications in cognitive science and computer vision, such as human behavior analysis and fake user identification. Existing typical methods mostly place eye-tracking sensors directly in front of the eyeballs, which is hard to utilize in the wild, and recent learning-based methods require prior ground-truth annotations of the gaze vector for training. In this paper, we propose an unsupervised learning-based method for estimating eye gaze in 3D space. Building on top of an existing unsupervised approach to regress shape parameters and initialize the depth, we propose to apply a geometric spectral photometric consistency constraint and spatial consistency constraints across multiple views in video sequences to refine the initial depth values on the detected iris landmarks. We demonstrate that our method is able to learn the gaze vector in wild scenes more robustly without ground-truth gaze annotations or 3D supervision, and show that our system achieves competitive performance compared with existing supervised methods.
Median-Pooling Grad-CAM: An Efficient Inference Level Visual Explanation for CNN Networks in Remote Sensing Image Classification
Wei Song, Shuyuan Dai, Dongmei Huang, Jinling Song, Liotta Antonio
Abstract: Gradient-based visual explanation techniques, such as Grad-CAM and Grad-CAM++, have been used to interpret how convolutional neural networks make decisions. But not all such techniques work properly in the task of remote sensing (RS) image classification. In this paper, after analyzing why Grad-CAM performs worse than Grad-CAM++ for RS image classification from the perspective of the weight matrix of gradients, we propose an efficient visual explanation approach dubbed median-pooling Grad-CAM. It uses median pooling to capture the main trend of the gradients and approximates the contributions of feature maps with respect to a specific class. We further propose a new evaluation index, confidence drop %, to express the degree of drop in classification accuracy when occluding the important regions captured by the visual saliency. Experiments on two RS image datasets, and for two CNN models, VGG and ResNet, show that our proposed method offers a good tradeoff between interpretability and efficiency of visual explanation for CNN-based models in RS image classification. The low time-complexity median-pooling Grad-CAM can provide a good complement to gradient-based visual explanation techniques in practice.
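The core change relative to standard Grad-CAM, replacing the mean-pooled gradient weights with median-pooled ones, can be sketched in a few lines. This sketch assumes flattened per-channel activation and gradient maps; the function name and data layout are illustrative, not from the paper.

```python
import statistics

def median_pool_grad_cam(activations, gradients):
    """activations, gradients: one flattened spatial map per channel.
    Each channel's weight is the median of its gradients (median
    pooling, instead of Grad-CAM's global-average pooling); the class
    activation map is the ReLU of the weighted sum of activations."""
    weights = [statistics.median(g) for g in gradients]
    n = len(activations[0])
    cam = [sum(w * a[i] for w, a in zip(weights, activations))
           for i in range(n)]
    return [max(0.0, v) for v in cam]  # ReLU keeps positive evidence
```

The median is less sensitive than the mean to a few extreme gradient values, which is the "main trend of gradients" intuition the abstract describes.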
Multi-granularity Recurrent Attention Graph Neural Network for Few-Shot Learning
Xu Zhang, Youjia Zhang, Zuyu Zhang
Abstract: Few-shot learning aims to learn a classifier that classifies unseen classes well with limited labeled samples. Existing meta learning-based works in few-shot learning, whether graph neural networks or other baseline approaches, have benefited from the meta-learning process with episodic tasks to enhance the generalization ability. However, the performance of meta-learning is greatly affected by the initial embedding network, due to the limited number of samples. In this paper, we propose a novel Multi-granularity Recurrent Attention Graph Neural Network (MRA-GNN), which employs a multi-granularity graph to achieve better generalization ability for few-shot learning. We first construct a Local Proposal Network (LPN) based on attention to generate local images from foreground images. Intra-cluster similarity and inter-cluster dissimilarity are considered in the local images to generate discriminative features. Finally, we take the local images and original images as the input of multi-grained GNN models to perform classification. We evaluate our work through extensive comparisons with previous GNN approaches and other baseline methods on two benchmark datasets (i.e., miniImageNet and CUB). The experimental study on both supervised and semi-supervised few-shot image classification tasks demonstrates that the proposed MRA-GNN significantly improves performance and achieves state-of-the-art results.
EEG Emotion Recognition Based on Channel Attention for E-Healthcare Applications
Xu Zhang, Tianzhi Du, Zuyu Zhang
Abstract: Emotion recognition based on EEG is a critical issue in Brain-Computer Interfaces (BCI). It also plays an important role in e-healthcare systems, especially in the detection and treatment of patients with depression by classifying their mental states. In previous works, feature extraction using multiple frequency bands leads to a redundant use of information, where similar and noisy features are extracted. In this paper, we attempt to overcome this limitation with the proposed architecture, Channel Attention-based Emotion Recognition Networks (CAERN), which can capture more critical and effective EEG emotional features through the use of attention mechanisms. Further, we employ deep residual networks (ResNets) to capture richer information and alleviate gradient vanishing. We evaluate the proposed model on two datasets: the DEAP database and the SJTU emotion EEG database (SEED). Compared to other EEG emotion recognition networks, the proposed model yields better performance. This demonstrates that our approach is capable of capturing more effective features for EEG emotion recognition.
The MovieWall: A New Interface for Browsing Large Video Collections
Marij Nefkens, Wolfgang Hürst
Abstract: Streaming services offer access to huge movie and video collections, resulting in the need for intuitive interaction designs. Yet, most current interfaces are focused on targeted search, neglecting support for interactive data exploration and prioritizing speed over experience. We present the MovieWall, a new interface that complements such designs by enabling users to randomly browse large movie collections. A pilot study proved the feasibility of our approach. We confirmed this observation with a detailed evaluation of an improved design, which received overwhelmingly positive subjective feedback; 80% of the subjects enjoyed using the application and even more stated that they would use it again. The study also gave insight into concrete characteristics of the implementation, such as the benefit of a clustered visualization.
Keystroke Dynamics as Part of Lifelogging
Alan F. Smeaton, Naveen Garaga Krishnamurthy, Amruth Hebbasuru Suryanarayana
Abstract: In this paper we present the case for including keystroke dynamics in lifelogging. We describe how we have used a simple keystroke logging application called Loggerman to create a dataset of longitudinal keystroke timing data spanning a period of up to seven months for four participants. We perform a detailed analysis of this data by examining the timing information associated with bigrams, or pairs of adjacently typed alphabetic characters. We show how the amount of day-on-day variation of the keystroke timing among the top 200 bigrams for each participant varies with the amount of typing each would do on a daily basis. We explore how daily variations could correlate with sleep score from the previous night, but find no significant relationship between the two. Finally, we describe the public release of a portion of this data and include a series of pointers for future work, including correlating keystroke dynamics with mood and fatigue during the day.
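The bigram timing analysis described above can be sketched as follows. This is an assumed minimal form, not the authors' pipeline: the input layout (character, timestamp-in-milliseconds pairs) and the use of the median as the per-bigram summary statistic are illustrative choices.

```python
from collections import defaultdict
import statistics

def bigram_latencies(keystrokes):
    """keystrokes: list of (character, timestamp_ms) pairs in typing
    order. Returns the median inter-key latency for each bigram of
    adjacently typed alphabetic characters."""
    per_bigram = defaultdict(list)
    for (c1, t1), (c2, t2) in zip(keystrokes, keystrokes[1:]):
        if c1.isalpha() and c2.isalpha():   # alphabetic bigrams only
            per_bigram[c1 + c2].append(t2 - t1)
    return {bg: statistics.median(ts) for bg, ts in per_bigram.items()}
```

Repeating this per day and comparing the resulting latency profiles (e.g., for the most frequent bigrams) gives the kind of day-on-day variation measure the paper studies.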
HTAD: A Home-Tasks Activities Dataset with Wrist-Accelerometer and Audio Features
Enrique Garcia-Ceja, Vajira Thambawita, Steven A. Hicks, Debesh Jha, Petter Jakobsen, Hugo L. Hammer, Pål Halvorsen, Michael A. Riegler
Abstract: In this paper, we present HTAD: A Home Tasks Activities Dataset. The dataset contains wrist-accelerometer and audio data from people performing at-home tasks such as sweeping, brushing teeth, washing hands, or watching TV. These activities represent a subset of the activities needed to live independently. Being able to detect activities with wearable devices in real time is important for the realization of assistive technologies, with applications in domains such as elderly care and mental health monitoring. Preliminary results show that using machine learning with the presented dataset leads to promising results, but that there is still room for improvement. By making this dataset public, researchers can test different machine learning algorithms for activity recognition, especially sensor data fusion methods.
- Title
- MultiMedia Modeling
- Edited by
-
Jakub Lokoč
Prof. Tomáš Skopal
Prof. Dr. Klaus Schoeffmann
Vasileios Mezaris
Dr. Xirong Li
Dr. Stefanos Vrochidis
Dr. Ioannis Patras
- Copyright year
- 2021
- Electronic ISBN
- 978-3-030-67835-7
- Print ISBN
- 978-3-030-67834-0
- DOI
- https://doi.org/10.1007/978-3-030-67835-7