
2021 | Book

MultiMedia Modeling

27th International Conference, MMM 2021, Prague, Czech Republic, June 22–24, 2021, Proceedings, Part II

Editors: Jakub Lokoč, Prof. Tomáš Skopal, Prof. Dr. Klaus Schoeffmann, Vasileios Mezaris, Dr. Xirong Li, Dr. Stefanos Vrochidis, Dr. Ioannis Patras

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

The two-volume set LNCS 12572 and 12573 constitutes the thoroughly refereed proceedings of the 27th International Conference on MultiMedia Modeling, MMM 2021, held in Prague, Czech Republic, in June 2021.

Of the 211 submitted regular papers, 40 papers were selected for oral presentation and 33 for poster presentation; 16 special session papers were accepted as well as 2 papers for a demo presentation and 17 papers for participation at the Video Browser Showdown 2021. The papers cover topics such as: multimedia indexing; multimedia mining; multimedia abstraction and summarization; multimedia annotation, tagging and recommendation; multimodal analysis for retrieval applications; semantic analysis of multimedia and contextual data; multimedia fusion methods; multimedia hyperlinking; media content browsing and retrieval tools; media representation and algorithms; audio, image, video processing, coding and compression; multimedia sensors and interaction modes; multimedia privacy, security and content protection; multimedia standards and related issues; advances in multimedia networking and streaming; multimedia databases, content delivery and transport; wireless and mobile multimedia networking; multi-camera and multi-view systems; augmented and virtual reality, virtual environments; real-time and interactive multimedia applications; mobile multimedia applications; multimedia web applications; multimedia authoring and personalization; interactive multimedia and interfaces; sensor networks; social and educational multimedia applications; and emerging trends.

Table of Contents

MSCANet: Adaptive Multi-scale Context Aggregation Network for Congested Crowd Counting

Crowd counting has achieved significant progress with deep convolutional neural networks. However, most existing methods do not fully utilize spatial context information, and it is difficult for them to count congested crowds accurately. To this end, we propose a novel Adaptive Multi-scale Context Aggregation Network (MSCANet), in which a Multi-scale Context Aggregation module (MSCA) is designed to adaptively extract and aggregate contextual information from different scales of the crowd. More specifically, for each input, we first extract multi-scale context features via atrous convolution layers. Then, the multi-scale context features are progressively aggregated via channel attention to enrich the crowd representations at different scales. Finally, a 1×1 convolution layer is applied to regress the crowd density. We perform extensive experiments on three public datasets: ShanghaiTech Part_A, UCF_CC_50 and UCF-QNRF, and the experimental results demonstrate the superiority of our method compared to current state-of-the-art methods.

Yani Zhang, Huailin Zhao, Fangbo Zhou, Qing Zhang, Yanjiao Shi, Lanjun Liang
Tropical Cyclones Tracking Based on Satellite Cloud Images: Database and Comprehensive Study

Tropical cyclones are disastrous weather events that cause serious damage to human communities, so forecasting them efficiently and accurately is necessary to reduce the losses they cause. With the development of computer vision and satellite technology, high-quality meteorological data can be obtained and advanced techniques have been proposed in the visual tracking domain. This makes it possible to develop algorithms for automatic tropical cyclone tracking, which plays a critical role in tropical cyclone forecasting. In this paper, we present a novel database for Tropical Cyclone Tracking based on Satellite Cloud Images, called TCTSCI. To the best of our knowledge, TCTSCI is the first satellite cloud image database for tropical cyclone tracking. It consists of 28 video sequences totaling 3,432 frames of 6001×6001 pixels, and includes tropical cyclones of five different intensities occurring in 2019. Each frame is scientifically inspected and labeled with authoritative tropical cyclone data. Besides, to encourage and facilitate research on multimodal methods for tropical cyclone tracking, TCTSCI provides not only visual bounding box annotations but also multimodal meteorological data of tropical cyclones. We evaluate 11 state-of-the-art and widely used trackers using the OPE and EAO metrics and analyze the challenges that TCTSCI poses for these trackers.

Cheng Huang, Sixian Chan, Cong Bai, Weilong Ding, Jinglin Zhang
Image Registration Improved by Generative Adversarial Networks

The performance of most image registration methods decreases if the quality of the image to be registered is poor, especially when it is contaminated with heavy distortions such as noise, blur, and uneven degradation. To solve this problem, a generative adversarial network (GAN)-based approach with specially designed loss functions is proposed to improve image quality for better registration. Specifically, given paired images, the generator network enhances the distorted image and the discriminator network compares the enhanced image with the ideal image. To efficiently discriminate the enhanced image, the loss function is designed to combine a perceptual loss and an adversarial loss, where the former measures image similarity and the latter pushes the enhanced solution toward the natural image manifold. After enhancement, image features are more accurate and the registrations between feature point pairs are more consistent.

Shiyan Jiang, Ci Wang, Chang Huang
Deep 3D Modeling of Human Bodies from Freehand Sketching

Creating high-quality 3D human body models by freehand sketching is challenging because of the sparsity and ambiguity of hand-drawn strokes. In this paper, we present a sketch-based modeling system for human bodies using deep neural networks. Considering the large variety of human body shapes and poses, we adopt the widely-used parametric representation, SMPL, to produce high-quality models that are compatible with many further applications, such as telepresence, game production, and so on. However, precisely mapping hand-drawn sketches to the SMPL parameters is non-trivial due to the non-linearity and dependency between articulated body parts. In order to solve the huge ambiguity in mapping sketches onto the manifold of human bodies, we introduce the skeleton as the intermediate representation. Our skeleton-aware modeling network first interprets sparse joints from coarse sketches and then predicts the SMPL parameters based on joint-wise features. This skeleton-aware intermediate representation effectively reduces the ambiguity and complexity between the two high-dimensional spaces. Based on our light-weight interpretation network, our system supports interactive creation and editing of 3D human body models by freehand sketching.

Kaizhi Yang, Jintao Lu, Siyu Hu, Xuejin Chen
Two-Stage Real-Time Multi-object Tracking with Candidate Selection

In recent years, multi-object tracking is usually treated as a data association problem based on detection results, also known as tracking-by-detection. Such methods are often difficult to adapt to the requirements of time-critical video analysis applications which consider detection and tracking together. In this paper, we propose to accomplish object detection and appearance embedding via a two-stage network. On the one hand, we accelerate network inference process by sharing a set of low-level features and introducing a Position-Sensitive RoI pooling layer to better estimate the classification probability. On the other hand, to handle unreliable detection results produced by the two-stage network, we select candidates from outputs of both detection and tracking based on a novel scoring function which considers classification probability and tracking confidence together. In this way, we can achieve an effective trade-off between multi-object tracking accuracy and speed. Moreover, we conduct a cascade data association based on the selected candidates to form object trajectories. Extensive experiments show that each component of the tracking framework is effective and our real-time tracker can achieve state-of-the-art performance.

Fan Wang, Lei Luo, En Zhu
Tell as You Imagine: Sentence Imageability-Aware Image Captioning

Image captioning as a multimedia task is advancing in terms of performance in generating captions for general purposes. However, it remains difficult to tailor generated captions to different applications. In this paper, we propose a sentence imageability-aware image captioning method to generate captions tailoring to various applications. Sentence imageability describes how easily the caption can be mentally imagined. This concept is applied to the captioning model to obtain a better understanding of the perception of a generated caption. First, we extend an existing image caption dataset by augmenting its captions’ diversity. Then, a sentence imageability score for each augmented caption is calculated. A modified image captioning model is trained using this extended dataset to generate captions tailoring to a specified imageability score. Experiments showed promising results in generating imageability-aware captions. Especially, results from a subjective experiment showed that the perception of the generated captions correlates with the specified score.

Kazuki Umemura, Marc A. Kastner, Ichiro Ide, Yasutomo Kawanishi, Takatsugu Hirayama, Keisuke Doman, Daisuke Deguchi, Hiroshi Murase
Deep Face Swapping via Cross-Identity Adversarial Training

Generative Adversarial Networks (GANs) have shown promising improvements in face synthesis and image manipulation. However, it remains difficult to swap the faces in videos with a specific target. The most well-known face swapping method, Deepfakes, focuses on reconstructing the face image with an auto-encoder while paying less attention to the identity gap between the source and target faces, which causes the swapped face to look like both the source face and the target face. In this work, we propose to incorporate a cross-identity adversarial training mechanism for highly photo-realistic face swapping. Specifically, we introduce a corresponding discriminator that tries to distinguish swapped faces, reconstructed faces, and real faces during training. In addition, an attention mechanism is applied to make our network robust to variations in illumination. Comprehensive experiments demonstrate the superiority of our method over baseline models both quantitatively and qualitatively.

Shuhui Yang, Han Xue, Jun Ling, Li Song, Rong Xie
Res2-Unet: An Enhanced Network for Generalized Nuclear Segmentation in Pathological Images

The morphology of nuclei in a pathological image plays an essential role in deriving high-quality diagnoses for pathologists. Recently, deep learning techniques have pushed this field forward significantly in terms of generalization ability, i.e., segmenting nuclei from different patients and organs using the same CNN model. However, it remains challenging to design an effective network that segments nuclei accurately, due to their diverse color and morphological appearances, nuclei touching or overlapping, etc. In this paper, we propose a novel network named Res2-Unet to relieve this problem. Res2-Unet inherits the contracting-expansive structure of U-Net and is featured by employing advanced network modules, such as residual and squeeze-and-excitation (SE) blocks, to enhance segmentation capability. The residual module is utilized in both the contracting and expansive paths for comprehensive feature extraction and fusion, respectively, while the SE module enables selective feature propagation between the two paths. We evaluate Res2-Unet on two public nuclei segmentation benchmarks. The experiments show that by equipping the modules individually and jointly, performance gains are consistently observed compared to the baseline and several existing methods.

Shuai Zhao, Xuanya Li, Zhineng Chen, Chang Liu, Changgen Peng
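The squeeze-and-excitation (SE) module used in Res2-Unet reweights feature channels by a learned per-channel importance. A minimal pure-Python sketch of the idea, where list-of-lists stand in for feature-map tensors and the toy weights `w1`, `w2` are illustrative assumptions rather than the paper's implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def squeeze_excite(feature_maps, w1, w2):
    """SE sketch: feature_maps is a list of C channels, each an H x W grid.
    w1 (C x hidden) and w2 (hidden x C) are the two tiny FC layers."""
    # Squeeze: global average pooling per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excite: bottleneck FC with ReLU, then FC with sigmoid.
    hidden = [max(0.0, sum(z[c] * w1[c][j] for c in range(len(z))))
              for j in range(len(w1[0]))]
    scales = [sigmoid(sum(hidden[j] * w2[j][c] for j in range(len(hidden))))
              for c in range(len(z))]
    # Scale: reweight every value of a channel by that channel's importance.
    return [[[v * scales[c] for v in row] for row in feature_maps[c]]
            for c in range(len(feature_maps))]
```

Because the sigmoid maps to (0, 1), each channel is attenuated in proportion to how informative its global summary appears to the excitation layers.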
Automatic Diagnosis of Glaucoma on Color Fundus Images Using Adaptive Mask Deep Network

Glaucoma, a disease characterized by progressive and irreversible defects of the visual field, requires lifelong treatment once confirmed, which highlights the importance of early glaucoma detection. Due to the diversity of glaucoma diagnostic indicators and the diagnostic uncertainty of ophthalmologists, deep learning has recently been applied to glaucoma diagnosis by automatically extracting characteristics from color fundus images, achieving strong performance. In this paper, we propose a novel adaptive mask deep network for effective glaucoma diagnosis on retinal fundus images, which fully utilizes the prior knowledge of ophthalmologists to synthesize attention masks of color fundus images and locate a reasonable region of interest. Based on the synthesized masks, our method can pay careful attention to the effective visual representation of glaucoma. Experiments on several public and private fundus datasets illustrate that our method focuses on the significant area for glaucoma diagnosis and simultaneously achieves strong performance in both academic environments and practical medical applications, providing a useful contribution to improving the automatic diagnosis of glaucoma.

Gang Yang, Fan Li, Dayong Ding, Jun Wu, Jie Xu
Initialize with Mask: For More Efficient Federated Learning

Federated Learning (FL) is a machine learning framework proposed to utilize the large amount of private data on edge nodes in a distributed system. Data at different edge nodes often shows strong heterogeneity, which makes the convergence of federated learning slow and the trained model perform poorly at the edge. In this paper, we propose Federated Mask (FedMask) to address this problem. FedMask uses the Fisher Information Matrix (FIM) as a mask when initializing the local model with the global model, to retain the parameters most important for the local task in the local model. Meanwhile, FedMask uses a Maximum Mean Discrepancy (MMD) constraint to avoid instability in the training process. In addition, we propose a new general evaluation method for FL. Experiments on the MNIST dataset show that our method outperforms the baseline method. When the edge data is heterogeneous, the convergence speed of our method is 55% faster than that of the baseline, and performance is improved by 2%.

Zirui Zhu, Lifeng Sun
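The masked initialization can be illustrated with a toy sketch: treating parameters as a flat list, the local values with the highest Fisher information are retained while the rest are overwritten by the global model. The function name and the `keep_ratio` cut-off are hypothetical; the paper's actual FIM-based masking may differ:

```python
def masked_init(global_params, local_params, fisher_info, keep_ratio=0.3):
    """Initialize a local model from the global one, but keep the locally
    most important parameters, ranked by their Fisher information."""
    k = max(1, int(len(local_params) * keep_ratio))
    # Indices of the k parameters with the largest Fisher information.
    keep = set(sorted(range(len(fisher_info)),
                      key=lambda i: fisher_info[i], reverse=True)[:k])
    # Masked copy: local value where important, global value elsewhere.
    return [local_params[i] if i in keep else global_params[i]
            for i in range(len(global_params))]
```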
Unsupervised Gaze: Exploration of Geometric Constraints for 3D Gaze Estimation

Eye gaze estimation can provide critical evidence of people's attention, which has extensive applications in cognitive science and computer vision, such as human behavior analysis and fake user identification. Existing typical methods mostly place eye-tracking sensors directly in front of the eyeballs, which is hard to do in the wild, and recent learning-based methods require ground-truth gaze vector annotations for training. In this paper, we propose an unsupervised learning-based method for estimating eye gaze in 3D space. Building on top of an existing unsupervised approach to regress shape parameters and initialize the depth, we propose to apply a geometric spectral photometric consistency constraint and spatial consistency constraints across multiple views in video sequences to refine the initial depth values at the detected iris landmarks. We demonstrate that our method learns gaze vectors in wild scenes more robustly without ground-truth gaze annotations or 3D supervision, and show that our system achieves competitive performance compared with existing supervised methods.

Yawen Lu, Yuxing Wang, Yuan Xin, Di Wu, Guoyu Lu
Median-Pooling Grad-CAM: An Efficient Inference Level Visual Explanation for CNN Networks in Remote Sensing Image Classification

Gradient-based visual explanation techniques, such as Grad-CAM and Grad-CAM++, have been used to interpret how convolutional neural networks make decisions, but not all of them work properly for remote sensing (RS) image classification. In this paper, after analyzing why Grad-CAM performs worse than Grad-CAM++ for RS image classification from the perspective of the gradient weight matrix, we propose an efficient visual explanation approach dubbed median-pooling Grad-CAM. It uses median pooling to capture the main trend of the gradients and approximates the contributions of feature maps with respect to a specific class. We further propose a new evaluation index, confidence drop %, to express the degree to which classification accuracy drops when the important regions captured by the visual saliency are occluded. Experiments on two RS image datasets and two CNN models, VGG and ResNet, show that our proposed method offers a good tradeoff between interpretability and efficiency of visual explanation for CNN-based models in RS image classification. The low time-complexity median-pooling Grad-CAM can provide a good complement to gradient-based visual explanation techniques in practice.

Wei Song, Shuyuan Dai, Dongmei Huang, Jinling Song, Liotta Antonio
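Reading the abstract, the core change is replacing Grad-CAM's mean over the gradient map with a per-channel median before forming the class activation map. A small pure-Python sketch under that reading (list-of-lists stand in for the feature-map and gradient tensors; this is not the authors' code):

```python
from statistics import median

def relu(x):
    return x if x > 0 else 0.0

def median_pool_gradcam(activations, gradients):
    """CAM sketch: weight each channel by the *median* of its gradients
    (median pooling), then ReLU the weighted sum of activation maps."""
    # One scalar weight per channel: median over all spatial gradients.
    weights = [median(g for row in grad for g in row) for grad in gradients]
    h, w = len(activations[0]), len(activations[0][0])
    # Weighted sum of channels at each spatial location, clipped at zero.
    return [[relu(sum(weights[c] * activations[c][i][j]
                      for c in range(len(activations))))
             for j in range(w)] for i in range(h)]
```

Compared with mean pooling, the median is insensitive to a few extreme gradient values, which is the "main trend of gradients" intuition in the abstract.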
Multi-granularity Recurrent Attention Graph Neural Network for Few-Shot Learning

Few-shot learning aims to learn a classifier that classifies unseen classes well with limited labeled samples. Existing meta learning-based works in few-shot learning, whether graph neural networks or other baseline approaches, have benefited from the meta-learning process with episodic tasks to enhance generalization ability. However, the performance of meta-learning is greatly affected by the initial embedding network, due to the limited number of samples. In this paper, we propose a novel Multi-granularity Recurrent Attention Graph Neural Network (MRA-GNN), which employs a multi-granularity graph to achieve better generalization for few-shot learning. We first construct an attention-based Local Proposal Network (LPN) to generate local images from foreground images. Intra-cluster similarity and inter-cluster dissimilarity are considered over the local images to generate discriminative features. Finally, we take the local images and original images as the input of multi-grained GNN models to perform classification. We evaluate our work through extensive comparisons with previous GNN approaches and other baseline methods on two benchmark datasets (i.e., miniImageNet and CUB). The experimental study on both supervised and semi-supervised few-shot image classification tasks demonstrates that the proposed MRA-GNN significantly improves performance and achieves state-of-the-art results.

Xu Zhang, Youjia Zhang, Zuyu Zhang
EEG Emotion Recognition Based on Channel Attention for E-Healthcare Applications

Emotion recognition based on EEG is a critical issue in Brain-Computer Interfaces (BCI). It also plays an important role in e-healthcare systems, especially in the detection and treatment of patients with depression by classifying mental states. Unlike previous works, in which feature extraction over multiple frequency bands leads to redundant use of information and similar, noisy features are extracted, we attempt to overcome this limitation with the proposed architecture, Channel Attention-based Emotion Recognition Networks (CAERN). It can capture more critical and effective EEG emotional features through the use of attention mechanisms. Further, we employ deep residual networks (ResNets) to capture richer information and alleviate gradient vanishing. We evaluate the proposed model on two datasets: the DEAP database and the SJTU emotion EEG database (SEED). Compared to other EEG emotion recognition networks, the proposed model yields better performance, demonstrating that our approach captures more effective features for EEG emotion recognition.

Xu Zhang, Tianzhi Du, Zuyu Zhang
The MovieWall: A New Interface for Browsing Large Video Collections

Streaming services offer access to huge amounts of movie and video collections, resulting in the need for intuitive interaction designs. Yet, most current interfaces are focused on targeted search, neglecting support for interactive data exploration and prioritizing speed over experience. We present the MovieWall, a new interface that complements such designs by enabling users to randomly browse large movie collections. A pilot study proved the feasibility of our approach. We confirmed this observation with a detailed evaluation of an improved design, which received overwhelmingly positive subjective feedback; 80% of the subjects enjoyed using the application and even more stated that they would use it again. The study also gave insight into concrete characteristics of the implementation, such as the benefit of a clustered visualization.

Marij Nefkens, Wolfgang Hürst
Keystroke Dynamics as Part of Lifelogging

In this paper we present the case for including keystroke dynamics in lifelogging. We describe how we have used a simple keystroke logging application called Loggerman, to create a dataset of longitudinal keystroke timing data spanning a period of up to seven months for four participants. We perform a detailed analysis of this data by examining the timing information associated with bigrams or pairs of adjacently-typed alphabetic characters. We show how the amount of day-on-day variation of the keystroke timing among the top-200 bigrams for participants varies with the amount of typing each would do on a daily basis. We explore how daily variations could correlate with sleep score from the previous night but find no significant relationship between the two. Finally we describe the public release of a portion of this data and we include a series of pointers for future work including correlating keystroke dynamics with mood and fatigue during the day.

Alan F. Smeaton, Naveen Garaga Krishnamurthy, Amruth Hebbasuru Suryanarayana
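The bigram timing analysis described above amounts to grouping inter-key latencies by pairs of adjacently typed alphabetic characters. A minimal sketch, assuming keystrokes arrive as `(character, timestamp)` events; the event format is an assumption, and Loggerman's actual log format may differ:

```python
from collections import defaultdict

def bigram_latencies(keystrokes):
    """Group inter-key latencies (ms) by adjacently typed letter pairs.

    keystrokes: list of (character, timestamp_ms) events in typing order.
    Non-alphabetic characters break the chain, mirroring the paper's focus
    on pairs of adjacently-typed alphabetic characters."""
    timings = defaultdict(list)
    for (c1, t1), (c2, t2) in zip(keystrokes, keystrokes[1:]):
        if c1.isalpha() and c2.isalpha():
            timings[c1 + c2].append(t2 - t1)
    return dict(timings)
```

From such per-bigram latency lists, day-on-day statistics (e.g. medians of the top-200 bigrams) can then be compared across participants.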
HTAD: A Home-Tasks Activities Dataset with Wrist-Accelerometer and Audio Features

In this paper, we present HTAD: A Home Tasks Activities Dataset. The dataset contains wrist-accelerometer and audio data from people performing at-home tasks such as sweeping, brushing teeth, washing hands, or watching TV. These activities represent a subset of the activities needed to live independently. Being able to detect activities with wearable devices in real time is important for the realization of assistive technologies, with applications in domains such as elderly care and mental health monitoring. Preliminary results show that using machine learning with the presented dataset leads to promising results, but there is still potential for improvement. By making this dataset public, researchers can test different machine learning algorithms for activity recognition, especially sensor data fusion methods.

Enrique Garcia-Ceja, Vajira Thambawita, Steven A. Hicks, Debesh Jha, Petter Jakobsen, Hugo L. Hammer, Pål Halvorsen, Michael A. Riegler
MNR-Air: An Economic and Dynamic Crowdsourcing Mechanism to Collect Personal Lifelog and Surrounding Environment Dataset. A Case Study in Ho Chi Minh City, Vietnam

This paper introduces an economical and dynamic crowdsourcing mechanism, namely MNR-Air, to collect personal lifelog and associated environment datasets. This mechanism's significant advantage is the use of personal sensor boxes that can be carried by citizens (and their vehicles) to collect data. The MNR-HCM dataset, the output of MNR-Air collected in Ho Chi Minh City, Vietnam, is also introduced in this paper. It contains weather data, air pollution data, GPS data, lifelog images, and citizens' perceptions of urban nature on a personal scale. We also introduce AQI-T-RM, an application that helps people plan their travel to avoid as much air pollution as possible while still saving travel time. Besides, we discuss how MNR-Air can contribute to the open data science community and other communities that benefit citizens living in urban areas.

Dang-Hieu Nguyen, Tan-Loc Nguyen-Tai, Minh-Tam Nguyen, Thanh-Binh Nguyen, Minh-Son Dao
Kvasir-Instrument: Diagnostic and Therapeutic Tool Segmentation Dataset in Gastrointestinal Endoscopy

Gastrointestinal (GI) pathologies are periodically screened, biopsied, and resected using surgical tools. Usually, the procedures and the treated or resected areas are not specifically tracked or analysed during or after colonoscopies, and information regarding disease borders, development, and the amount and size of the resected area gets lost. This can lead to poor follow-up and bothersome reassessment difficulties post-treatment. To improve the current standard and to foster more research on the topic, we have released the “Kvasir-Instrument” dataset, which consists of 590 annotated frames containing GI procedure tools such as snares, balloons, and biopsy forceps. Besides the images, the dataset includes ground-truth masks and bounding boxes and has been verified by two expert GI endoscopists. Additionally, we provide a baseline for the segmentation of GI tools to promote research and algorithm development. We obtained a dice coefficient of 0.9158 and a Jaccard index of 0.8578 using a classical U-Net architecture, and a similar dice coefficient was observed for DoubleUNet. The qualitative results showed that the models did not work for images with specularity or frames with multiple tools, while the best results for both methods were observed on all other types of images. Both qualitative and quantitative results show that the models perform reasonably well, but there is potential for further improvement. Benchmarking on the dataset provides an opportunity for researchers to contribute to the field of automatic endoscopic diagnostic and therapeutic tool segmentation for GI endoscopy.

Debesh Jha, Sharib Ali, Krister Emanuelsen, Steven A. Hicks, Vajira Thambawita, Enrique Garcia-Ceja, Michael A. Riegler, Thomas de Lange, Peter T. Schmidt, Håvard D. Johansen, Dag Johansen, Pål Halvorsen
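The reported dice coefficient and Jaccard index are standard overlap measures between a predicted mask and the ground truth. A minimal sketch for flat binary masks:

```python
def dice_jaccard(pred, truth):
    """Dice coefficient and Jaccard index for binary segmentation masks,
    given as flat 0/1 lists of equal length."""
    tp = sum(p and t for p, t in zip(pred, truth))  # overlap (intersection)
    psum, tsum = sum(pred), sum(truth)
    dice = 2 * tp / (psum + tsum) if (psum + tsum) else 1.0
    union = psum + tsum - tp
    jaccard = tp / union if union else 1.0
    return dice, jaccard
```

The two are monotonically related (J = D / (2 - D)), so a dice score of 0.9158 and a Jaccard index of 0.8578, as reported, are consistent with each other up to rounding.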
CatMeows: A Publicly-Available Dataset of Cat Vocalizations

This work presents a dataset of cat vocalizations focusing on the meows emitted in three different contexts: brushing, isolation in an unfamiliar environment, and waiting for food. The dataset contains vocalizations produced by 21 cats belonging to two breeds, namely Maine Coon and European Shorthair. Sounds have been recorded using low-cost devices easily available on the marketplace, and the data acquired are representative of real-world cases both in terms of audio quality and acoustic conditions. The dataset is open-access, released under Creative Commons Attribution 4.0 International licence, and it can be retrieved from the Zenodo web repository.

Luca A. Ludovico, Stavros Ntalampiras, Giorgio Presti, Simona Cannas, Monica Battini, Silvana Mattiello
Search and Explore Strategies for Interactive Analysis of Real-Life Image Collections with Unknown and Unique Categories

Many real-life image collections contain image categories that are unique to that specific collection and have not been seen before by any human expert analyst or by a machine. This prevents supervised machine learning from being effective and makes evaluation of such an image collection inefficient. Real-life collections call for a multimedia analytics solution in which the expert searches and explores the image collection, supported by machine learning algorithms. We propose a method that covers both exploration and search strategies for such complex image collections. Several strategies are evaluated through an artificial user model, and two user studies were performed with experts and students, respectively, to validate the proposed method. As such a method can only be evaluated properly in a real-life application, it is applied to the MH17 airplane crash photo database, on which we have expert knowledge; to show that the proposed method also helps with other image collections, an image collection created from the Open Images Database is used. We show that by combining image features extracted with a convolutional neural network pretrained on ImageNet 1k, intelligent use of clustering, a well-chosen strategy, and expert knowledge, an image collection such as the MH17 airplane crash photo database can be interactively structured into relevant, dynamically generated categories, allowing the user to analyse the collection efficiently.

Floris Gisolf, Zeno Geradts, Marcel Worring
Graph-Based Indexing and Retrieval of Lifelog Data

Understanding the relationship between objects in an image is an important challenge because it can help to describe actions in the image. In this paper, a graphical data structure, named “Scene Graph”, is utilized to represent an encoded informative visual relationship graph for an image, which we suggest has a wide range of potential applications. This scene graph is applied and tested in the popular domain of lifelogs, and specifically in the challenge of known-item retrieval from lifelogs. In this work, every lifelog image is represented by a scene graph, and at retrieval time, this scene graph is compared with the semantic graph, parsed from a textual query. The result is combined with location or date information to determine the matching items. The experiment shows that this technique can outperform a conventional method.

Manh-Duy Nguyen, Binh T. Nguyen, Cathal Gurrin
On Fusion of Learned and Designed Features for Video Data Analytics

Video cameras have become widely used for indoor and outdoor surveillance. Covering more and more public space in cities, the cameras serve various purposes ranging from security to traffic monitoring, urban life, and marketing. However, with the increasing quantity of utilized cameras and recorded streams, manual video monitoring and analysis becomes too laborious. The goal is to obtain effective and efficient artificial intelligence models to process the video data automatically and produce the desired features for data analytics. To this end, we propose a framework for real-time video feature extraction that fuses both learned and hand-designed analytical models and is applicable in real-life situations. Nowadays, state-of-the-art models for various computer vision tasks are implemented by deep learning. However, the exhaustive gathering of labeled training data and the computational complexity of resulting models can often render them impractical. We need to consider the benefits and limitations of each technique and find the synergy between both deep learning and analytical models. Deep learning methods are more suited for simpler tasks on large volumes of dense data while analytical modeling can be sufficient for processing of sparse data with complex structures. Our framework follows those principles by taking advantage of multiple levels of abstraction. In a use case, we show how the framework can be set for an advanced video analysis of urban life.

Marek Dobranský, Tomáš Skopal
XQM: Interactive Learning on Mobile Phones

There is an increasing need for intelligent interaction with media collections, and mobile phones are gaining significant traction as the device of choice for many users. In this paper, we present XQM, a mobile approach for intelligent interaction with the user’s media on the phone, tackling the inherent challenges of the highly dynamic nature of mobile media collections and limited computational resources of the mobile device. We employ interactive learning, a method that conducts interaction rounds with the user, each consisting of the system suggesting relevant images based on its current model, the user providing relevance labels, the system’s model retraining itself based on these labels, and the system obtaining a new set of suggestions for the next round. This method is suitable for the dynamic nature of mobile media collections and the limited computational resources. We show that XQM, a full-fledged app implemented for Android, operates on 10K image collections in interactive time (less than 1.4 s per interaction round), and evaluate user experience in a user study that confirms XQM’s effectiveness.

Alexandra M. Bagi, Kim I. Schild, Omar Shahbaz Khan, Jan Zahálka, Björn Þór Jónsson
A Multimodal Tensor-Based Late Fusion Approach for Satellite Image Search in Sentinel 2 Images

Earth Observation (EO) Big Data collections are acquired at large volume and variety due to their highly heterogeneous nature. The multimodal character of EO Big Data requires the effective combination of multiple modalities for similarity search. We propose a late fusion mechanism of multiple rankings to combine the results from several uni-modal searches in Sentinel 2 image collections. We first create a K-order tensor from the results of separate searches by visual features, concepts, and spatial and temporal information. Visual concepts and features are based on vector representations from Deep Convolutional Neural Networks. 2D surfaces of the K-order tensor initially provide candidate retrieved results per ranking position and are merged to obtain the final list of retrieved results. Satellite image patches are used as queries in order to retrieve the most relevant image patches in Sentinel 2 images. Quantitative and qualitative results show that the proposed method outperforms search by a single modality as well as other late fusion methods.
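The position-by-position merging of uni-modal rankings can be illustrated with a much-simplified sketch: each modality contributes a ranked list of patch ids, and candidates are gathered rank by rank and de-duplicated. The patch ids are hypothetical, and the actual method merges 2D surfaces of a K-order tensor rather than flat lists.

```python
def late_fuse(rankings, top_k=5):
    """Merge several ranked lists: collect candidates per ranking position."""
    fused, seen = [], set()
    for pos in range(max(map(len, rankings))):
        for ranking in rankings:               # one list per modality
            if pos < len(ranking) and ranking[pos] not in seen:
                seen.add(ranking[pos])
                fused.append(ranking[pos])
    return fused[:top_k]

# Hypothetical per-modality rankings of satellite image patches.
visual   = ["p3", "p1", "p7"]
concepts = ["p1", "p3", "p9"]
temporal = ["p3", "p9", "p2"]
print(late_fuse([visual, concepts, temporal]))
```

Items that several modalities agree on at early positions surface first, which is the intuition behind fusing the rankings rather than trusting any single modality.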

Ilias Gialampoukidis, Anastasia Moumtzidou, Marios Bakratsas, Stefanos Vrochidis, Ioannis Kompatsiaris
Canopy Height Estimation from Spaceborne Imagery Using Convolutional Encoder-Decoder

The recent advances in multimedia modeling with deep learning methods have significantly affected remote sensing applications, such as canopy height mapping. Estimating canopy height maps at large scale is an important step towards sustainable ecosystem management. Apart from the standard height estimation method using LiDAR data, other airborne measurement techniques, such as very high-resolution passive airborne imaging, have also been shown to provide accurate estimations. However, those methods suffer from high cost and cannot be used at large scale or frequently. In our study, we adopt a neural network architecture to estimate pixel-wise canopy height from cost-effective spaceborne imagery. A deep convolutional encoder-decoder network, based on the SegNet architecture together with skip connections, is trained to map the multi-spectral pixels of a Sentinel-2 input image to height values via end-to-end learned texture features. Experimental results in a study area of 942 $$\mathrm{km}^2$$ yield similar or better estimation accuracy and resolution in comparison with a method based on costly airborne images as well as with another state-of-the-art deep learning approach based on spaceborne images.

Leonidas Alagialoglou, Ioannis Manakos, Marco Heurich, Jaroslav Červenka, Anastasios Delopoulos
Implementation of a Random Forest Classifier to Examine Wildfire Predictive Modelling in Greece Using Diachronically Collected Fire Occurrence and Fire Mapping Data

Forest fires cause severe damage to ecosystems, human lives, and infrastructure globally. This situation tends to get worse in the coming decades due to climate change and the expected increase in the length and severity of the fire season. Thus, the ability to develop a method that reliably models the risk of fire occurrence is an important step towards preventing, confronting, and limiting the disaster. Different approaches building upon Machine Learning (ML) methods for predicting wildfires and deriving a better understanding of fire regimes have been devised. This study demonstrates the development of a Random Forest (RF) classifier to predict “fire”/“non-fire” classes in Greece. To this end, a prototype database of validated fires and fire-related features, representative of the Mediterranean ecosystem, has been created. The database is populated with data (e.g. Earth Observation derived biophysical parameters and daily collected climatic and weather data) for a period of nine years (2010–2018). Spatially, it refers to 500 m wide grid cells where Active Fires (AF) and Burned Areas/Burn Scars (BSM) were reported during that period. By using feature ranking techniques such as Chi-squared and Spearman correlations, the study showcases the most significant wildfire-triggering variables. It also highlights the extent to which the database and selected feature scheme can be used to successfully train an RF classifier for deriving “fire”/“non-fire” predictions over the country of Greece, in the prospect of generating a dynamic fire risk system for daily assessments.
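The pipeline of chi-squared feature ranking followed by a Random Forest classifier can be sketched as below. The feature matrix and labels are synthetic stand-ins; the real database holds EO-derived biophysical parameters and daily climatic/weather data per 500 m grid cell.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-ins for the database features and "fire"/"non-fire" labels.
rng = np.random.default_rng(42)
X = rng.uniform(size=(600, 10))               # chi2 requires non-negative inputs
y = (X[:, 0] + X[:, 1] > 1.2).astype(int)     # hypothetical fire-occurrence rule

# Rank features by chi-squared score and keep only the most significant ones.
selector = SelectKBest(chi2, k=4).fit(X, y)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(selector.transform(X), y)

# Daily assessment over new grid cells would call predict on fresh feature rows.
preds = clf.predict(selector.transform(X[:3]))
print(preds)
```

Spearman-based ranking could be substituted for `chi2` via a custom score function; the structure of the pipeline stays the same.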

Alexis Apostolakis, Stella Girtsou, Charalampos Kontoes, Ioannis Papoutsis, Michalis Tsoutsos
Mobile eHealth Platform for Home Monitoring of Bipolar Disorder

People suffering from Bipolar Disorder (BD) experience changes in mood, having depressive or manic episodes with normal periods in between. BD is a chronic disease with a high level of non-adherence to medication that requires continuous monitoring of patients to detect when they relapse into an episode, so that physicians can take care of them. Here we present MoodRecord, an easy-to-use, non-intrusive, multilingual, robust and scalable platform suitable for home monitoring of patients with BD, which allows physicians and relatives to track the patient’s state and receive alarms when abnormalities occur. MoodRecord takes advantage of the capabilities of smartphones as communication and recording devices for the continuous monitoring of patients. It automatically records user activity and asks users to answer some questions or to record themselves on video, according to a predefined plan designed by physicians. The video is analysed, recognising the mood status from images, and bipolar assessment scores are extracted from speech parameters. The data obtained from the different sources are merged periodically to observe whether a relapse may be starting and, if so, to raise the corresponding alarm. The application received a positive evaluation in a pilot with users from three different countries. During the pilot, the predictions of the voice and image modules showed a coherent correlation with the diagnoses performed by clinicians.

Joan Codina-Filbà, Sergio Escalera, Joan Escudero, Coen Antens, Pau Buch-Cardona, Mireia Farrús
Multimodal Sensor Data Analysis for Detection of Risk Situations of Fragile People in @home Environments

Multimedia (MM) nowadays often means “multimodality”. The target application area of MM technologies now extends to healthcare. Health parameter monitoring and context and situational recognition in ambient assisted living all require tailored solutions. We are interested in the development of AI solutions for the prevention of risk situations of fragile people living at home. This research requires tight collaboration between IT researchers, psychologists, and kinesiologists. In this paper we present a large collaborative project between such actors for developing future solutions for the detection of risk situations of fragile people. We report on the definition of risk scenarios, which have been simulated in the data collected with the developed Android application. Adapted annotation scenarios for sensory and visual data are elaborated. A pilot corpus recorded with healthy volunteers in everyday life situations is presented. Preliminary detection results on the LSC dataset show the complexity of real-life recognition tasks.

Thinhinane Yebda, Jenny Benois-Pineau, Marion Pech, Hélène Amieva, Laura Middleton, Max Bergelt
Towards the Development of a Trustworthy Chatbot for Mental Health Applications

Research on conversational chatbots for mental health applications is an emerging topic. Current work focuses primarily on the usability and acceptance of such systems. However, the human-computer trust relationship is often overlooked, even though it is highly important for the acceptance of chatbots in a clinical environment. This paper presents the creation and evaluation of a trustworthy agent using relational and proactive dialogue. A pilot study with non-clinical subjects showed that a relational strategy using empathetic reactions and small talk failed to foster human-computer trust. However, changing the initiative to be more proactive seems to be welcomed, as it is perceived as more reliable and understandable by users.

Matthias Kraus, Philip Seldschopf, Wolfgang Minker
Fusion of Multimodal Sensor Data for Effective Human Action Recognition in the Service of Medical Platforms

In what has arguably been one of the most troubling periods of recent medical history, with a global pandemic emphasising the importance of staying healthy, innovative tools that shelter patient well-being gain momentum. In that view, a framework is proposed that leverages multimodal data, namely data originating from inertial and depth sensors, can be integrated into healthcare-oriented platforms, and tackles the crucial task of human action recognition (HAR). To analyse person movement and consequently assess the patient’s condition, an effective two-fold methodology is presented: initially, Kinect-based action representations are constructed from handcrafted 3DHOG depth features and the descriptive power of a Fisher encoding scheme. This is complemented by wearable sensor data analysis using time-domain features, and then boosted by exploring fusion strategies of minimum expense. Finally, an extended experimental process reveals competitive results on a well-known benchmark dataset and indicates the applicability of our methodology for HAR.

Panagiotis Giannakeris, Athina Tsanousa, Thanasis Mavropoulos, Georgios Meditskos, Konstantinos Ioannidis, Stefanos Vrochidis, Ioannis Kompatsiaris
SpotifyGraph: Visualisation of User’s Preferences in Music

Many music streaming portals recommend lists of songs to their users. These recommendations are often the results of black-box algorithms (from the user’s perspective). However, irrelevant recommendations without proper justification may considerably hinder the user’s trust. Moreover, user profiles in music streaming services tend to be very large, consisting of hundreds of artists and thousands of tracks. So not only are the details of the recommendation procedure hidden from the user, but he/she often also lacks sufficient knowledge about the source data from which the recommendations are derived. In order to cope with these challenges, we propose the SpotifyGraph application. The application aims at a comprehensible visualization of the relations within the Spotify user’s profile and thereby improves the understandability of the provided recommendations.

Pavel Gajdusek, Ladislav Peska
A System for Interactive Multimedia Retrieval Evaluations

The evaluation of the performance of interactive multimedia retrieval systems is a methodologically non-trivial endeavour and requires specialized infrastructure. Current evaluation campaigns have so far relied on a local setting, where all retrieval systems needed to be evaluated at the same physical location at the same time. This constraint not only complicates the organization and coordination but also limits the number of systems which can reasonably be evaluated within a set time frame. Travel restrictions might further limit the possibility of holding such evaluations. To address these problems, evaluations need to be conducted in a (geographically) distributed setting, which was so far not possible due to the lack of supporting infrastructure. In this paper, we present the Distributed Retrieval Evaluation Server (DRES), an open-source evaluation system that facilitates evaluation campaigns for interactive multimedia retrieval systems in both traditional on-site and fully distributed settings, and which has already proven effective in a competitive evaluation.

Luca Rossetto, Ralph Gasser, Loris Sauter, Abraham Bernstein, Heiko Schuldt
SQL-Like Interpretable Interactive Video Search

Concept-free search, which embeds text and video signals in a joint space for retrieval, appears to be the new state of the art. However, this new search paradigm suffers from two limitations. First, the search result is unpredictable and not interpretable. Second, the embedded features are in a high-dimensional space, hindering real-time indexing and search. In this paper, we present a new implementation of the Vireo video search system (Vireo-VSS), which employs a dual-task model to index each video segment with a low-dimensional embedding feature and a concept list for retrieval. The concept list serves as a reference to interpret its associated embedded feature. With these changes, a SQL-like querying interface is designed so that a user can specify the search content (subject, predicate, object) and constraints (logical conditions) in a semi-structured way. The system decomposes the SQL-like query into multiple sub-queries, depending on the constraints being specified. Each sub-query is translated into an embedding feature and a concept list for video retrieval. The search result is compiled by union or pruning of the result lists from the multiple sub-queries. The SQL-like interface is also extended to temporal querying by providing multiple SQL templates with which users can specify the temporal evolution of a query.
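The decomposition of a semi-structured query into sub-queries can be sketched with a toy parser. The grammar below (a WHERE clause of subject/predicate/object equalities combined with AND/OR) is an illustrative assumption, not the exact Vireo-VSS syntax; in the real system each sub-query would then be translated into an embedding feature and a concept list.

```python
import re

def decompose(query):
    """Split a SQL-like query into sub-queries, one per OR branch.

    Hypothetical grammar: SELECT ... WHERE field='value' AND ... OR ...
    """
    where = re.search(r"WHERE\s+(.*)", query, re.IGNORECASE).group(1)
    sub_queries = []
    for branch in re.split(r"\s+OR\s+", where):
        triple = {}
        for cond in re.split(r"\s+AND\s+", branch):
            field, value = cond.split("=", 1)
            triple[field.strip()] = value.strip().strip("'")
        sub_queries.append(triple)
    return sub_queries

q = "SELECT shot WHERE subject='man' AND predicate='riding' AND object='horse' OR subject='jockey'"
print(decompose(q))
```

The union/pruning step described in the abstract would then operate on the per-sub-query result lists returned for each parsed triple.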

Jiaxin Wu, Phuong Anh Nguyen, Zhixin Ma, Chong-Wah Ngo
VERGE in VBS 2021

This paper presents VERGE, an interactive video search engine that supports efficient browsing and searching into a collection of images or videos. The framework involves a variety of retrieval approaches as well as reranking and fusion capabilities. A Web application enables users to create queries and view the results in a fast and friendly manner.

Stelios Andreadis, Anastasia Moumtzidou, Konstantinos Gkountakos, Nick Pantelidis, Konstantinos Apostolidis, Damianos Galanopoulos, Ilias Gialampoukidis, Stefanos Vrochidis, Vasileios Mezaris, Ioannis Kompatsiaris
NoShot Video Browser at VBS2021

We present our NoShot Video Browser, which has been successfully used at the last Video Browser Showdown competition, VBS2020 at MMM2020. NoShot is given its name due to the fact that it neither makes use of any kind of shot detection nor utilizes the VBS master shots. Instead, videos are split into frames with a time distance of one second. The biggest strength of the system lies in its “time cache” feature, which shows results with the best confidence within a range of seconds.

Christof Karisch, Andreas Leibetseder, Klaus Schoeffmann
Exquisitor at the Video Browser Showdown 2021: Relationships Between Semantic Classifiers

Exquisitor is a scalable media exploration system based on interactive learning, which first took part in VBS in 2020. This paper presents an extension to Exquisitor, which supports operations on semantic classifiers to solve VBS tasks with temporal constraints. We outline the approach and present preliminary results, which indicate the potential of the approach.

Omar Shahbaz Khan, Björn Þór Jónsson, Mathias Larsen, Liam Poulsen, Dennis C. Koelma, Stevan Rudinac, Marcel Worring, Jan Zahálka
VideoGraph – Towards Using Knowledge Graphs for Interactive Video Retrieval

Video is a very expressive medium, able to capture a wide variety of information in different ways. While there have been many advances in the recent past, which enable the annotation of semantic concepts as well as individual objects within video, their larger context has so far not extensively been used for the purpose of retrieval. In this paper, we introduce the first iteration of VideoGraph, a knowledge graph-based video retrieval system. VideoGraph combines information extracted from multiple video modalities with external knowledge bases to produce a semantically enriched representation of the content in a video collection, which can then be retrieved using graph traversal. For the 2021 Video Browser Showdown, we show the first proof-of-concept of such a graph-based video retrieval approach.

Luca Rossetto, Matthias Baumgartner, Narges Ashena, Florian Ruosch, Romana Pernisch, Lucien Heitz, Abraham Bernstein
IVIST: Interactive Video Search Tool in VBS 2021

This paper presents a new version of the Interactive VIdeo Search Tool (IVIST), a video retrieval tool, for participation in the Video Browser Showdown (VBS) 2021. The previous IVIST (VBS 2020) had core functions for practical video search, such as object detection, scene-text recognition, and dominant-color finding. In addition to these core functions, we supplement other helpful functions to find videos more effectively: action recognition, place recognition, and description-based search. These features are expected to enable a more detailed search, especially for human motion and background descriptions, which could not be covered by the previous IVIST system. Furthermore, the user interface has been enhanced in a more user-friendly way. With these enhanced functions, the new version of IVIST can be practical and widely used by actual users.

Yoonho Lee, Heeju Choi, Sungjune Park, Yong Man Ro
Video Search with Collage Queries

Nowadays, popular web search portals enable users to find available images corresponding to a provided free-form text description. With such sources of example images, a suitable composition/collage of images can be constructed as an appropriate visual query input to a known-item search system. In this paper, we investigate a querying approach enabling users to search videos with a multi-query consisting of positioned example images, a so-called collage query, depicting expected objects in a searched scene. The approach relies on images from external search engines, partitioning of preselected representative video frames, and relevance scoring based on deep features extracted from images/frames; it is currently integrated into the open-source version of the SOMHunter system, providing additional browsing capabilities.
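The relevance scoring for positioned example images can be illustrated with a minimal sketch: the collage places query images on a grid, each frame is partitioned the same way, and frames are ranked by the summed cosine similarity between query cells and the matching frame cells. The 2x2 grid and the feature vectors are hypothetical stand-ins for the system's actual partitioning and deep features.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_frame(frame_cells, collage):
    """frame_cells: (n_frames, n_cells, dim); collage: {cell_index: feature}.

    Score each frame by summing cosine similarities over the collage's
    occupied cells, and return the index of the best-matching frame.
    """
    scores = [sum(cosine(frame[c], q) for c, q in collage.items())
              for frame in frame_cells]
    return int(np.argmax(scores))

# Hypothetical per-cell features for 50 frames on a 2x2 grid (4 cells).
rng = np.random.default_rng(1)
frames = rng.normal(size=(50, 4, 32))

# Collage query built from frame 17's top-left and bottom-right cells,
# mimicking example images placed at those positions.
collage = {0: frames[17, 0], 3: frames[17, 3]}
print(best_frame(frames, collage))
```

Because the query cells are identical to frame 17's cells, that frame attains the maximum possible score, so the sketch recovers it as the best match.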

Jakub Lokoč, Jana Bátoryová, Dominik Smrž, Marek Dobranský
Towards Explainable Interactive Multi-modal Video Retrieval with Vitrivr

This paper presents the most recent iteration of the vitrivr multimedia retrieval system for its participation in the Video Browser Showdown (VBS) 2021. Building on existing functionality for interactive multi-modal retrieval, we overhaul query formulation and results presentation for queries which specify a temporal context, extend our database with index structures for similarity search, and present experimental functionality aimed at improving the explainability of results, with the objective of better supporting users in the selection of results and the provision of relevance feedback.

Silvan Heller, Ralph Gasser, Cristina Illi, Maurizio Pasquinelli, Loris Sauter, Florian Spiess, Heiko Schuldt
Competitive Interactive Video Retrieval in Virtual Reality with vitrivr-VR

Virtual Reality (VR) has emerged and developed as a new modality to interact with multimedia data. In this paper, we present vitrivr-VR, a prototype of an interactive multimedia retrieval system in VR based on the open source full-stack multimedia retrieval system vitrivr. We have implemented query formulation tailored to VR: Users can use speech-to-text to search collections via text for concepts, OCR and ASR data as well as entire scene descriptions through a video-text co-embedding feature that embeds sentences and video sequences into the same feature space. Result presentation and relevance feedback in vitrivr-VR leverages the capabilities of virtual spaces.

Florian Spiess, Ralph Gasser, Silvan Heller, Luca Rossetto, Loris Sauter, Heiko Schuldt
An Interactive Video Search Tool: A Case Study Using the V3C1 Dataset

This paper presents a prototype of an interactive video search tool prepared for the MMM 2021 Video Browser Showdown (VBS). Our tool is tailored to searching the public V3C1 dataset together with various analysis results, including detected objects, speech recognition, and visual features. It supports two types of search: text-based and visual-based. With a text-based search, the tool enables users to query videos using textual descriptions, while with a visual-based search, one provides a video example to search for similar videos. Metadata extracted by recent state-of-the-art computer vision algorithms for object detection and visual features are used for accurate search. For efficient search, the metadata are managed in two database engines: Whoosh and PostgreSQL. The tool also enables users to refine the search results by providing relevance feedback and customizing the intermediate analysis of the query inputs.

Abdullah Alfarrarjeh, Jungwon Yoon, Seon Ho Kim, Amani Abu Jabal, Akarsh Nagaraj, Chinmayee Siddaramaiah
Less is More - diveXplore 5.0 at VBS 2021

As a longstanding participant in the annual Video Browser Showdown (VBS2017–VBS2020) as well as in two iterations of the more recently established Lifelog Search Challenge (LSC2018–LSC2019), diveXplore has been developed as a feature-rich Deep Interactive Video Exploration system. After its initial successful employment as a competitive tool at these challenges, its performance, however, declined as new features were introduced, increasing its overall complexity. We mainly attribute this to the fact that many additions to the system needed to revolve around the system’s core element, an interactive, self-organizing, browsable feature map, which, as an integral component, did not accommodate the addition of new features well. Therefore, to counteract said performance decline, the VBS 2021 version constitutes a completely rebuilt version 5.0, implemented from scratch with the aim of greatly reducing the system’s complexity while keeping proven useful features in a modular manner.

Andreas Leibetseder, Klaus Schoeffmann
SOMHunter V2 at Video Browser Showdown 2021

This paper presents an enhanced version of SOMHunter, the interactive video retrieval tool that won the Video Browser Showdown 2020. The presented enhancements focus on improving the text querying capabilities, since the text search model plays a crucial part in successful searches. Hence, we introduce the ability to specify multiple text queries with additional positional specifications, so users can better describe the positional relationships of objects. Moreover, the possibility to further specify text queries with an example image is introduced, along with consequent changes to the tool’s user interface.

Patrik Veselý, František Mejzlík, Jakub Lokoč
W2VV++ BERT Model at VBS 2021

The BoW variant of the W2VV++ model integrated into the VIRET and SOMHunter systems proved its effectiveness in the previous Video Browser Showdown competition in 2020. As the next experimental interactive search prototype to benchmark, we consider a simple system relying on the more complex BERT variant of the W2VV++ model, which accepts rich text input. The input can be provided by keyboard or by speech processed by a third-party cloud service. The motivation for the more complex BERT variant is its good performance on the rich text descriptions that can be provided for known-item search tasks. At the same time, users will be instructed to specify as rich a text description of the searched scene as possible.

Ladislav Peška, Gregor Kovalčík, Tomáš Souček, Vít Škrhák, Jakub Lokoč
VISIONE at Video Browser Showdown 2021

This paper presents the second release of VISIONE, a tool for effective video search on large-scale collections. It allows users to search for videos using textual descriptions, keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial relationships, and image similarity. One of the main features of our system is that it employs specially designed textual encodings for indexing and searching video content using the mature and scalable Apache Lucene full-text search engine.

Giuseppe Amato, Paolo Bolettieri, Fabrizio Falchi, Claudio Gennaro, Nicola Messina, Lucia Vadicamo, Claudio Vairo
IVOS - The ITEC Interactive Video Object Search System at VBS2021

We present IVOS, an interactive video content search system that allows for object-based search and filtering in video archives. The main idea is to use the results of recent object detection models to index all keyframes with a manageable set of object classes and to allow the user to filter by different characteristics, such as object name, object location, relative object size, object color, and combinations thereof for different object classes – e.g., “large person in white on the left, with a red tie”. In addition, IVOS can also find segments with a specific number of objects of a particular class (e.g., “many apples” or “two people”) and supports similarity search based on similar object occurrences.

Anja Ressmann, Klaus Schoeffmann
Video Search with Sub-Image Keyword Transfer Using Existing Image Archives

This paper presents details of our frame-based Ad-hoc Video Search system with manually assisted querying that will be used for the Video Browser Showdown 2021 (VBS2021). The main contributions of our new system consist of an improved automatic keywording component, better visual feature vectors which have been fine-tuned for the task of image retrieval, and an improved visual presentation of the search results. Additionally, we use a more powerful joint textual/visual search engine based on Lucene, which can perform a search according to the temporal sequence of textual or visual properties of the video frames.

Nico Hezel, Konstantin Schall, Klaus Jung, Kai Uwe Barthel
A VR Interface for Browsing Visual Spaces at VBS2021

The Video Browser Showdown (VBS) is an annual competition in which each participant prepares an interactive video retrieval system and partakes in a live comparative evaluation at the annual MMM Conference. In this paper, we introduce Eolas, a prototype video/image retrieval system incorporating a novel virtual reality (VR) interface. For VBS’21, Eolas represents each keyframe of the collection by an embedded feature in a latent vector space, into which a query is also projected to facilitate retrieval within a VR environment. A user can then explore the space and perform one of a number of filter operations to traverse the space and locate the correct result.

Ly-Duyen Tran, Manh-Duy Nguyen, Thao-Nhu Nguyen, Graham Healy, Annalina Caputo, Binh T. Nguyen, Cathal Gurrin
Correction to: SQL-Like Interpretable Interactive Video Search

The original version of the book was inadvertently published with an incorrect acknowledgement in chapter 34. The acknowledgement has been corrected and reads as follows: “Acknowledgement: The research was partially supported by the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 1 grant and the National Natural Science Foundation of China (No. 61872256).” The affiliation of the third author, Zhixin Ma, was incorrect. In the contribution it read “School of Information System”, but correctly it should be “School of Computing and Information Systems”. The affiliation of the last author, Chong-Wah Ngo, was not correct. In the book it read “Department of Computer Science, City University of Hong Kong, Hong Kong, China”. Instead, the correct affiliation is: “School of Computing and Information Systems, Singapore Management University, Singapore, Singapore”. Additionally, his e-mail address “” was also incorrect. The correct e-mail address is: “”. The chapter and the book have been updated with the changes.

Jiaxin Wu, Phuong Anh Nguyen, Zhixin Ma, Chong-Wah Ngo