MultiMedia Modeling
27th International Conference, MMM 2021, Prague, Czech Republic, June 22–24, 2021, Proceedings, Part II
- 2021
- Book
- Edited by
- Jakub Lokoč
- Prof. Tomáš Skopal
- Prof. Dr. Klaus Schoeffmann
- Vasileios Mezaris
- Dr. Xirong Li
- Dr. Stefanos Vrochidis
- Dr. Ioannis Patras
- Book series
- Lecture Notes in Computer Science
- Publisher
- Springer International Publishing
About this book
The two-volume set LNCS 12572 and 12573 constitutes the thoroughly refereed proceedings of the 27th International Conference on MultiMedia Modeling, MMM 2021, held in Prague, Czech Republic, in June 2021.
Of the 211 submitted regular papers, 40 papers were selected for oral presentation and 33 for poster presentation; 16 special session papers were accepted as well as 2 papers for a demo presentation and 17 papers for participation at the Video Browser Showdown 2021. The papers cover topics such as: multimedia indexing; multimedia mining; multimedia abstraction and summarization; multimedia annotation, tagging and recommendation; multimodal analysis for retrieval applications; semantic analysis of multimedia and contextual data; multimedia fusion methods; multimedia hyperlinking; media content browsing and retrieval tools; media representation and algorithms; audio, image, video processing, coding and compression; multimedia sensors and interaction modes; multimedia privacy, security and content protection; multimedia standards and related issues; advances in multimedia networking and streaming; multimedia databases, content delivery and transport; wireless and mobile multimedia networking; multi-camera and multi-view systems; augmented and virtual reality, virtual environments; real-time and interactive multimedia applications; mobile multimedia applications; multimedia web applications; multimedia authoring and personalization; interactive multimedia and interfaces; sensor networks; social and educational multimedia applications; and emerging trends.
Table of contents
MNR-Air: An Economic and Dynamic Crowdsourcing Mechanism to Collect Personal Lifelog and Surrounding Environment Dataset. A Case Study in Ho Chi Minh City, Vietnam
Dang-Hieu Nguyen, Tan-Loc Nguyen-Tai, Minh-Tam Nguyen, Thanh-Binh Nguyen, Minh-Son Dao
Abstract: This paper introduces MNR-Air, an economical and dynamic crowdsourcing mechanism for collecting personal lifelog and associated environment datasets. The mechanism's significant advantage is its use of personal sensor boxes that can be carried by citizens (and their vehicles) to collect data. The paper also introduces the MNR-HCM dataset, an output of MNR-Air collected in Ho Chi Minh City, Vietnam, which contains weather data, air pollution data, GPS data, lifelog images, and citizens' perception of urban nature on a personal scale. We also introduce AQI-T-RM, an application that helps people plan their travel to avoid as much air pollution as possible while still saving travel time. Finally, we discuss how MNR-Air can contribute to the open data science community and to other communities that benefit citizens living in urban areas.
Kvasir-Instrument: Diagnostic and Therapeutic Tool Segmentation Dataset in Gastrointestinal Endoscopy
Debesh Jha, Sharib Ali, Krister Emanuelsen, Steven A. Hicks, Vajira Thambawita, Enrique Garcia-Ceja, Michael A. Riegler, Thomas de Lange, Peter T. Schmidt, Håvard D. Johansen, Dag Johansen, Pål Halvorsen
Abstract: Gastrointestinal (GI) pathologies are periodically screened, biopsied, and resected using surgical tools. Usually, the procedures and the treated or resected areas are not specifically tracked or analysed during or after colonoscopies. Information regarding disease borders, development, and the amount and size of the resected area is lost. This can lead to poor follow-up and bothersome reassessment difficulties post-treatment. To improve the current standard and to foster more research on the topic, we have released the "Kvasir-Instrument" dataset, which consists of 590 annotated frames containing GI procedure tools such as snares, balloons, and biopsy forceps. Besides the images, the dataset includes ground-truth masks and bounding boxes and has been verified by two expert GI endoscopists. Additionally, we provide a baseline for the segmentation of GI tools to promote research and algorithm development. We obtained a Dice coefficient of 0.9158 and a Jaccard index of 0.8578 using a classical U-Net architecture; a similar Dice coefficient was observed for DoubleUNet. The qualitative results showed that the models did not work well for images with specularity or frames with multiple tools, while both methods performed best on all other types of images. Both qualitative and quantitative results show that the model performs reasonably well, but there is potential for further improvement. Benchmarking on the dataset gives researchers an opportunity to contribute to the field of automatic endoscopic diagnostic and therapeutic tool segmentation for GI endoscopy.
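As an aside, the Dice coefficient and Jaccard index reported for the Kvasir-Instrument baseline are standard overlap metrics for binary segmentation masks. A minimal sketch of their computation over flattened toy masks (illustrative data only, not from the dataset):

```python
def dice(pred, gt):
    """Dice coefficient: 2*|A∩B| / (|A| + |B|) for flattened binary masks."""
    inter = sum(p and g for p, g in zip(pred, gt))
    return 2 * inter / (sum(pred) + sum(gt))

def jaccard(pred, gt):
    """Jaccard index (IoU): |A∩B| / |A∪B| for flattened binary masks."""
    inter = sum(p and g for p, g in zip(pred, gt))
    union = sum(p or g for p, g in zip(pred, gt))
    return inter / union

# Toy flattened masks (1 = tool pixel, 0 = background)
pred = [1, 1, 0, 0, 1, 0]
gt   = [1, 0, 0, 0, 1, 1]
print(dice(pred, gt))     # 2*2 / (3+3) ≈ 0.667
print(jaccard(pred, gt))  # 2 / 4 = 0.5
```

The two metrics are monotonically related (Dice = 2J / (1 + J)), which is why papers often report both from the same predictions.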
CatMeows: A Publicly-Available Dataset of Cat Vocalizations
Luca A. Ludovico, Stavros Ntalampiras, Giorgio Presti, Simona Cannas, Monica Battini, Silvana Mattiello
Abstract: This work presents a dataset of cat vocalizations focusing on the meows emitted in three different contexts: brushing, isolation in an unfamiliar environment, and waiting for food. The dataset contains vocalizations produced by 21 cats belonging to two breeds, namely Maine Coon and European Shorthair. Sounds have been recorded using low-cost devices easily available on the market, and the acquired data are representative of real-world cases in terms of both audio quality and acoustic conditions. The dataset is open-access, released under the Creative Commons Attribution 4.0 International licence, and can be retrieved from the Zenodo web repository.
Search and Explore Strategies for Interactive Analysis of Real-Life Image Collections with Unknown and Unique Categories
Floris Gisolf, Zeno Geradts, Marcel Worring
Abstract: Many real-life image collections contain image categories that are unique to that specific collection and have not been seen before by any human expert analyst or by a machine. This prevents supervised machine learning from being effective and makes evaluation of such an image collection inefficient. Real-life collections call for a multimedia analytics solution in which the expert searches and explores the image collection, supported by machine learning algorithms. We propose a method that covers both exploration and search strategies for such complex image collections. Several strategies are evaluated through an artificial user model. Two user studies were performed, with experts and students respectively, to validate the proposed method. As such a method can only be evaluated properly in a real-life application, it is applied to the MH17 airplane crash photo database, on which we have expert knowledge. To show that the proposed method also helps with other image collections, a collection created from the Open Image Database is used. We show that by combining image features extracted with a convolutional neural network pretrained on ImageNet 1k, intelligent use of clustering, a well-chosen strategy, and expert knowledge, an image collection such as the MH17 airplane crash photo database can be interactively structured into relevant, dynamically generated categories, allowing the user to analyse the collection efficiently.
Graph-Based Indexing and Retrieval of Lifelog Data
Manh-Duy Nguyen, Binh T. Nguyen, Cathal Gurrin
Abstract: Understanding the relationships between objects in an image is an important challenge because it can help to describe the actions in the image. In this paper, a graphical data structure named "Scene Graph" is utilized to represent an encoded, informative visual relationship graph for an image, which we suggest has a wide range of potential applications. This scene graph is applied and tested in the popular domain of lifelogs, specifically on the challenge of known-item retrieval from lifelogs. In this work, every lifelog image is represented by a scene graph, and at retrieval time this scene graph is compared with a semantic graph parsed from a textual query. The result is combined with location or date information to determine the matching items. The experiment shows that this technique can outperform a conventional method.
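As an illustrative aside, a scene graph can be reduced to a set of (subject, predicate, object) triples, and graph comparison of the kind described above can be approximated by a simple triple-overlap score. This is a deliberate simplification with made-up example triples, not the paper's actual matching algorithm:

```python
def graph_similarity(scene_triples, query_triples):
    """Fraction of query triples also present in the image's scene graph."""
    scene = set(scene_triples)
    return sum(t in scene for t in query_triples) / len(query_triples)

# Hypothetical scene graph of one lifelog image, as (subject, predicate, object) triples
image_graph = [("man", "holding", "cup"), ("cup", "on", "table"), ("man", "near", "window")]
# Hypothetical semantic graph parsed from a textual query
query_graph = [("man", "holding", "cup"), ("man", "near", "door")]

print(graph_similarity(image_graph, query_graph))  # 1 of 2 query triples match -> 0.5
```

Ranking images by such a score, then filtering by location or date metadata, mirrors the overall retrieval flow the abstract describes.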
On Fusion of Learned and Designed Features for Video Data Analytics
Marek Dobranský, Tomáš Skopal
Abstract: Video cameras have become widely used for indoor and outdoor surveillance. Covering more and more public space in cities, the cameras serve various purposes ranging from security to traffic monitoring, urban life, and marketing. However, with the increasing number of deployed cameras and recorded streams, manual video monitoring and analysis becomes too laborious. The goal is to obtain effective and efficient artificial intelligence models that process video data automatically and produce the desired features for data analytics. To this end, we propose a framework for real-time video feature extraction that fuses learned and hand-designed analytical models and is applicable in real-life situations. Nowadays, state-of-the-art models for various computer vision tasks are implemented with deep learning. However, the exhaustive gathering of labeled training data and the computational complexity of the resulting models can often render them impractical. We need to consider the benefits and limitations of each technique and find the synergy between deep learning and analytical models. Deep learning methods are better suited for simpler tasks on large volumes of dense data, while analytical modeling can be sufficient for processing sparse data with complex structures. Our framework follows these principles by taking advantage of multiple levels of abstraction. In a use case, we show how the framework can be configured for advanced video analysis of urban life.
XQM: Interactive Learning on Mobile Phones
Alexandra M. Bagi, Kim I. Schild, Omar Shahbaz Khan, Jan Zahálka, Björn Þór Jónsson
Abstract: There is an increasing need for intelligent interaction with media collections, and mobile phones are gaining significant traction as the device of choice for many users. In this paper, we present XQM, a mobile approach for intelligent interaction with the user's media on the phone, tackling the inherent challenges of the highly dynamic nature of mobile media collections and the limited computational resources of the mobile device. We employ interactive learning, a method that conducts interaction rounds with the user, each consisting of the system suggesting relevant images based on its current model, the user providing relevance labels, the system retraining its model on these labels, and the system obtaining a new set of suggestions for the next round. This method suits the dynamic nature of mobile media collections and the limited computational resources. We show that XQM, a full-fledged app implemented for Android, operates on 10K-image collections in interactive time (less than 1.4 s per interaction round), and we evaluate user experience in a user study that confirms XQM's effectiveness.
A Multimodal Tensor-Based Late Fusion Approach for Satellite Image Search in Sentinel 2 Images
Ilias Gialampoukidis, Anastasia Moumtzidou, Marios Bakratsas, Stefanos Vrochidis, Ioannis Kompatsiaris
Abstract: Earth Observation (EO) Big Data collections are acquired in large volumes and with great variety, owing to their highly heterogeneous nature. The multimodal character of EO Big Data requires the effective combination of multiple modalities for similarity search. We propose a late-fusion mechanism over multiple rankings to combine the results from several uni-modal searches in Sentinel 2 image collections. We first create a K-order tensor from the results of separate searches by visual features, concepts, and spatial and temporal information. Visual concepts and features are based on a vector representation from deep convolutional neural networks. 2D surfaces of the K-order tensor initially provide candidate retrieved results per ranking position and are merged to obtain the final list of retrieved results. Satellite image patches are used as queries in order to retrieve the most relevant patches in Sentinel 2 images. Quantitative and qualitative results show that the proposed method outperforms search by a single modality and other late-fusion methods.
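To give a flavour of late fusion over multiple rankings, below is a generic baseline, reciprocal rank fusion, sketched with hypothetical patch identifiers. Note that this is one of the simple alternatives such tensor-based methods are compared against, not the method the paper proposes:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each item accumulates 1 / (k + rank) over all lists
    in which it appears, and items are returned by descending fused score."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical uni-modal result lists for the same query patch
visual   = ["patch_a", "patch_b", "patch_c"]
concept  = ["patch_b", "patch_a", "patch_d"]
temporal = ["patch_b", "patch_c", "patch_a"]

# patch_b tops two of the three lists, so it ranks first after fusion
print(reciprocal_rank_fusion([visual, concept, temporal]))
```

The constant k dampens the influence of top ranks so that a single list cannot dominate the fused ordering.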
Canopy Height Estimation from Spaceborne Imagery Using Convolutional Encoder-Decoder
Leonidas Alagialoglou, Ioannis Manakos, Marco Heurich, Jaroslav Červenka, Anastasios Delopoulos
Abstract: The recent advances in multimedia modeling with deep learning methods have significantly affected remote sensing applications such as canopy height mapping. Estimating canopy height maps at large scale is an important step towards sustainable ecosystem management. Apart from the standard height estimation method using LiDAR data, other airborne measurement techniques, such as very high-resolution passive airborne imaging, have also been shown to provide accurate estimations. However, those methods suffer from high cost and cannot be used at large scale or frequently. In our study, we adopt a neural network architecture to estimate pixel-wise canopy height from cost-effective spaceborne imagery. A deep convolutional encoder-decoder network, based on the SegNet architecture with skip connections, is trained to map the multi-spectral pixels of a Sentinel-2 input image to height values via end-to-end learned texture features. Experimental results in a study area of 942 \(\mathrm{km}^2\) yield similar or better estimation accuracy and resolution in comparison with a method based on costly airborne images, as well as with another state-of-the-art deep learning approach based on spaceborne images.
Implementation of a Random Forest Classifier to Examine Wildfire Predictive Modelling in Greece Using Diachronically Collected Fire Occurrence and Fire Mapping Data
Alexis Apostolakis, Stella Girtsou, Charalampos Kontoes, Ioannis Papoutsis, Michalis Tsoutsos
Abstract: Forest fires cause severe damage to ecosystems, human lives, and infrastructure globally. This situation is expected to worsen in the coming decades due to climate change and the anticipated increase in the length and severity of the fire season. Thus, the ability to reliably model the risk of fire occurrence is an important step towards preventing, confronting, and limiting the disaster. Different approaches building upon Machine Learning (ML) methods have been devised for predicting wildfires and deriving a better understanding of fire regimes. This study demonstrates the development of a Random Forest (RF) classifier to predict "fire"/"non-fire" classes in Greece. For this purpose, a prototype database of validated fires and fire-related features, representative of the Mediterranean ecosystem, has been created. The database is populated with data (e.g. Earth Observation-derived biophysical parameters and daily collected climatic and weather data) for a period of nine years (2010–2018). Spatially, it refers to grid cells 500 m wide in which Active Fires (AF) and Burned Areas/Burn Scars (BSM) were reported during that period. Using feature-ranking techniques such as Chi-squared and Spearman correlations, the study showcases the most significant wildfire-triggering variables. It also highlights the extent to which the database and the selected feature scheme can be used to successfully train an RF classifier that derives "fire"/"non-fire" predictions over Greece, in the prospect of building a dynamic fire-risk system for daily assessments.
Mobile eHealth Platform for Home Monitoring of Bipolar Disorder
Joan Codina-Filbà, Sergio Escalera, Joan Escudero, Coen Antens, Pau Buch-Cardona, Mireia Farrús
Abstract: People suffering from Bipolar Disorder (BD) experience changes in mood, with depressive or manic episodes separated by normal periods. BD is a chronic disease with a high level of non-adherence to medication, requiring continuous monitoring of patients to detect when they relapse into an episode so that physicians can take care of them. Here we present MoodRecord, an easy-to-use, non-intrusive, multilingual, robust, and scalable platform suitable for home monitoring of patients with BD, which allows physicians and relatives to track the patient's state and receive alarms when abnormalities occur. MoodRecord takes advantage of the capabilities of smartphones as communication and recording devices to monitor patients continuously. It automatically records user activity and asks users to answer questions or to record themselves on video, according to a predefined plan designed by physicians. The video is analysed to recognise the mood status from images, and bipolar assessment scores are extracted from speech parameters. The data obtained from the different sources are merged periodically to observe whether a relapse may be starting and, if so, to raise the corresponding alarm. The application received a positive evaluation in a pilot with users from three different countries. During the pilot, the predictions of the voice and image modules showed a coherent correlation with the diagnoses performed by clinicians.
Multimodal Sensor Data Analysis for Detection of Risk Situations of Fragile People in @home Environments
Thinhinane Yebda, Jenny Benois-Pineau, Marion Pech, Hélène Amieva, Laura Middleton, Max Bergelt
Abstract: Multimedia (MM) nowadays often means "multimodality", and the target application area of MM technologies now extends to healthcare. Health parameter monitoring and context and situational recognition in ambient assisted living all require tailored solutions. We are interested in developing AI solutions for the prevention of risk situations of fragile people living at home. This research requires tight collaboration between IT researchers, psychologists, and kinesiologists. In this paper we present a large collaborative project between such actors for developing future solutions for risk-situation detection for fragile people. We report on the definition of risk scenarios, which have been simulated in data collected with the developed Android application. Adapted annotation scenarios for sensory and visual data are elaborated. A pilot corpus recorded with healthy volunteers in everyday-life situations is presented. Preliminary detection results on the LSC dataset show the complexity of real-life recognition tasks.
Towards the Development of a Trustworthy Chatbot for Mental Health Applications
Matthias Kraus, Philip Seldschopf, Wolfgang Minker
Abstract: Research on conversational chatbots for mental health applications is an emerging topic. Current work focuses primarily on the usability and acceptance of such systems. However, the human-computer trust relationship is often overlooked, even though it is highly important for the acceptance of chatbots in a clinical environment. This paper presents the creation and evaluation of a trustworthy agent using relational and proactive dialogue. A pilot study with non-clinical subjects showed that a relational strategy using empathetic reactions and small talk failed to foster human-computer trust. However, changing the initiative to be more proactive seems to be welcomed, as it is perceived as more reliable and understandable by users.
Fusion of Multimodal Sensor Data for Effective Human Action Recognition in the Service of Medical Platforms
Panagiotis Giannakeris, Athina Tsanousa, Thanasis Mavropoulos, Georgios Meditskos, Konstantinos Ioannidis, Stefanos Vrochidis, Ioannis Kompatsiaris
Abstract: In what has arguably been one of the most troubling periods of recent medical history, with a global pandemic emphasising the importance of staying healthy, innovative tools that shelter patient well-being are gaining momentum. In that view, a framework is proposed that leverages multimodal data, namely inertial and depth-sensor data, can be integrated into healthcare-oriented platforms, and tackles the crucial task of human action recognition (HAR). To analyse a person's movement and consequently assess the patient's condition, an effective two-fold methodology is presented: initially, Kinect-based action representations are constructed from handcrafted 3DHOG depth features and the descriptive power of a Fisher encoding scheme. This is complemented by wearable-sensor data analysis using time-domain features, and then boosted by exploring low-cost fusion strategies. Finally, an extensive experimental process reveals competitive results on a well-known benchmark dataset and indicates the applicability of our methodology for HAR.
SpotifyGraph: Visualisation of User’s Preferences in Music
Pavel Gajdusek, Ladislav Peska
Abstract: Many music streaming portals recommend lists of songs to their users. These recommendations are often the results of algorithms that are black boxes from the user's perspective. However, irrelevant recommendations without proper justification may considerably hinder the user's trust. Moreover, user profiles in music streaming services tend to be very large, consisting of hundreds of artists and thousands of tracks. So not only are the details of the recommendation procedure hidden from the user, but users often also lack sufficient knowledge about the source data from which the recommendations are derived. To cope with these challenges, we propose the SpotifyGraph application. The application aims at a comprehensible visualization of the relations within a Spotify user's profile and thereby improves the understandability of the provided recommendations.
A System for Interactive Multimedia Retrieval Evaluations
Luca Rossetto, Ralph Gasser, Loris Sauter, Abraham Bernstein, Heiko Schuldt
Abstract: The evaluation of the performance of interactive multimedia retrieval systems is a methodologically non-trivial endeavour that requires specialized infrastructure. Evaluation campaigns have so far relied on a local setting, where all retrieval systems need to be evaluated at the same physical location at the same time. This constraint not only complicates organization and coordination but also limits the number of systems that can reasonably be evaluated within a set time frame. Travel restrictions might further limit the possibility of such evaluations. To address these problems, evaluations need to be conducted in a (geographically) distributed setting, which has so far not been possible due to the lack of supporting infrastructure. In this paper, we present the Distributed Retrieval Evaluation Server (DRES), an open-source evaluation system that facilitates evaluation campaigns for interactive multimedia retrieval systems in both traditional on-site and fully distributed settings, and which has already proven effective in a competitive evaluation.
SQL-Like Interpretable Interactive Video Search
Jiaxin Wu, Phuong Anh Nguyen, Zhixin Ma, Chong-Wah Ngo
Abstract: Concept-free search, which embeds text and video signals in a joint space for retrieval, appears to be the new state of the art. However, this search paradigm suffers from two limitations. First, the search result is unpredictable and not interpretable. Second, the embedded features live in a high-dimensional space, hindering real-time indexing and search. In this paper, we present a new implementation of the Vireo video search system (Vireo-VSS), which employs a dual-task model to index each video segment with a low-dimensional embedding feature and a concept list for retrieval. The concept list serves as a reference for interpreting its associated embedding feature. With these changes, a SQL-like querying interface is designed so that a user can specify the search content (subject, predicate, object) and constraints (logical conditions) in a semi-structured way. The system decomposes the SQL-like query into multiple sub-queries, depending on the constraints specified. Each sub-query is translated into an embedding feature and a concept list for video retrieval. The search result is compiled by union or pruning of the result lists from the multiple sub-queries. The SQL-like interface is also extended for temporal querying by providing multiple SQL templates with which users can specify the temporal evolution of a query.
VERGE in VBS 2021
Stelios Andreadis, Anastasia Moumtzidou, Konstantinos Gkountakos, Nick Pantelidis, Konstantinos Apostolidis, Damianos Galanopoulos, Ilias Gialampoukidis, Stefanos Vrochidis, Vasileios Mezaris, Ioannis Kompatsiaris
Abstract: This paper presents VERGE, an interactive video search engine that supports efficient browsing and searching in a collection of images or videos. The framework involves a variety of retrieval approaches as well as reranking and fusion capabilities. A web application enables users to create queries and view the results in a fast and friendly manner.
- Title
- MultiMedia Modeling
- Edited by
Jakub Lokoč
Prof. Tomáš Skopal
Prof. Dr. Klaus Schoeffmann
Vasileios Mezaris
Dr. Xirong Li
Dr. Stefanos Vrochidis
Dr. Ioannis Patras
- Copyright year
- 2021
- Electronic ISBN
- 978-3-030-67835-7
- Print ISBN
- 978-3-030-67834-0
- DOI
- https://doi.org/10.1007/978-3-030-67835-7