
MultiMedia Modeling

27th International Conference, MMM 2021, Prague, Czech Republic, June 22–24, 2021, Proceedings, Part II

  • 2021
  • Book

About this book

The two-volume set LNCS 12572 and 12573 constitutes the thoroughly refereed proceedings of the 27th International Conference on MultiMedia Modeling, MMM 2021, held in Prague, Czech Republic, in June 2021. Of the 211 regular papers submitted, 40 were selected for oral presentation and 33 for poster presentation; 16 special session papers were accepted, as well as 2 papers for demo presentation and 17 papers for participation at the Video Browser Showdown 2021. The papers cover topics such as multimedia indexing; multimedia mining; multimedia abstraction and summarization; multimedia annotation, tagging and recommendation; multimodal analysis for retrieval applications; semantic analysis of multimedia and contextual data; multimedia fusion methods; multimedia hyperlinking; media content browsing and retrieval tools; media representation and algorithms; audio, image and video processing, coding and compression; multimedia sensors and interaction modes; multimedia privacy, security and content protection; multimedia standards and related issues; advances in multimedia networking and streaming; multimedia databases, content delivery and transport; and wireless and mobile multimedia networking.

Table of Contents

  1. MNR-Air: An Economic and Dynamic Crowdsourcing Mechanism to Collect Personal Lifelog and Surrounding Environment Dataset. A Case Study in Ho Chi Minh City, Vietnam

    Dang-Hieu Nguyen, Tan-Loc Nguyen-Tai, Minh-Tam Nguyen, Thanh-Binh Nguyen, Minh-Son Dao
    Abstract
    This paper introduces MNR-Air, an economical and dynamic crowdsourcing mechanism for collecting personal lifelog and associated environment datasets. The mechanism's significant advantage is its use of personal sensor boxes that citizens (and their vehicles) can carry to collect data. The paper also introduces the MNR-HCM dataset, the output of MNR-Air, collected in Ho Chi Minh City, Vietnam. The MNR-HCM dataset contains weather data, air pollution data, GPS data, lifelog images, and citizens' cognition of urban nature on a personal scale. We also introduce AQI-T-RM, an application that helps people plan their travel to avoid as much air pollution as possible while still saving travel time. Finally, we discuss how MNR-Air can contribute to the open data science community and other communities that benefit citizens living in urban areas.
  2. Kvasir-Instrument: Diagnostic and Therapeutic Tool Segmentation Dataset in Gastrointestinal Endoscopy

    Debesh Jha, Sharib Ali, Krister Emanuelsen, Steven A. Hicks, Vajira Thambawita, Enrique Garcia-Ceja, Michael A. Riegler, Thomas de Lange, Peter T. Schmidt, Håvard D. Johansen, Dag Johansen, Pål Halvorsen
    Abstract
    Gastrointestinal (GI) pathologies are periodically screened, biopsied, and resected using surgical tools. Usually, the procedures and the treated or resected areas are not specifically tracked or analysed during or after colonoscopies. Information regarding disease borders, development, and the amount and size of the resected area gets lost. This can lead to poor follow-up and bothersome reassessment difficulties post-treatment. To improve the current standard and to foster more research on the topic, we have released the “Kvasir-Instrument” dataset, which consists of 590 annotated frames containing GI procedure tools such as snares, balloons, and biopsy forceps. Besides the images, the dataset includes ground truth masks and bounding boxes and has been verified by two expert GI endoscopists. Additionally, we provide a baseline for the segmentation of GI tools to promote research and algorithm development. We obtained a Dice coefficient of 0.9158 and a Jaccard index of 0.8578 using a classical U-Net architecture, and a similar Dice coefficient was observed for DoubleUNet. The qualitative results showed that the models did not work for images with specularity or frames with multiple tools, while the best results for both methods were observed on all other types of images. Both qualitative and quantitative results show that the models perform reasonably well, but there is potential for further improvement. Benchmarking using the dataset provides an opportunity for researchers to contribute to the field of automatic endoscopic diagnostic and therapeutic tool segmentation for GI endoscopy.
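The Dice coefficient and Jaccard index quoted in the abstract are standard overlap metrics for binary segmentation masks; the following is a minimal illustrative sketch, not the authors' evaluation code:

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) for two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2.0 * intersection / total if total else 1.0

def jaccard_index(pred, target):
    """Jaccard = |A∩B| / |A∪B| for two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union if union else 1.0
```

Both metrics range from 0 (no overlap) to 1 (identical masks), and Dice is always at least as large as Jaccard for non-trivial overlap.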
  3. CatMeows: A Publicly-Available Dataset of Cat Vocalizations

    Luca A. Ludovico, Stavros Ntalampiras, Giorgio Presti, Simona Cannas, Monica Battini, Silvana Mattiello
    Abstract
    This work presents a dataset of cat vocalizations focusing on the meows emitted in three different contexts: brushing, isolation in an unfamiliar environment, and waiting for food. The dataset contains vocalizations produced by 21 cats belonging to two breeds, namely Maine Coon and European Shorthair. Sounds have been recorded using low-cost devices easily available on the marketplace, and the data acquired are representative of real-world cases both in terms of audio quality and acoustic conditions. The dataset is open-access, released under Creative Commons Attribution 4.0 International licence, and it can be retrieved from the Zenodo web repository.
  4. Search and Explore Strategies for Interactive Analysis of Real-Life Image Collections with Unknown and Unique Categories

    Floris Gisolf, Zeno Geradts, Marcel Worring
    Abstract
    Many real-life image collections contain image categories that are unique to that specific collection and have not been seen before by any human expert analyst or by a machine. This prevents supervised machine learning from being effective and makes evaluation of such an image collection inefficient. Real-life collections call for a multimedia analytics solution in which the expert searches and explores the image collection, supported by machine learning algorithms. We propose a method that covers both exploration and search strategies for such complex image collections. Several strategies are evaluated through an artificial user model. Two user studies were performed, with experts and students respectively, to validate the proposed method. As evaluation of such a method can only be done properly in a real-life application, the proposed method is applied to the MH17 airplane crash photo database, on which we have expert knowledge. To show that the proposed method also helps with other image collections, an image collection created with the Open Image Database is used. We show that by combining image features extracted with a convolutional neural network pretrained on ImageNet 1k, intelligent use of clustering, a well-chosen strategy, and expert knowledge, an image collection such as the MH17 airplane crash photo database can be interactively structured into relevant, dynamically generated categories, allowing the user to analyse an image collection efficiently.
  5. Graph-Based Indexing and Retrieval of Lifelog Data

    Manh-Duy Nguyen, Binh T. Nguyen, Cathal Gurrin
    Abstract
    Understanding the relationships between objects in an image is an important challenge because it can help to describe the actions in the image. In this paper, a graphical data structure named “Scene Graph” is utilized to represent an encoded informative visual relationship graph for an image, which we suggest has a wide range of potential applications. This scene graph is applied and tested in the popular domain of lifelogs, specifically on the challenge of known-item retrieval from lifelogs. In this work, every lifelog image is represented by a scene graph, and at retrieval time this scene graph is compared with a semantic graph parsed from the textual query. The result is combined with location or date information to determine the matching items. The experiment shows that this technique can outperform a conventional method.
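As an illustration of the general idea (not the paper's actual matching algorithm), a scene graph can be approximated as a set of (subject, predicate, object) triples, and each lifelog image scored against a query graph by triple overlap:

```python
def triple_match_score(image_triples, query_triples):
    """Fraction of query triples present in the image's scene graph."""
    if not query_triples:
        return 0.0
    image_set = set(image_triples)
    hits = sum(1 for t in query_triples if t in image_set)
    return hits / len(query_triples)

def rank_images(lifelog, query_triples, top_k=3):
    """Rank lifelog images by how well their scene graphs cover the query.
    `lifelog` maps an image id to its list of (subject, predicate, object) triples."""
    scored = [(img, triple_match_score(triples, query_triples))
              for img, triples in lifelog.items()]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:top_k]
```

In the paper, the score of the best-matching images would then be combined with location or date metadata before producing the final result list.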
  6. On Fusion of Learned and Designed Features for Video Data Analytics

    Marek Dobranský, Tomáš Skopal
    Abstract
    Video cameras have become widely used for indoor and outdoor surveillance. Covering more and more public space in cities, the cameras serve various purposes ranging from security to traffic monitoring, urban life, and marketing. However, with the increasing quantity of utilized cameras and recorded streams, manual video monitoring and analysis become too laborious. The goal is to obtain effective and efficient artificial intelligence models to process the video data automatically and produce the desired features for data analytics. To this end, we propose a framework for real-time video feature extraction that fuses both learned and hand-designed analytical models and is applicable in real-life situations. Nowadays, state-of-the-art models for various computer vision tasks are implemented by deep learning. However, the exhaustive gathering of labeled training data and the computational complexity of the resulting models can often render them impractical. We need to consider the benefits and limitations of each technique and find the synergy between deep learning and analytical models. Deep learning methods are more suited for simpler tasks on large volumes of dense data, while analytical modeling can be sufficient for processing sparse data with complex structures. Our framework follows those principles by taking advantage of multiple levels of abstraction. In a use case, we show how the framework can be set up for an advanced video analysis of urban life.
  7. XQM: Interactive Learning on Mobile Phones

    Alexandra M. Bagi, Kim I. Schild, Omar Shahbaz Khan, Jan Zahálka, Björn Þór Jónsson
    Abstract
    There is an increasing need for intelligent interaction with media collections, and mobile phones are gaining significant traction as the device of choice for many users. In this paper, we present XQM, a mobile approach for intelligent interaction with the user’s media on the phone, tackling the inherent challenges of the highly dynamic nature of mobile media collections and limited computational resources of the mobile device. We employ interactive learning, a method that conducts interaction rounds with the user, each consisting of the system suggesting relevant images based on its current model, the user providing relevance labels, the system’s model retraining itself based on these labels, and the system obtaining a new set of suggestions for the next round. This method is suitable for the dynamic nature of mobile media collections and the limited computational resources. We show that XQM, a full-fledged app implemented for Android, operates on 10K image collections in interactive time (less than 1.4 s per interaction round), and evaluate user experience in a user study that confirms XQM’s effectiveness.
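The interaction-round loop described above can be sketched as follows; the relevance model here is a simple positive-centroid scorer standing in for whatever classifier XQM actually retrains, and `oracle` is a hypothetical stand-in for the user's relevance feedback:

```python
import numpy as np

def suggest(features, labels, unlabeled_ids, k=2):
    """Score unlabeled items by dot-product similarity to the mean of
    positively labelled feature vectors (assumes at least one positive)."""
    pos = np.array([features[i] for i, y in labels.items() if y == 1])
    centroid = pos.mean(axis=0)
    scores = {i: float(features[i] @ centroid) for i in unlabeled_ids}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def interaction_round(features, labels, oracle, k=2):
    """One round: the system suggests images, the user labels them,
    and the labels are folded back into the model for the next round."""
    unlabeled = [i for i in range(len(features)) if i not in labels]
    for i in suggest(features, labels, unlabeled, k):
        labels[i] = oracle(i)  # user provides a relevance label
    return labels
```

Repeating `interaction_round` gives the suggest-label-retrain cycle; XQM's contribution is making such a cycle run in interactive time on-device for 10K-image collections.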
  8. A Multimodal Tensor-Based Late Fusion Approach for Satellite Image Search in Sentinel 2 Images

    Ilias Gialampoukidis, Anastasia Moumtzidou, Marios Bakratsas, Stefanos Vrochidis, Ioannis Kompatsiaris
    Abstract
    Earth Observation (EO) Big Data collections are acquired in large volumes and variety, due to their highly heterogeneous nature. The multimodal character of EO Big Data requires the effective combination of multiple modalities for similarity search. We propose a late fusion mechanism over multiple rankings to combine the results from several uni-modal searches in Sentinel 2 image collections. We first create a K-order tensor from the results of separate searches by visual features, concepts, and spatial and temporal information. Visual concepts and features are based on a vector representation from Deep Convolutional Neural Networks. 2D surfaces of the K-order tensor initially provide candidate retrieved results per ranking position and are merged to obtain the final list of retrieved results. Satellite image patches are used as queries in order to retrieve the most relevant image patches in Sentinel 2 images. Quantitative and qualitative results show that the proposed method outperforms search by a single modality as well as other late fusion methods.
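The paper's fusion scheme is specific to its K-order tensor construction; as a generic illustration of late fusion over uni-modal rankings, here is reciprocal rank fusion, a different, standard technique that also combines per-modality result lists:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ordered result lists (one per modality) into one.
    Each item accumulates 1/(k + rank) across the rankings it appears in,
    so items ranked highly by several modalities rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant `k` damps the influence of any single modality's top ranks; 60 is the value commonly used in the literature.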
  9. Canopy Height Estimation from Spaceborne Imagery Using Convolutional Encoder-Decoder

    Leonidas Alagialoglou, Ioannis Manakos, Marco Heurich, Jaroslav Červenka, Anastasios Delopoulos
    Abstract
    The recent advances in multimedia modeling with deep learning methods have significantly affected remote sensing applications, such as canopy height mapping. Estimating canopy height maps at large scale is an important step towards sustainable ecosystem management. Apart from the standard height estimation method using LiDAR data, other airborne measurement techniques, such as very high-resolution passive airborne imaging, have also been shown to provide accurate estimations. However, those methods suffer from high cost and cannot be used at large scale or frequently. In our study, we adopt a neural network architecture to estimate pixel-wise canopy height from cost-effective spaceborne imagery. A deep convolutional encoder-decoder network, based on the SegNet architecture together with skip connections, is trained to map the multi-spectral pixels of a Sentinel-2 input image to height values via end-to-end learned texture features. Experimental results in a study area of 942 \(\mathrm{km}^2\) yield estimation accuracy similar to or better than a method based on costly airborne images, as well as another state-of-the-art deep learning approach based on spaceborne images.
  10. Implementation of a Random Forest Classifier to Examine Wildfire Predictive Modelling in Greece Using Diachronically Collected Fire Occurrence and Fire Mapping Data

    Alexis Apostolakis, Stella Girtsou, Charalampos Kontoes, Ioannis Papoutsis, Michalis Tsoutsos
    Abstract
    Forest fires cause severe damage to ecosystems, human lives and infrastructure globally. This situation is expected to worsen in the coming decades due to climate change and the anticipated increase in the length and severity of the fire season. Thus, the ability to develop a method that reliably models the risk of fire occurrence is an important step towards preventing, confronting and limiting the disaster. Different approaches building upon Machine Learning (ML) methods for predicting wildfires and deriving a better understanding of fire regimes have been devised. This study demonstrates the development of a Random Forest (RF) classifier to predict “fire”/“non-fire” classes in Greece. For this, a prototype database of validated fires and fire-related features, representative of the Mediterranean ecosystem, has been created. The database is populated with data (e.g. Earth Observation derived biophysical parameters and daily collected climatic and weather data) for a period of nine years (2010–2018). Spatially, it refers to 500 m-wide grid cells where Active Fires (AF) and Burned Areas/Burn Scars (BSM) were reported during that period. By using feature ranking techniques such as Chi-squared and Spearman correlations, the study showcases the most significant wildfire-triggering variables. It also highlights the extent to which the database and the selected feature scheme can be used to successfully train an RF classifier for deriving “fire”/“non-fire” predictions over the country of Greece, in the prospect of generating a dynamic fire risk system for daily assessments.
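A minimal sketch of training such a “fire”/“non-fire” Random Forest with scikit-learn; the synthetic features and the labelling rule below are illustrative assumptions, not the paper's actual database or variables:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-grid-cell features (hypothetically:
# temperature, dryness, wind speed, a vegetation index).
n = 400
X = rng.random((n, 4))
# Toy rule: hot, dry, windy cells are labelled "fire" (1).
y = ((X[:, 0] > 0.6) & (X[:, 1] < 0.4) & (X[:, 2] > 0.5)).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

predictions = clf.predict(X)        # 0 = "non-fire", 1 = "fire"
importances = clf.feature_importances_  # per-feature ranking, sums to 1
```

`feature_importances_` gives a built-in feature ranking that complements the Chi-squared and Spearman analyses mentioned in the abstract.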
  11. Mobile eHealth Platform for Home Monitoring of Bipolar Disorder

    Joan Codina-Filbà, Sergio Escalera, Joan Escudero, Coen Antens, Pau Buch-Cardona, Mireia Farrús
    Abstract
    People suffering from Bipolar Disorder (BD) experience changes in mood status, having depressive or manic episodes with normal periods in between. BD is a chronic disease with a high level of non-adherence to medication, which requires continuous monitoring of patients to detect when they relapse into an episode, so that physicians can take care of them. Here we present MoodRecord, an easy-to-use, non-intrusive, multilingual, robust and scalable platform suitable for home monitoring of patients with BD, which allows physicians and relatives to track the patient's state and receive alarms when abnormalities occur.
    MoodRecord takes advantage of the capabilities of smartphones as communication and recording devices to continuously monitor patients. It automatically records user activity and asks the user to answer questions or record themselves on video, according to a predefined plan designed by physicians. The video is analysed, recognising the mood status from images, and bipolar assessment scores are extracted from speech parameters. The data obtained from the different sources are merged periodically to detect whether a relapse may be starting and, if so, raise the corresponding alarm. The application received a positive evaluation in a pilot with users from three different countries. During the pilot, the predictions of the voice and image modules showed a coherent correlation with the diagnoses performed by clinicians.
  12. Multimodal Sensor Data Analysis for Detection of Risk Situations of Fragile People in @home Environments

    Thinhinane Yebda, Jenny Benois-Pineau, Marion Pech, Hélène Amieva, Laura Middleton, Max Bergelt
    Abstract
    Multimedia (MM) nowadays often means “multimodality”. The target application areas of MM technologies now extend to healthcare: health parameter monitoring and context and situational recognition in ambient assisted living all require tailored solutions. We are interested in developing AI solutions for the prevention of risk situations for fragile people living at home. This research requires tight collaboration between IT researchers, psychologists and kinesiologists. In this paper we present a large collaborative project between such actors to develop future solutions for detecting risk situations of fragile people. We report on the definition of risk scenarios, which have been simulated in the data collected with the developed Android application. Adapted annotation scenarios for sensory and visual data are elaborated. A pilot corpus recorded with healthy volunteers in everyday life situations is presented. Preliminary detection results on the LSC dataset show the complexity of real-life recognition tasks.
  13. Towards the Development of a Trustworthy Chatbot for Mental Health Applications

    Matthias Kraus, Philip Seldschopf, Wolfgang Minker
    Abstract
    Research on conversational chatbots for mental health applications is an emerging topic. Current work focuses primarily on the usability and acceptance of such systems. However, the human-computer trust relationship is often overlooked, even though it is highly important for the acceptance of chatbots in a clinical environment. This paper presents the creation and evaluation of a trustworthy agent using relational and proactive dialogue. A pilot study with non-clinical subjects showed that a relational strategy using empathetic reactions and small talk failed to foster human-computer trust. However, making the initiative more proactive seems to be welcomed, as it is perceived as more reliable and understandable by users.
  14. Fusion of Multimodal Sensor Data for Effective Human Action Recognition in the Service of Medical Platforms

    Panagiotis Giannakeris, Athina Tsanousa, Thanasis Mavropoulos, Georgios Meditskos, Konstantinos Ioannidis, Stefanos Vrochidis, Ioannis Kompatsiaris
    Abstract
    In what has arguably been one of the most troubling periods of recent medical history, with a global pandemic emphasising the importance of staying healthy, innovative tools that shelter patient well-being gain momentum. In that view, a framework is proposed that leverages multimodal data, namely inertial and depth sensor-originating data, can be integrated in health care-oriented platforms, and tackles the crucial task of human action recognition (HAR). To analyse person movement and consequently assess the patient’s condition, an effective methodology is presented that is two-fold: initially, Kinect-based action representations are constructed from handcrafted 3DHOG depth features and the descriptive power of a Fisher encoding scheme. This is complemented by wearable sensor data analysis, using time domain features and then boosted by exploring fusion strategies of minimum expense. Finally, an extended experimental process reveals competitive results in a well-known benchmark dataset and indicates the applicability of our methodology for HAR.
  15. SpotifyGraph: Visualisation of User’s Preferences in Music

    Pavel Gajdusek, Ladislav Peska
    Abstract
    Many music streaming portals recommend lists of songs to their users. These recommendations are often the result of algorithms that are black boxes from the user's perspective. However, irrelevant recommendations without proper justification may considerably hinder the user's trust. Moreover, user profiles in music streaming services tend to be very large, consisting of hundreds of artists and thousands of tracks. So not only are the details of the recommendation procedure hidden from users, but they often lack sufficient knowledge about the source data from which the recommendations are derived. To cope with these challenges, we propose the SpotifyGraph application. The application aims at a comprehensible visualization of the relations within a Spotify user's profile, thereby improving the understandability of the provided recommendations.
  16. A System for Interactive Multimedia Retrieval Evaluations

    Luca Rossetto, Ralph Gasser, Loris Sauter, Abraham Bernstein, Heiko Schuldt
    Abstract
    The evaluation of the performance of interactive multimedia retrieval systems is a methodologically non-trivial endeavour and requires specialized infrastructure. Current evaluation campaigns have so far relied on a local setting, where all retrieval systems needed to be evaluated at the same physical location at the same time. This constraint not only complicates organization and coordination but also limits the number of systems that can reasonably be evaluated within a set time frame. Travel restrictions might further limit the possibility of such evaluations. To address these problems, evaluations need to be conducted in a (geographically) distributed setting, which was so far not possible due to the lack of supporting infrastructure. In this paper, we present the Distributed Retrieval Evaluation Server (DRES), an open-source evaluation system that facilitates evaluation campaigns for interactive multimedia retrieval systems in both traditional on-site and fully distributed settings, and which has already proven effective in a competitive evaluation.
  17. SQL-Like Interpretable Interactive Video Search

    Jiaxin Wu, Phuong Anh Nguyen, Zhixin Ma, Chong-Wah Ngo
    Abstract
    Concept-free search, which embeds text and video signals in a joint space for retrieval, appears to be a new state-of-the-art. However, this new search paradigm suffers from two limitations. First, the search result is unpredictable and not interpretable. Second, the embedded features are in high-dimensional space hindering real-time indexing and search. In this paper, we present a new implementation of the Vireo video search system (Vireo-VSS), which employs a dual-task model to index each video segment with an embedding feature in a low dimension and a concept list for retrieval. The concept list serves as a reference to interpret its associated embedded feature. With these changes, a SQL-like querying interface is designed such that a user can specify the search content (subject, predicate, object) and constraint (logical condition) in a semi-structured way. The system will decompose the SQL-like query into multiple sub-queries depending on the constraint being specified. Each sub-query is translated into an embedding feature and a concept list for video retrieval. The search result is compiled by union or pruning of the search lists from multiple sub-queries. The SQL-like interface is also extended for temporal querying, by providing multiple SQL templates for users to specify the temporal evolution of a query.
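The decomposition-and-combination step can be illustrated with a toy concept index; the index structure and helper names below are assumptions for illustration, not the Vireo-VSS implementation:

```python
def search_subquery(index, triple):
    """Return the video segments whose concept list covers one
    (subject, predicate, object) sub-query."""
    subject, predicate, obj = triple
    return {seg for seg, concepts in index.items()
            if {subject, predicate, obj} <= concepts}

def sql_like_search(index, subqueries, operator="OR"):
    """Combine sub-query result sets by union (OR) or pruning/intersection (AND),
    mirroring how the SQL-like query is split on its logical constraint."""
    results = [search_subquery(index, t) for t in subqueries]
    if not results:
        return set()
    combined = results[0]
    for r in results[1:]:
        combined = combined | r if operator == "OR" else combined & r
    return combined
```

In the real system each sub-query is additionally translated into a low-dimensional embedding, with the concept list kept alongside it so users can interpret why a segment matched.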
  18. VERGE in VBS 2021

    Stelios Andreadis, Anastasia Moumtzidou, Konstantinos Gkountakos, Nick Pantelidis, Konstantinos Apostolidis, Damianos Galanopoulos, Ilias Gialampoukidis, Stefanos Vrochidis, Vasileios Mezaris, Ioannis Kompatsiaris
    Abstract
    This paper presents VERGE, an interactive video search engine that supports efficient browsing and searching into a collection of images or videos. The framework involves a variety of retrieval approaches as well as reranking and fusion capabilities. A Web application enables users to create queries and view the results in a fast and friendly manner.
Title
MultiMedia Modeling
Edited by
Jakub Lokoč
Prof. Tomáš Skopal
Prof. Dr. Klaus Schoeffmann
Vasileios Mezaris
Dr. Xirong Li
Dr. Stefanos Vrochidis
Dr. Ioannis Patras
Copyright Year
2021
Electronic ISBN
978-3-030-67835-7
Print ISBN
978-3-030-67834-0
DOI
https://doi.org/10.1007/978-3-030-67835-7

