
MultiMedia Modeling

27th International Conference, MMM 2021, Prague, Czech Republic, June 22–24, 2021, Proceedings, Part II

  • 2021
  • Book

About this book

The two-volume set LNCS 12572 and 12573 constitutes the thoroughly refereed proceedings of the 27th International Conference on MultiMedia Modeling, MMM 2021, held in Prague, Czech Republic, in June 2021.

Of the 211 submitted regular papers, 40 papers were selected for oral presentation and 33 for poster presentation; 16 special session papers were accepted as well as 2 papers for a demo presentation and 17 papers for participation at the Video Browser Showdown 2021. The papers cover topics such as: multimedia indexing; multimedia mining; multimedia abstraction and summarization; multimedia annotation, tagging and recommendation; multimodal analysis for retrieval applications; semantic analysis of multimedia and contextual data; multimedia fusion methods; multimedia hyperlinking; media content browsing and retrieval tools; media representation and algorithms; audio, image, video processing, coding and compression; multimedia sensors and interaction modes; multimedia privacy, security and content protection; multimedia standards and related issues; advances in multimedia networking and streaming; multimedia databases, content delivery and transport; wireless and mobile multimedia networking; multi-camera and multi-view systems; augmented and virtual reality, virtual environments; real-time and interactive multimedia applications; mobile multimedia applications; multimedia web applications; multimedia authoring and personalization; interactive multimedia and interfaces; sensor networks; social and educational multimedia applications; and emerging trends.

Table of Contents

  1. Frontmatter

  2. MSCANet: Adaptive Multi-scale Context Aggregation Network for Congested Crowd Counting

    Yani Zhang, Huailin Zhao, Fangbo Zhou, Qing Zhang, Yanjiao Shi, Lanjun Liang
    The chapter introduces MSCANet, an approach to congested crowd counting built around an adaptive multi-scale context aggregation module (MSCA). MSCANet handles challenges such as occlusion and scale variation by adaptively aggregating contextual information across different scales, leveraging atrous convolutions with varying dilation rates and a channel attention mechanism to produce rich, global scene representations. The network's architecture, comprising multiple MSCA modules, allows it to efficiently process and regress crowd density maps. Extensive experiments on benchmark datasets demonstrate MSCANet's superior performance over state-of-the-art methods, particularly in congested crowd scenes. The chapter also examines the design and effectiveness of the MSCA module, highlighting its advantages over other context-based approaches.
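As a rough illustration of the atrous-convolution building block the abstract mentions, here is a toy 1-D sketch (not the authors' MSCANet implementation; the function name and 1-D setting are illustrative only). Increasing the dilation rate spaces the kernel taps further apart, enlarging the receptive field without adding parameters:

```python
def dilated_conv1d(x, kernel, dilation):
    """'Valid' 1-D atrous (dilated) convolution: kernel taps are spaced
    `dilation` samples apart, so a larger dilation sees a wider context
    window with the same number of weights."""
    span = (len(kernel) - 1) * dilation
    return [sum(k * x[i + j * dilation] for j, k in enumerate(kernel))
            for i in range(len(x) - span)]

# Same 2-tap kernel, wider receptive field as the dilation rate grows.
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
print(dilated_conv1d(signal, [1.0, 1.0], 1))  # [3.0, 5.0, 7.0, 9.0]
print(dilated_conv1d(signal, [1.0, 1.0], 2))  # [4.0, 6.0, 8.0]
```

An MSCA-style module would apply several such convolutions at different dilation rates in parallel and fuse their outputs, with channel attention deciding how much each scale contributes.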
  3. Tropical Cyclones Tracking Based on Satellite Cloud Images: Database and Comprehensive Study

    Cheng Huang, Sixian Chan, Cong Bai, Weilong Ding, Jinglin Zhang
    This chapter introduces TCTSCI, a new database for tropical cyclone tracking that pairs satellite cloud images with comprehensive meteorological data, addressing the need for more accurate tropical cyclone forecasting. The chapter evaluates various state-of-the-art trackers on the TCTSCI database, highlighting the superior performance of deep-learning-based trackers, and underscores the potential of multi-modal data-based trackers for enhancing tropical cyclone tracking. The evaluation results and the unique challenges posed by the database offer valuable insights for future research in this interdisciplinary field.
  4. Image Registration Improved by Generative Adversarial Networks

    Shiyan Jiang, Ci Wang, Chang Huang
    The chapter 'Image Registration Improved by Generative Adversarial Networks' delves into the challenges of registering images captured by mobile phones, which are often plagued by distortions such as noise, blur, and uneven degradation. Traditional image registration methods, including template-based, domain-based, and feature-based approaches, are discussed and found to be insufficient for handling these distortions. The authors propose a novel GAN architecture that combines pixel-wise loss and content loss to enhance image quality for better registration. The chapter also introduces a synthetic loss function and a robust network structure, including a generator and discriminator network, to achieve competitive balance. Experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed method in enhancing image quality and increasing the number of matching points for registration. The chapter concludes by highlighting the practical applications of the enhanced image registration in mobile phone imaging.
  5. Deep 3D Modeling of Human Bodies from Freehand Sketching

    Kaizhi Yang, Jintao Lu, Siyu Hu, Xuejin Chen
    The chapter delves into the advanced techniques of deep 3D modeling of human bodies from freehand sketches, addressing the challenges posed by the non-rigid nature and articulation of human bodies. By employing deep neural networks, the authors propose a skeleton-aware interpretation neural network that effectively maps coarse and sparse sketches to high-quality body meshes. The method combines non-parametric joint regression with parametric body representation, utilizing the SMPL model to produce natural-looking 3D models. The chapter also highlights the creation of a large-scale dataset for training and evaluating the model, demonstrating the system's effectiveness through quantitative and qualitative tests. The innovative approach enables common users to create and edit high-quality 3D body models interactively, showcasing the potential for practical applications in various fields.
  6. Two-Stage Real-Time Multi-object Tracking with Candidate Selection

    Fan Wang, Lei Luo, En Zhu
    The chapter delves into the critical challenges of real-time multi-object tracking (MOT) in computer vision, emphasizing the need for efficient data association, object detection, and appearance embedding. It introduces a two-stage deep network that not only accomplishes object detection and appearance embedding simultaneously but also employs a candidate selection process to mitigate the impact of unreliable detection results. The proposed method leverages a cascade data association strategy, utilizing spatial information and deeply learned person re-identification features to achieve accurate and real-time tracking. The chapter also highlights the experimental results on the MOT16 dataset, showcasing the method's superior performance in metrics such as MOTA and IDF1 compared to other state-of-the-art trackers. Additionally, ablation studies demonstrate the effectiveness of the candidate selection and embedding feature components in improving tracking performance. Overall, the chapter offers a comprehensive and innovative approach to real-time MOT, making it a valuable read for specialists in the field.
  7. Tell as You Imagine: Sentence Imageability-Aware Image Captioning

    Kazuki Umemura, Marc A. Kastner, Ichiro Ide, Yasutomo Kawanishi, Takatsugu Hirayama, Keisuke Doman, Daisuke Deguchi, Hiroshi Murase
    The chapter 'Tell as You Imagine: Sentence Imageability-Aware Image Captioning' presents a method for generating image captions with different levels of visual descriptiveness. By incorporating the psycholinguistic concept of 'imageability,' the proposed model can tailor captions to various applications, such as news articles or assistance for visually impaired users. The method involves augmenting image caption datasets, calculating sentence imageability scores, and modifying existing image captioning models to generate diverse captions. Experiments demonstrate the effectiveness of this approach, showing that the generated captions match the desired imageability scores and are perceived as such by human evaluators. This work opens new avenues for more context-aware and application-specific image captioning systems.
  8. Deep Face Swapping via Cross-Identity Adversarial Training

    Shuhui Yang, Han Xue, Jun Ling, Li Song, Rong Xie
    This chapter delves into the latest developments in face swapping technology, highlighting the challenges and limitations of existing auto-encoder based methods. It introduces a cross-identity adversarial training framework designed to address these issues. By incorporating spatial attention mechanisms and robust adversarial training strategies, the proposed approach achieves highly realistic face swapping results, even in complex illumination conditions. The chapter presents extensive experiments and quantitative analyses, demonstrating the superior performance of the proposed method over state-of-the-art techniques. Additionally, it explores the model's robustness in challenging scenarios such as cross-gender and cross-race face swapping, showcasing its potential in diverse applications across various industries.
  9. Res2-Unet: An Enhanced Network for Generalized Nuclear Segmentation in Pathological Images

    Shuai Zhao, Xuanya Li, Zhineng Chen, Chang Liu, Changgen Peng
    The chapter delves into the challenges and advancements in nuclear segmentation within digital pathology. It introduces Res2-Unet, an enhanced U-Net architecture designed to improve the accuracy and generalization of nuclear segmentation in pathological images. The authors address the complexities of nuclear segmentation, such as varying staining, overlapping nuclei, and tumor heterogeneity. Res2-Unet incorporates advanced residual modules and attention mechanisms to enhance feature extraction and fusion, leading to superior performance in segmentation tasks. The chapter presents extensive experiments on public datasets, demonstrating the effectiveness of Res2-Unet in achieving accurate and generalized nuclear segmentation. The results highlight the potential of deep learning techniques in advancing the field of digital pathology and medical imaging.
  10. Automatic Diagnosis of Glaucoma on Color Fundus Images Using Adaptive Mask Deep Network

    Gang Yang, Fan Li, Dayong Ding, Jun Wu, Jie Xu
    The chapter delves into the critical issue of glaucoma diagnosis, a leading cause of vision loss worldwide. It highlights the challenges in manual diagnosis and the potential of AI in addressing these issues. The authors introduce AMNet, a deep learning model that incorporates an attention mechanism to focus on key diagnostic areas such as the optic disc and retinal nerve fiber layer. The method is validated through experiments on multiple datasets, demonstrating its effectiveness in enhancing diagnostic accuracy and robustness. The chapter also provides a comprehensive overview of related techniques and comparisons with other models, making it a valuable resource for professionals in the field.
  11. Initialize with Mask: For More Efficient Federated Learning

    Zirui Zhu, Lifeng Sun
    The chapter addresses the critical issue of data heterogeneity in federated learning, which slows down convergence and degrades model performance. It introduces Federated Mask (FedMask), a novel approach that retains important local parameters during initialization using the Fisher Information Matrix. Additionally, it employs Maximum Mean Discrepancy constraints to stabilize the training process. Experiments on the MNIST dataset demonstrate that FedMask outperforms baseline methods, achieving a 55% faster convergence speed and a 2% improvement in model performance under heterogeneous data conditions. The chapter also proposes a new general evaluation method for federated learning, making it a valuable resource for professionals seeking to advance the state of the art in this field.
  12. Unsupervised Gaze: Exploration of Geometric Constraints for 3D Gaze Estimation

    Yawen Lu, Yuxing Wang, Yuan Xin, Di Wu, Guoyu Lu
    The chapter delves into the challenges of traditional gaze estimation methods, which often rely on intrusive equipment like head-mounted cameras. It introduces an innovative unsupervised deep learning pipeline that leverages geometric and photometric constraints to estimate 3D gaze vectors accurately. The proposed method simulates 3D reconstruction and ego-motion estimation, utilizing multi-view image sequences to refine depth maps and gaze estimation. Experimental results demonstrate the effectiveness and generalization ability of this approach, showing comparable performance to supervised methods without the need for extensive annotated data.
  13. Median-Pooling Grad-CAM: An Efficient Inference Level Visual Explanation for CNN Networks in Remote Sensing Image Classification

    Wei Song, Shuyuan Dai, Dongmei Huang, Jinling Song, Liotta Antonio
    The chapter delves into the challenges of interpreting deep learning models, particularly in remote sensing image classification. It introduces Median-Pooling Grad-CAM, a method that enhances the localization of objects in saliency maps while maintaining computational efficiency. Additionally, it proposes a new metric, confidence drop %, to evaluate the precision of visual explanations. The chapter also compares Median-Pooling Grad-CAM with other state-of-the-art techniques, demonstrating its effectiveness through extensive experiments on various datasets and CNN models. This work aims to advance the field of visual explanation methods for deep learning models, making them more interpretable and reliable.
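To make the core idea concrete: standard Grad-CAM weights each activation channel by the spatial mean of its gradients, and the chapter's variant replaces that mean with a median. A minimal pure-Python sketch under that reading (the function name and list-of-lists layout are illustrative, not the authors' code):

```python
import statistics

def median_pool_gradcam(activations, gradients):
    """Class-activation heatmap where each channel's weight is the spatial
    MEDIAN of its gradients (standard Grad-CAM uses the mean), followed by
    a weighted sum over channels and a ReLU.
    activations, gradients: nested lists shaped [channels][H][W]."""
    h, w = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * w for _ in range(h)]
    for act, grad in zip(activations, gradients):
        weight = statistics.median(g for row in grad for g in row)  # median pooling
        for i in range(h):
            for j in range(w):
                heatmap[i][j] += weight * act[i][j]
    return [[max(0.0, v) for v in row] for row in heatmap]  # ReLU keeps positive evidence
```

The median is less sensitive to a few extreme gradient values than the mean, which is a plausible reason such a variant could yield tighter object localization in cluttered remote sensing scenes.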
  14. Multi-granularity Recurrent Attention Graph Neural Network for Few-Shot Learning

    Xu Zhang, Youjia Zhang, Zuyu Zhang
    The chapter introduces a Multi-granularity Recurrent Attention Graph Neural Network (MRA-GNN) designed to tackle the challenges of few-shot learning in deep learning models. Few-shot learning aims to recognize new classes with limited labeled data, a problem addressed by leveraging prior knowledge from well-labeled source classes. The MRA-GNN employs a Saliency Network to extract foreground images and a Local Proposal Network to generate local images with intra-cluster similarity and inter-cluster dissimilarity. These local images, combined with original images, are embedded into a multi-granularity graph neural network. The model's architecture allows for the propagation of label information from labeled samples to unlabeled query images, significantly improving generalization ability and reducing the risk of overfitting. The chapter also includes extensive experimental results on benchmark datasets, demonstrating the superior performance of MRA-GNN compared to state-of-the-art methods. Ablation studies further validate the effectiveness of the proposed method, highlighting its potential for real-world applications and future research directions.
  15. EEG Emotion Recognition Based on Channel Attention for E-Healthcare Applications

    Xu Zhang, Tianzhi Du, Zuyu Zhang
    The chapter delves into the significance of emotion recognition in artificial intelligence and its applications in e-healthcare. It highlights the advantages of EEG signals over other physiological signals and reviews existing methods in EEG emotion recognition. The primary focus is on a novel deep learning model, Channel Attention-based Emotion Recognition Networks (CAERN), which employs efficient channel attention mechanisms to enhance feature extraction and improve emotion recognition accuracy. The study includes extensive experiments on two emotional EEG databases, DEAP and SEED, showcasing the model's superior performance compared to traditional and state-of-the-art methods. The chapter concludes by emphasizing the potential of CAERN in addressing the challenges of EEG-based emotion recognition and its implications for e-healthcare applications.
  16. The MovieWall: A New Interface for Browsing Large Video Collections

    Marij Nefkens, Wolfgang Hürst
    The MovieWall interface is designed to address the limitations of current movie streaming services' interfaces, which often focus on targeted search and recommendations. By providing a large grid of movie posters that users can explore using zoom and pan gestures, the MovieWall offers a more engaging and exploratory browsing experience. The concept was validated through a pilot study and a detailed user study, which found that users appreciated the ability to casually browse and discover new movies. The study also explored different arrangements of the movie collection, including random, clustered, and semi-clustered layouts, to understand their impact on user experience and satisfaction. The results suggest that while users enjoyed the interface, the influence of different arrangements on browsing behavior was less apparent than expected. However, the clustered arrangement was preferred by most users, indicating a desire for some level of structure in their browsing experience.
  17. Keystroke Dynamics as Part of Lifelogging

    Alan F. Smeaton, Naveen Garaga Krishnamurthy, Amruth Hebbasuru Suryanarayana
    The chapter delves into the concept of lifelogging, which involves the automatic collection of digital records about daily activities. It highlights the potential of keystroke dynamics as a valuable data source in lifelogging, offering non-intrusive insights into behavior and trends. The authors present a dataset of keystroke and sleep data collected from four participants over six months, analyzing the consistency and variability of keystroke timing. The chapter also discusses the applications of keystroke dynamics in stress measurement, emotion detection, and writing strategy identification. Despite finding no correlation between keystroke timing and sleep score, the study underscores the need for more fine-grained measures of fatigue. The chapter concludes by advocating for greater use of keystroke dynamics in multimedia lifelogging, suggesting future research directions.
  18. HTAD: A Home-Tasks Activities Dataset with Wrist-Accelerometer and Audio Features

    Enrique Garcia-Ceja, Vajira Thambawita, Steven A. Hicks, Debesh Jha, Petter Jakobsen, Hugo L. Hammer, Pål Halvorsen, Michael A. Riegler
    The chapter introduces HTAD, a novel dataset for recognizing home task activities using wrist-accelerometer and audio data. It details the data collection process, feature extraction techniques, and the structure of the dataset, which includes activities like sweeping, brushing teeth, and watching TV. The authors present baseline experiments demonstrating the effectiveness of combining accelerometer and audio features for activity classification. The dataset is designed to support reproducibility in activity recognition research and offers a foundation for developing advanced sensor data fusion methods.
Title
MultiMedia Modeling
Editors
Jakub Lokoč
Prof. Tomáš Skopal
Prof. Dr. Klaus Schoeffmann
Vasileios Mezaris
Dr. Xirong Li
Dr. Stefanos Vrochidis
Dr. Ioannis Patras
Copyright Year
2021
Electronic ISBN
978-3-030-67835-7
Print ISBN
978-3-030-67834-0
DOI
https://doi.org/10.1007/978-3-030-67835-7

