MultiMedia Modeling
27th International Conference, MMM 2021, Prague, Czech Republic, June 22–24, 2021, Proceedings, Part II
- 2021
- Book
- Editors
- Jakub Lokoč
- Prof. Tomáš Skopal
- Prof. Dr. Klaus Schoeffmann
- Vasileios Mezaris
- Dr. Xirong Li
- Dr. Stefanos Vrochidis
- Dr. Ioannis Patras
- Book Series
- Lecture Notes in Computer Science
- Publisher
- Springer International Publishing
About this book
The two-volume set LNCS 12572 and 1273 constitutes the thoroughly refereed proceedings of the 27th International Conference on MultiMedia Modeling, MMM 2021, held in Prague, Czech Republic, in June2021.
Of the 211 submitted regular papers, 40 papers were selected for oral presentation and 33 for poster presentation; 16 special session papers were accepted as well as 2 papers for a demo presentation and 17 papers for participation at the Video Browser Showdown 2021. The papers cover topics such as: multimedia indexing; multimedia mining; multimedia abstraction and summarization; multimedia annotation, tagging and recommendation; multimodal analysis for retrieval applications; semantic analysis of multimedia and contextual data; multimedia fusion methods; multimedia hyperlinking; media content browsing and retrieval tools; media representation and algorithms; audio, image, video processing, coding and compression; multimedia sensors and interaction modes; multimedia privacy, security and content protection; multimedia standards and related issues; advances in multimedia networking and streaming; multimedia databases, content delivery and transport; wireless and mobile multimedia networking; multi-camera and multi-view systems; augmented and virtual reality, virtual environments; real-time and interactive multimedia applications; mobile multimedia applications; multimedia web applications; multimedia authoring and personalization; interactive multimedia and interfaces; sensor networks; social and educational multimedia applications; and emerging trends.
Table of Contents
-
Frontmatter
-
MSCANet: Adaptive Multi-scale Context Aggregation Network for Congested Crowd Counting
Yani Zhang, Huailin Zhao, Fangbo Zhou, Qing Zhang, Yanjiao Shi, Lanjun LiangThe chapter introduces MSCANet, a pioneering approach to congested crowd counting using an adaptive multi-scale context aggregation module (MSCA). MSCANet effectively handles challenges such as occlusion and scale variations by adaptively aggregating contextual information across different scales. This innovative method leverages atrous convolution with varying dilation rates and a channel attention mechanism to produce rich, global scene representations. The network's architecture, comprising multiple MSCA modules, allows it to efficiently process and regress crowd density maps. Extensive experiments on benchmark datasets demonstrate MSCANet's superior performance over state-of-the-art methods, particularly in congested crowd scenes. The chapter also delves into the design and effectiveness of the MSCA module, highlighting its advantages over other context-based approaches.AI Generated
This summary of the content was generated with the help of AI.
AbstractCrowd counting has achieved significant progress with deep convolutional neural networks. However, most of the existing methods don’t fully utilize spatial context information, and it is difficult for them to count the congested crowd accurately. To this end, we propose a novel Adaptive Multi-scale Context Aggregation Network (MSCANet), in which a Multi-scale Context Aggregation module (MSCA) is designed to adaptively extract and aggregate the contextual information from different scales of the crowd. More specifically, for each input, we first extract multi-scale context features via atrous convolution layers. Then, the multi-scale context features are progressively aggregated via a channel attention to enrich the crowd representations in different scales. Finally, a \(1\times 1\) convolution layer is applied to regress the crowd density. We perform extensive experiments on three public datasets: ShanghaiTech Part_A, UCF_CC_50 and UCF-QNRF, and the experimental results demonstrate the superiority of our method compared to current the state-of-the-art methods. -
Tropical Cyclones Tracking Based on Satellite Cloud Images: Database and Comprehensive Study
Cheng Huang, Sixian Chan, Cong Bai, Weilong Ding, Jinglin ZhangThis chapter introduces TCTSCI, a pioneering database designed for tracking tropical cyclones using satellite cloud images and comprehensive meteorological data. It addresses the critical need for accurate tropical cyclone forecasting by leveraging advanced satellite imagery and meteorological data. The chapter evaluates various state-of-the-art trackers on the TCTSCI database, highlighting the superior performance of deep-learning-based trackers. It also underscores the potential of multi-modal data-based trackers for enhancing tropical cyclone tracking. The evaluation results and the unique challenges posed by the database offer valuable insights for future research in this interdisciplinary field.AI Generated
This summary of the content was generated with the help of AI.
AbstractThe tropical cyclone is one of disaster weather that cause serious damages for human community. It is necessary to forecast the tropical cyclone efficiently and accurately for reducing the loss caused by tropical cyclones. With the development of computer vision and satellite technology, high quality meteorological data can be got and advanced technologies have been proposed in visual tracking domain. This makes it possible to develop algorithms to do the automatic tropical cyclone tracking which plays a critical role in tropical cyclone forecast. In this paper, we present a novel database for Typical Cyclone Tracking based on Satellite Cloud Image, called TCTSCI. To the best of our knowledge, TCTSCI is the first satellite cloud image database of tropical cyclone tracking. It consists of 28 video sequences and totally 3,432 frames with \(6001\times 6001\) pixels. It includes tropical cyclones of five different intensities distributing in 2019. Each frame is scientifically inspected and labeled with the authoritative tropical cyclone data. Besides, to encourage and facilitate research of multimodal methods for tropical cyclone tracking, TCTSCI provides not only visual bounding box annotations but multimodal meteorological data of tropical cyclones. We evaluate 11 state-of-the-art and widely used trackers by using OPE and EAO metrics and analyze the challenges on TCTSCI for these trackers. -
Image Registration Improved by Generative Adversarial Networks
Shiyan Jiang, Ci Wang, Chang HuangThe chapter 'Image Registration Improved by Generative Adversarial Networks' delves into the challenges of registering images captured by mobile phones, which are often plagued by distortions such as noise, blur, and uneven degradation. Traditional image registration methods, including template-based, domain-based, and feature-based approaches, are discussed and found to be insufficient for handling these distortions. The authors propose a novel GAN architecture that combines pixel-wise loss and content loss to enhance image quality for better registration. The chapter also introduces a synthetic loss function and a robust network structure, including a generator and discriminator network, to achieve competitive balance. Experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed method in enhancing image quality and increasing the number of matching points for registration. The chapter concludes by highlighting the practical applications of the enhanced image registration in mobile phone imaging.AI Generated
This summary of the content was generated with the help of AI.
AbstractThe performances of most image registrations will decrease if the quality of the image to be registered is poor, especially contaminated with heavy distortions such as noise, blur, and uneven degradation. To solve this problem, a generative adversarial networks (GANs) based approach and the specified loss functions are proposed to improve image quality for better registration. Specifically, given the paired images, the generator network enhances the distorted image and the discriminator network compares the enhanced image with the ideal image. To efficiently discriminate the enhanced image, the loss function is designed to describe the perceptual loss and the adversarial loss, where the former measures the image similarity and the latter pushes the enhanced solution to natural image manifold. After enhancement, image features are more accurate and the registrations between feature point pairs will be more consistent. -
Deep 3D Modeling of Human Bodies from Freehand Sketching
Kaizhi Yang, Jintao Lu, Siyu Hu, Xuejin ChenThe chapter delves into the advanced techniques of deep 3D modeling of human bodies from freehand sketches, addressing the challenges posed by the non-rigid nature and articulation of human bodies. By employing deep neural networks, the authors propose a skeleton-aware interpretation neural network that effectively maps coarse and sparse sketches to high-quality body meshes. The method combines non-parametric joint regression with parametric body representation, utilizing the SMPL model to produce naturally-looking 3D models. The chapter also highlights the creation of a large-scale dataset for training and evaluating the model, demonstrating the system's effectiveness through quantitative and qualitative tests. The innovative approach enables common users to create and edit high-quality 3D body models interactively, showcasing the potential for practical applications in various fields.AI Generated
This summary of the content was generated with the help of AI.
AbstractCreating high-quality 3D human body models by freehand sketching is challenging because of the sparsity and ambiguity of hand-drawn strokes. In this paper, we present a sketch-based modeling system for human bodies using deep neural networks. Considering the large variety of human body shapes and poses, we adopt the widely-used parametric representation, SMPL, to produce high-quality models that are compatible with many further applications, such as telepresence, game production, and so on. However, precisely mapping hand-drawn sketches to the SMPL parameters is non-trivial due to the non-linearity and dependency between articulated body parts. In order to solve the huge ambiguity in mapping sketches onto the manifold of human bodies, we introduce the skeleton as the intermediate representation. Our skeleton-aware modeling network first interprets sparse joints from coarse sketches and then predicts the SMPL parameters based on joint-wise features. This skeleton-aware intermediate representation effectively reduces the ambiguity and complexity between the two high-dimensional spaces. Based on our light-weight interpretation network, our system supports interactive creation and editing of 3D human body models by freehand sketching. -
Two-Stage Real-Time Multi-object Tracking with Candidate Selection
Fan Wang, Lei Luo, En ZhuThe chapter delves into the critical challenges of real-time multi-object tracking (MOT) in computer vision, emphasizing the need for efficient data association, object detection, and appearance embedding. It introduces a two-stage deep network that not only accomplishes object detection and appearance embedding simultaneously but also employs a candidate selection process to mitigate the impact of unreliable detection results. The proposed method leverages a cascade data association strategy, utilizing spatial information and deeply learned person re-identification features to achieve accurate and real-time tracking. The chapter also highlights the experimental results on the MOT16 dataset, showcasing the method's superior performance in metrics such as MOTA and IDF1 compared to other state-of-the-art trackers. Additionally, ablation studies demonstrate the effectiveness of the candidate selection and embedding feature components in improving tracking performance. Overall, the chapter offers a comprehensive and innovative approach to real-time MOT, making it a valuable read for specialists in the field.AI Generated
This summary of the content was generated with the help of AI.
AbstractIn recent years, multi-object tracking is usually treated as a data association problem based on detection results, also known as tracking-by-detection. Such methods are often difficult to adapt to the requirements of time-critical video analysis applications which consider detection and tracking together. In this paper, we propose to accomplish object detection and appearance embedding via a two-stage network. On the one hand, we accelerate network inference process by sharing a set of low-level features and introducing a Position-Sensitive RoI pooling layer to better estimate the classification probability. On the other hand, to handle unreliable detection results produced by the two-stage network, we select candidates from outputs of both detection and tracking based on a novel scoring function which considers classification probability and tracking confidence together. In this way, we can achieve an effective trade-off between multi-object tracking accuracy and speed. Moreover, we conduct a cascade data association based on the selected candidates to form object trajectories. Extensive experiments show that each component of the tracking framework is effective and our real-time tracker can achieve state-of-the-art performance. -
Tell as You Imagine: Sentence Imageability-Aware Image Captioning
Kazuki Umemura, Marc A. Kastner, Ichiro Ide, Yasutomo Kawanishi, Takatsugu Hirayama, Keisuke Doman, Daisuke Deguchi, Hiroshi MuraseThe chapter 'Tell as You Imagine: Sentence Imageability-Aware Image Captioning' presents a groundbreaking method for generating image captions with different levels of visual descriptiveness. By incorporating the psycholinguistic concept of 'imageability,' the proposed model can tailor captions to various applications, such as news articles or visually impaired assistance. The method involves augmenting image caption datasets, calculating sentence imageability scores, and modifying existing image captioning models to generate diverse captions. Experiments demonstrate the effectiveness of this approach, showing that the generated captions match the desired imageability scores and are perceived as such by human evaluators. This work opens new avenues for more context-aware and application-specific image captioning systems.AI Generated
This summary of the content was generated with the help of AI.
AbstractImage captioning as a multimedia task is advancing in terms of performance in generating captions for general purposes. However, it remains difficult to tailor generated captions to different applications. In this paper, we propose a sentence imageability-aware image captioning method to generate captions tailoring to various applications. Sentence imageability describes how easily the caption can be mentally imagined. This concept is applied to the captioning model to obtain a better understanding of the perception of a generated caption. First, we extend an existing image caption dataset by augmenting its captions’ diversity. Then, a sentence imageability score for each augmented caption is calculated. A modified image captioning model is trained using this extended dataset to generate captions tailoring to a specified imageability score. Experiments showed promising results in generating imageability-aware captions. Especially, results from a subjective experiment showed that the perception of the generated captions correlates with the specified score. -
Deep Face Swapping via Cross-Identity Adversarial Training
Shuhui Yang, Han Xue, Jun Ling, Li Song, Rong XieThis chapter delves into the latest developments in face swapping technology, highlighting the challenges and limitations of existing auto-encoder based methods. It introduces a groundbreaking cross-identity adversarial training framework designed to address these issues. By incorporating spatial attention mechanisms and robust adversarial training strategies, the proposed approach achieves highly realistic face swapping results, even in complex illumination conditions. The chapter presents extensive experiments and quantitative analyses, demonstrating the superior performance of the proposed method over state-of-the-art techniques. Additionally, it explores the model's robustness in challenging scenarios such as cross-gender and cross-race face swapping, showcasing its potential in diverse applications across various industries.AI Generated
This summary of the content was generated with the help of AI.
AbstractGenerative Adversarial Networks (GANs) have shown promising improvements in face synthesis and image manipulation. However, it remains difficult to swap the faces in videos with a specific target. The most well-known face swapping method, Deepfakes, focuses on reconstructing the face image with auto-encoder while paying less attention to the identity gap between the source and target faces, which causes the swapped face looks like both the source face and the target face. In this work, we propose to incorporate cross-identity adversarial training mechanism for highly photo-realistic face swapping. Specifically, we introduce corresponding discriminator to faithfully try to distinguish the swapped faces, reconstructed faces and real faces in the training process. In addition, attention mechanism is applied to make our network robust to variation of illumination. Comprehensive experiments are conducted to demonstrate the superiority of our method over baseline models in quantitative and qualitative fashion. -
Res2-Unet: An Enhanced Network for Generalized Nuclear Segmentation in Pathological Images
Shuai Zhao, Xuanya Li, Zhineng Chen, Chang Liu, Changgen PengThe chapter delves into the challenges and advancements in nuclear segmentation within digital pathology. It introduces Res2-Unet, an enhanced U-Net architecture designed to improve the accuracy and generalization of nuclear segmentation in pathological images. The authors address the complexities of nuclear segmentation, such as varying staining, overlapping nuclei, and tumor heterogeneity. Res2-Unet incorporates advanced residual modules and attention mechanisms to enhance feature extraction and fusion, leading to superior performance in segmentation tasks. The chapter presents extensive experiments on public datasets, demonstrating the effectiveness of Res2-Unet in achieving accurate and generalized nuclear segmentation. The results highlight the potential of deep learning techniques in advancing the field of digital pathology and medical imaging.AI Generated
This summary of the content was generated with the help of AI.
AbstractThe morphology of nuclei in a pathological image plays an essential role in deriving high-quality diagnosis to pathologists. Recently, deep learning techniques have pushed forward this field significantly in the generalization ability, i.e., segmenting nuclei from different patients and organs by using the same CNN model. However, it remains challenging to design an effective network that segments nuclei accurately, due to their diverse color and morphological appearances, nuclei touching or overlapping, etc. In this paper, we propose a novel network named Res2-Unet to relief this problem. Res2-Unet inherits the contracting-expansive structure of U-Net. It is featured by employing advanced network modules such as the residual and squeeze-and-excitation (SE) to enhance the segmentation capability. The residual module is utilized in both contracting and expansive paths for comprehensive feature extraction and fusion, respectively. While the SE module enables selective feature propagation between the two paths. We evaluate Res2-Unet on two public nuclei segmentation benchmarks. The experiments show that by equipping the modules individually and jointly, performance gains are consistently observed compared to the baseline and several existing methods. -
Automatic Diagnosis of Glaucoma on Color Fundus Images Using Adaptive Mask Deep Network
Gang Yang, Fan Li, Dayong Ding, Jun Wu, Jie XuThe chapter delves into the critical issue of glaucoma diagnosis, a leading cause of vision loss worldwide. It highlights the challenges in manual diagnosis and the potential of AI in addressing these issues. The authors introduce AMNet, a deep learning model that incorporates an attention mechanism to focus on key diagnostic areas such as the optic disc and retinal nerve fiber layer. The method is validated through experiments on multiple datasets, demonstrating its effectiveness in enhancing diagnostic accuracy and robustness. The chapter also provides a comprehensive overview of related techniques and comparisons with other models, making it a valuable resource for professionals in the field.AI Generated
This summary of the content was generated with the help of AI.
AbstractGlaucoma, a disease characterized by the progressive and irreversible defect of the visual field, requires a lifelong course of treatment once it is confirmed, which highlights the importance of glaucoma early detection. Due to the diversity of glaucoma diagnostic indicators and the diagnostic uncertainty of ophthalmologists, deep learning has been applied to glaucoma diagnosis by automatically extracting characteristics from color fundus images, and that has achieved great performance recently. In this paper, we propose a novel adaptive mask deep network to obtain effective glaucoma diagnosis on retinal fundus images, which fully utilizes the prior knowledge of ophthalmologists on glaucoma diagnosis to synthesize attention masks of color fundus images to locate a reasonable region of interest. Based on the synthesized masks, our method could pay careful attention to the effective visual representation of glaucoma. Experiments on several public and private fundus datasets illustrate that our method could focus on the significant area of glaucoma diagnosis and simultaneously achieve great performance in both academic environments and practical medical applications, which provides a useful contribution to improve the automatic diagnosis of glaucoma. -
Initialize with Mask: For More Efficient Federated Learning
Zirui Zhu, Lifeng SunThe chapter addresses the critical issue of data heterogeneity in federated learning, which slows down convergence and degrades model performance. It introduces Federated Mask (FedMask), a novel approach that retains important local parameters during initialization using Fisher Information Matrix. Additionally, it employs Maximum Mean Discrepancy constraints to stabilize the training process. Experiments on the MNIST dataset demonstrate that FedMask outperforms baseline methods, achieving a 55% faster convergence speed and a 2% improvement in model performance under heterogeneous data conditions. The chapter also proposes a new general evaluation method for federated learning, making it a valuable resource for professionals seeking to advance the state of the art in this field.AI Generated
This summary of the content was generated with the help of AI.
AbstractFederated Learning (FL) is a machine learning framework proposed to utilize the large amount of private data of edge nodes in a distributed system. Data at different edge nodes often shows strong heterogeneity, which makes the convergence speed of federated learning slow and the trained model does not perform well at the edge. In this paper, we propose Federated Mask (FedMask) to address this problem. FedMask uses Fisher Information Matrix (FIM) as a mask when initializing the local model with the global model to retain the most important parameters for the local task in the local model. Meanwhile, FedMask uses Maximum Mean Discrepancy (MMD) constraint to avoid the instability of the training process. In addition, we propose a new general evaluation method for FL. Following experiments on MNIST dataset show that our method outperforms the baseline method. When the edge data is heterogeneous, the convergence speed of our method is 55% faster than that of the baseline method, and the performance is improved by 2%. -
Unsupervised Gaze: Exploration of Geometric Constraints for 3D Gaze Estimation
Yawen Lu, Yuxing Wang, Yuan Xin, Di Wu, Guoyu LuThe chapter delves into the challenges of traditional gaze estimation methods, which often rely on intrusive equipment like head-mounted cameras. It introduces an innovative unsupervised deep learning pipeline that leverages geometric and photometric constraints to estimate 3D gaze vectors accurately. The proposed method simulates 3D reconstruction and ego-motion estimation, utilizing multi-view image sequences to refine depth maps and gaze estimation. Experimental results demonstrate the effectiveness and generalization ability of this approach, showing comparable performance to supervised methods without the need for extensive annotated data.AI Generated
This summary of the content was generated with the help of AI.
AbstractEye gaze estimation can provide critical evidence for people attention, which has extensive applications on cognitive science and computer vision areas, such as human behavior analysis and fake user identification. Existing typical methods mostly place the eye-tracking sensors directly in front of the eyeballs, which is hard to be utilized in the wild. And recent learning-based methods require prior ground truth annotations of gaze vector for training. In this paper, we propose an unsupervised learning-based method for estimating the eye gaze in 3D space. Building on top of the existing unsupervised approach to regress shape parameters and initialize the depth, we propose to apply geometric spectral photometric consistency constraint and spatial consistency constraints across multiple views in video sequences to refine the initial depth values on the detected iris landmark. We demonstrate that our method is able to learn gaze vector in the wild scenes more robust without ground truth gaze annotations or 3D supervision, and show our system leads to a competitive performance compared with existing supervised methods. -
Median-Pooling Grad-CAM: An Efficient Inference Level Visual Explanation for CNN Networks in Remote Sensing Image Classification
Wei Song, Shuyuan Dai, Dongmei Huang, Jinling Song, Liotta AntonioThe chapter delves into the challenges of interpreting deep learning models, particularly in remote sensing image classification. It introduces Median-Pooling Grad-CAM, a method that enhances the localization of objects in saliency maps while maintaining computational efficiency. Additionally, it proposes a new metric, confidence drop %, to evaluate the precision of visual explanations. The chapter also compares Median-Pooling Grad-CAM with other state-of-the-art techniques, demonstrating its effectiveness through extensive experiments on various datasets and CNN models. This work aims to advance the field of visual explanation methods for deep learning models, making them more interpretable and reliable.AI Generated
This summary of the content was generated with the help of AI.
AbstractGradient-based visual explanation techniques, such as Grad-CAM and Grad-CAM++ have been used to interpret how convolutional neural networks make decisions. But not all techniques can work properly in the task of remote sensing (RS) image classification. In this paper, after analyzing why Grad-CAM performs worse than Grad-CAM++ for RS images classification from the perspective of weight matrix of gradients, we propose an efficient visual explanation approach dubbed median-pooling Grad-CAM. It uses median pooling to capture the main trend of gradients and approximates the contributions of feature maps with respect to a specific class. We further propose a new evaluation index, confidence drop %, to express the degree of drop of classification accuracy when occluding the important regions that are captured by the visual saliency. Experiments on two RS image datasets and for two CNN models of VGG and ResNet, show our proposed method offers a good tradeoff between interpretability and efficiency of visual explanation for CNN-based models in RS image classification. The low time-complexity median-pooling Grad-CAM could provide a good complement to the gradient-based visual explanation techniques in practice. -
Multi-granularity Recurrent Attention Graph Neural Network for Few-Shot Learning
Xu Zhang, Youjia Zhang, Zuyu ZhangThe chapter introduces a Multi-granularity Recurrent Attention Graph Neural Network (MRA-GNN) designed to tackle the challenges of few-shot learning in deep learning models. Few-shot learning aims to recognize new classes with limited labeled data, a problem addressed by leveraging prior knowledge from well-labeled source classes. The MRA-GNN employs a Saliency Network to extract foreground images and a Local Proposal Network to generate local images with intra-cluster similarity and inter-cluster dissimilarity. These local images, combined with original images, are embedded into a multi-granularity graph neural network. The model's architecture allows for the propagation of label information from labeled samples to unlabeled query images, significantly improving generalization ability and reducing the risk of overfitting. The chapter also includes extensive experimental results on benchmark datasets, demonstrating the superior performance of MRA-GNN compared to state-of-the-art methods. Ablation studies further validate the effectiveness of the proposed method, highlighting its potential for real-world applications and future research directions.AI Generated
This summary of the content was generated with the help of AI.
AbstractFew-shot learning aims to learn a classifier that classifies unseen classes well with limited labeled samples. Existing meta learning-based works, whether graph neural network or other baseline approaches in few-shot learning, has benefited from the meta-learning process with episodic tasks to enhance the generalization ability. However, the performance of meta-learning is greatly affected by the initial embedding network, due to the limited number of samples. In this paper, we propose a novel Multi-granularity Recurrent Attention Graph Neural Network (MRA-GNN), which employs Multi-granularity graph to achieve better generalization ability for few-shot learning. We first construct the Local Proposal Network (LPN) based on attention to generate local images from foreground images. The intra-cluster similarity and the inter-cluster dissimilarity are considered in the local images to generate discriminative features. Finally, we take the local images and original images as the input of multi-grained GNN models to perform classification. We evaluate our work by extensive comparisons with previous GNN approaches and other baseline methods on two benchmark datasets (i.e., miniImageNet and CUB). The experimental study on both of the supervised and semi-supervised few-shot image classification tasks demonstrates the proposed MRA-GNN significantly improves the performances and achieves the state-of-the-art results we know. -
EEG Emotion Recognition Based on Channel Attention for E-Healthcare Applications
Xu Zhang, Tianzhi Du, Zuyu ZhangThe chapter delves into the significance of emotion recognition in artificial intelligence and its applications in e-healthcare. It highlights the advantages of EEG signals over other physiological signals and reviews existing methods in EEG emotion recognition. The primary focus is on a novel deep learning model, Channel Attention-based Emotion Recognition Networks (CAERN), which employs efficient channel attention mechanisms to enhance feature extraction and improve emotion recognition accuracy. The study includes extensive experiments on two emotional EEG databases, DEAP and SEED, showcasing the model's superior performance compared to traditional and state-of-the-art methods. The chapter concludes by emphasizing the potential of CAERN in addressing the challenges of EEG-based emotion recognition and its implications for e-healthcare applications.AI Generated
This summary of the content was generated with the help of AI.
AbstractEmotion recognition based on EEG is a critical issue in Brain-Computer Interface (BCI). It also plays an important role in the e-healthcare systems, especially in the detection and treatment of patients with depression by classifying the mental states. Unlike previous works that feature extraction using multiple frequency bands leads to a redundant use of information, where similar and noisy features extracted. In this paper, we attempt to overcome this limitation with the proposed architecture, Channel Attention-based Emotion Recognition Networks (CAERN). It can capture more critical and effective EEG emotional features based on the use of attention mechanisms. Further, we employ deep residual networks (ResNets) to capture richer information and alleviate gradient vanishing. We evaluate the proposed model on two datasets: DEAP database and SJTU emotion EEG database (SEED). Compared to other EEG emotion recognition networks, the proposed model yields better performance. This demonstrates that our approach is capable of capturing more effective features for EEG emotion recognition. -
The MovieWall: A New Interface for Browsing Large Video Collections
Marij Nefkens, Wolfgang HürstThe MovieWall interface is designed to address the limitations of current movie streaming services' interfaces, which often focus on targeted search and recommendations. By providing a large grid of movie posters that users can explore using zoom and pan gestures, the MovieWall offers a more engaging and exploratory browsing experience. The concept was validated through a pilot study and a detailed user study, which found that users appreciated the ability to casually browse and discover new movies. The study also explored different arrangements of the movie collection, including random, clustered, and semi-clustered layouts, to understand their impact on user experience and satisfaction. The results suggest that while users enjoyed the interface, the influence of different arrangements on browsing behavior was less apparent than expected. However, the clustered arrangement was preferred by most users, indicating a desire for some level of structure in their browsing experience.AI Generated
This summary of the content was generated with the help of AI.
AbstractStreaming services offer access to huge amounts of movie and video collections, resulting in the need for intuitive interaction designs. Yet, most current interfaces are focused on targeted search, neglecting support for interactive data exploration and prioritizing speed over experience. We present the MovieWall, a new interface that complements such designs by enabling users to randomly browse large movie collections. A pilot study proved the feasibility of our approach. We confirmed this observation with a detailed evaluation of an improved design, which received overwhelmingly positive subjective feedback; 80% of the subjects enjoyed using the application and even more stated that they would use it again. The study also gave insight into concrete characteristics of the implementation, such as the benefit of a clustered visualization. -
Keystroke Dynamics as Part of Lifelogging
Alan F. Smeaton, Naveen Garaga Krishnamurthy, Amruth Hebbasuru SuryanarayanaThe chapter delves into the concept of lifelogging, which involves the automatic collection of digital records about daily activities. It highlights the potential of keystroke dynamics as a valuable data source in lifelogging, offering non-intrusive insights into behavior and trends. The authors present a dataset of keystroke and sleep data collected from four participants over six months, analyzing the consistency and variability of keystroke timing. The chapter also discusses the applications of keystroke dynamics in stress measurement, emotion detection, and writing strategy identification. Despite finding no correlation between keystroke timing and sleep score, the study underscores the need for more fine-grained measures of fatigue. The chapter concludes by advocating for greater use of keystroke dynamics in multimedia lifelogging, suggesting future research directions.AI Generated
This summary of the content was generated with the help of AI.
AbstractIn this paper we present the case for including keystroke dynamics in lifelogging. We describe how we have used a simple keystroke logging application called Loggerman, to create a dataset of longitudinal keystroke timing data spanning a period of up to seven months for four participants. We perform a detailed analysis of this data by examining the timing information associated with bigrams or pairs of adjacently-typed alphabetic characters. We show how the amount of day-on-day variation of the keystroke timing among the top-200 bigrams for participants varies with the amount of typing each would do on a daily basis. We explore how daily variations could correlate with sleep score from the previous night but find no significant relationship between the two. Finally we describe the public release of a portion of this data and we include a series of pointers for future work including correlating keystroke dynamics with mood and fatigue during the day. -
HTAD: A Home-Tasks Activities Dataset with Wrist-Accelerometer and Audio Features
Enrique Garcia-Ceja, Vajira Thambawita, Steven A. Hicks, Debesh Jha, Petter Jakobsen, Hugo L. Hammer, Pål Halvorsen, Michael A. RieglerThe chapter introduces HTAD, a novel dataset for recognizing home task activities using wrist-accelerometer and audio data. It details the data collection process, feature extraction techniques, and the structure of the dataset, which includes activities like sweeping, brushing teeth, and watching TV. The authors present baseline experiments demonstrating the effectiveness of combining accelerometer and audio features for activity classification. The dataset is designed to support reproducibility in activity recognition research and offers a foundation for developing advanced sensor data fusion methods.AI Generated
This summary of the content was generated with the help of AI.
AbstractIn this paper, we present HTAD: A Home Tasks Activities Dataset. The dataset contains wrist-accelerometer and audio data from people performing at-home tasks such as sweeping, brushing teeth, washing hands, or watching TV. These activities represent a subset of activities that are needed to be able to live independently. Being able to detect activities with wearable devices in real-time is important for the realization of assistive technologies with applications in different domains such as elderly care and mental health monitoring. Preliminary results show that using machine learning with the presented dataset leads to promising results, but also there is still improvement potential. By making this dataset public, researchers can test different machine learning algorithms for activity recognition, especially, sensor data fusion methods.
- Title
- MultiMedia Modeling
- Editors
-
Jakub Lokoč
Prof. Tomáš Skopal
Prof. Dr. Klaus Schoeffmann
Vasileios Mezaris
Dr. Xirong Li
Dr. Stefanos Vrochidis
Dr. Ioannis Patras
- Copyright Year
- 2021
- Publisher
- Springer International Publishing
- Electronic ISBN
- 978-3-030-67835-7
- Print ISBN
- 978-3-030-67834-0
- DOI
- https://doi.org/10.1007/978-3-030-67835-7
Accessibility information for this book is coming soon. We're working to make it available as quickly as possible. Thank you for your patience.