Digital Multimedia Communications
21st International Forum on Digital TV and Wireless Multimedia Communications, IFTC 2024, Hainan, China, November 28–29, 2024, Revised Selected Papers, Part I
- 2025
- Book
- Edited by
- Guangtao Zhai
- Jun Zhou
- Long Ye
- Hua Yang
- Ping An
- Publisher
- Springer Nature Singapore
About this book
This volume contains 28 selected papers presented at IFTC 2024, the 21st International Forum on Digital TV and Wireless Multimedia Communications, held in Lingshui, Hainan, China, on November 28–29, 2024.
The 55 full papers included in this 2-volume set were carefully reviewed and selected from 146 submissions. They were organized in topical sections as follows:
CCIS 2441: Affective Computing, Graphics & Image Processing for Virtual Reality, Large Language Models, Multimedia Communication, Application of Deep Learning and Video Analysis.
CCIS 2442: Human and Interactive Media, Image Processing, Quality Assessment and Source Coding.
Table of Contents
Frontmatter
Affective Computing
Frontmatter
Spatio-Temporal Scene Graph Reasoning Networks for Emotion Recognition in User-Generated Videos
Di Lu, Xiaobing Li, Yuhang Lu, Qingwen Zhou, Yun Tie
Abstract: This paper proposes a video emotion recognition framework based on visual relationship reasoning between objects and modal-aware fusion. To depict the emotional relationships between regions across several video frames and between different regions within the same frame, we create spatio-temporal scene graphs for the object branch. Using Graphormer to encode the spatio-temporal scene graphs and quantify the emotional intensity between objects, we introduce three structural information encoding methods. To let the visual scene branch provide global context for the object branch, we propose an object-aware knowledge distillation mechanism. For audio streams, we use the Channel Temporal Attention Mechanism (CTAM) to enhance the features of informative spectrogram frames. Finally, we project the acoustic and visual features into two different subspaces for feature learning. Experimental results on the VideoEmotion-8 and Ekman-6 datasets demonstrate the effectiveness of the proposed model.
Graphics and Image Processing for Virtual Reality
Frontmatter
Unsupervised 3D Face Reconstruction Method Based on ITV-Net
Tingze Zhang, Hui Li, Jun Zhou
Abstract: ITV-Net (ImageTransformer and VarEncoder) offers a new approach to the challenge of scarce ground-truth data in 3D face reconstruction. By integrating a Transformer and a Variational Autoencoder (VAE) for encoding and decoding, and by introducing noise perturbations in the latent space, the method enhances feature diversity and representation. Utilizing deep learning and the 3DMM model, it enables fast 3D face reconstruction. Normal-consistency and reflection losses are incorporated into the perceptual loss to constrain the geometric structure of the 3D reconstruction and to improve lighting-reflection accuracy in the projected 2D images. Experiments show that the method performs well in complex scenes, especially under large pose variations.
Attention-Guided Semantic Segmentation Network for High-Dimensional Multi-scale Land Remote Sensing
Guie Jiao, Qinbing Ge
Abstract: Targeting the problems of edge-feature recovery, high-dimensional feature extraction, and context-information fusion in remote sensing image segmentation, this paper proposes an improved multi-scale high-dimensional feature-fusion semantic segmentation network for remote sensing images (MHS-UNet). Building on the U-Net architecture, a multi-branch convolutional attention mechanism and a high-dimensional feature extraction module are introduced into the encoder, effectively enhancing the multi-level capture of remote sensing image features. A dynamic upsampling feature-fusion method is used in the decoder to avoid the information loss caused by feature fusion and to improve the fusion of shallow and deep features. Comparison and ablation experiments on the LoveDA and WHDLD datasets show that the mIoU of MHS-UNet reached 58.60% and 63.67% on the two datasets, respectively, effectively improving the semantic segmentation accuracy of remote sensing images.
Reversible Data Hiding for Encrypted 3D Mesh Model Based on Optimal Grouping Strategy and Multiple-Bit Plane Prediction
Kai Hu, Li Liu, Anhong Wang, Shijie Ma
Abstract: The transmission and storage of massive 3D models in cloud space offer significant convenience, but they also pose significant security threats. To address this, we propose a large-capacity reversible data hiding in the encrypted domain (RDH-ED) algorithm for 3D mesh models based on an optimal grouping strategy and multiple-bit-plane prediction. First, the proposed algorithm uses the optimal grouping strategy to group the vertices of the 3D mesh model. Second, within each group, a multiple-bit-plane prediction algorithm vacates redundant room for embedding secret data. Finally, the receiver uses the embedded flag bits and length labels within each group to achieve lossless model recovery. Experimental results demonstrate that the proposed algorithm is completely reversible, with a secret-data extraction error rate of 0, and exhibits a higher embedding capacity than existing algorithms.
Semantic-Driven Free-View 3D Human Motion Video Composite
Shaolin Wang, Xinyan Yang, Kai Yang, Long Ye, Qin Zhang
Abstract: We propose a text-to-video synthesis method with controllable free perspective. This method performs well for human-centered textual descriptions and ensures the stability and realism of the person's motion from any perspective in the generated videos. Specifically, we analyze the input textual description and decompose it into three retrieval instructions. Different retrieval methods are applied based on the type of retrieval, targeting the key elements from our constructed resource library, including people, backgrounds, and motion sequences. These retrieval methods ensure semantic consistency in video synthesis. Furthermore, we construct a foreground character library using multi-view RGB images and leverage the advantages of 3D reconstruction to implicitly model the retrieved foreground characters. This ensures the stability of the characters during video synthesis and enables free-perspective transformations. To address the limitation of existing methods in generating complex motions, we employ real motion sequences to drive the reconstruction, achieving video synthesis of arbitrary duration. Experimental results demonstrate that our method outperforms existing open-source models across multiple metrics.
RG-GS: Rasterization-Enhanced and Geometric-Guided Gaussian Splatting
Bo Liu, Shengfan Wang, Hongyu Jin, Fei Hu, Li Fang, Wei Zhong
Abstract: Neural Radiance Fields (NeRF) have led to substantial advancements in 3D content generation and rendering techniques. Among these, 3D Gaussian Splatting (3DGS) has become a pivotal approach in computer vision, specifically for scene reconstruction and representation. This paper addresses critical challenges within the 3D Gaussian Splatting algorithm, primarily focusing on issues related to insufficient detail and geometric inaccuracies. We introduce a novel rendering method, Rasterization-Enhanced and Geometric-Guided Gaussian Splatting (RG-GS), which combines enhanced rasterization and geometric guidance to address these limitations. Our approach efficiently approximates ellipses in Gaussian rasterization using area-similar and shape-similar tiles, reducing computational costs while maintaining fine details. Additionally, we incorporate depth information into 3DGS by employing a depth map as a global geometric supervisory signal, guiding the training process to improve geometric reconstruction accuracy. Experimental results demonstrate that our method substantially improves fine texture handling, delivering more vibrant and detailed colors with realistic lighting effects, all while minimizing geometric errors.
MoT: A Mixture of TriPlanes Framework for Frequency-Aware Dynamic Neural Radiance Fields
Zhiwei Liu, Shengfan Wang, Fei Hu, Wei Zhong, Li Fang
Abstract: Dynamic Neural Radiance Fields (Dynamic NeRF) have become increasingly important for 3D scene reconstruction, particularly in modeling dynamic environments. However, they often face challenges in rendering high-frequency details with sufficient quality. To address this issue, we propose a novel frequency-aware approach using the Mixture of TriPlanes (MoT) framework. Inspired by the Mixture of Experts (MoE) paradigm, our method combines high-low frequency and dynamic-static triplanes, allowing for adaptive handling of different scene regions. This design enables specialized triplanes to process varying frequency and temporal components, resulting in more precise and flexible 3D reconstructions. Additionally, we propose a frequency-based feature fusion mechanism that dynamically adjusts weights for blending high- and low-frequency information, improving the representation of complex scenes. Extensive experiments validate the effectiveness of our approach, demonstrating significant improvements over existing dynamic NeRF methods, particularly in capturing high-frequency details and dynamic elements.
High Quality 3D Gaussian Avatar Modeling
Xinglong Peng, Daquan Feng, Qi Zheng, Xiaolin Wei, Jiaxin Lin, Fang Di
Abstract: Creating animatable 3D avatars from monocular videos is a promising topic with broad applications in the virtual realm. Recent 3D Gaussian splatting methods have shown advantages in training and inference compared to previous studies. However, they either zero the opacity or learn the background color at an inaccurate initial position. In this paper, we propose a Gaussian human avatar framework that decouples the estimation of position and color and balances the learning of low-level pixel intensity and high-level semantic details. The framework estimates the position offset via a Position Net and the color information via a Texture Net. By decoupling, the network neither zeroes the opacity nor learns the background color at an inaccurate initial position, and it can learn the exact 3D Gaussian positions on a preset number of Gaussians. In addition, we introduce an adaptive thresholding strategy that dynamically shrinks the LPIPS loss during training. This strategy balances the learning of low-level pixel intensity and high-level semantics. Experimental results on public datasets show that our method achieves better appearance quality and shorter training time.
Content Adaptive Light Field Representation Using Fourier Disparity Layers
Wenjing Ling, Xinpeng Huang, Yaqian Luo, Ping An, Chao Yang
Abstract: Light field (LF) data are widely used in the immersive representations of the 3D world. Due to the vast amount of information they contain, light field data pose significant challenges for compression. The Fourier Disparity Layer (FDL) representation offers an effective method for light field compact representation. However, the high redundancy of information across viewpoints makes using all viewpoints as input for constructing the FDL inefficient. Therefore, it is crucial to construct the FDL representation from sparse viewpoints. Given the varying characteristics of different scene contents, a content-adaptive low-rank algorithm is proposed to optimize the FDL representation based on principal component analysis. In this way, a sparse set of viewpoints containing sufficient scene information is selected as the input for FDL representation. The proposed method demonstrates substantial robustness across diverse scene contents and highlights the significant benefits of flexible sampling in enhancing the efficiency of FDL representation.
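The PCA-based low-rank idea behind this kind of content-adaptive sampling can be sketched in a few lines. The SVD energy threshold and the leverage-score selection rule below are illustrative assumptions, not the authors' exact algorithm:

```python
import numpy as np

def select_viewpoints(views, energy=0.95):
    """Pick a sparse subset of light-field viewpoints via a PCA-style
    low-rank analysis (illustrative sketch; the 95% energy threshold
    and leverage-score ranking are assumptions)."""
    # views: (n_views, H*W) matrix, one flattened image per viewpoint
    X = views - views.mean(axis=0)
    # Singular values measure how much scene information each
    # principal component carries.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cum, energy)) + 1
    # Keep the k viewpoints with the largest leverage scores, i.e.
    # those contributing most to the dominant components.
    leverage = (U[:, :k] ** 2).sum(axis=1)
    return np.argsort(leverage)[::-1][:k]

rng = np.random.default_rng(0)
base = rng.normal(size=(3, 64))            # low-rank scene content
views = rng.normal(size=(9, 3)) @ base     # 9 highly correlated viewpoints
views += 0.01 * rng.normal(size=views.shape)
chosen = select_viewpoints(views)
print("selected", len(chosen), "of", views.shape[0], "viewpoints")
```

Because the nine synthetic viewpoints share a rank-3 structure, only a handful of them are needed to cover 95% of the scene energy.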
Large Language Models
Frontmatter
Research on Legal Question Answering System with Retrieval-Augmented Large Language Models
Nuo Xu, Siben Li, Yufan Xia
Abstract: This paper introduces a Legal Question Answering (LQA) dataset consisting of 10,000 annotated legal question-answer pairs in Chinese. We build a legal question answering system on the LQA dataset by implementing the “retrieve-then-read” pipeline, which can offer answers grounded in pertinent legal statutes. The experimental results validate the efficacy of the dataset in enhancing the performance of the legal question answering system. The constructed LQA dataset can be used to train and refine LQA systems to better understand and respond to legal questions, thus enhancing the capabilities of AI in the legal domain.
Joint Source-Channel Coding with Large Language Model: A Vibrotactile Example
Shuijie Li, Kemi Chen, Runjie Wang, Weiling Chen, Tiesong Zhao
Abstract: Recent advancements in tactile and communication technologies have created new opportunities for immersive virtual reality and remote-control applications. However, a key challenge is developing communication systems that are both efficient and adaptable to changing channel conditions. Achieving seamless, real-time tactile feedback requires advanced encoding strategies that adjust to dynamic factors such as bandwidth, latency, and packet loss. To address these challenges, this paper leverages the capabilities of the Large Language Model (LLM) ChatGPT 4.0, positioning it as an intelligent decision-making component in joint source-channel coding (JSCC) for vibrotactile data transmission. The model can process and respond to real-time channel variations, enabling adaptive encoding decisions. This optimization aligns with system conditions to enhance responsiveness and adaptability. The approach also simplifies development and maintenance while improving scalability and transmission efficiency. Furthermore, we propose a novel deep JSCC framework that integrates modules for semantic extraction, encoding, decoding, and channel feedback. The semantic extraction module effectively captures the essential features of vibrotactile data, while the channel adaptation module ensures resilience to noise. Comparative experiments on the IEEE P1918.1.1 haptic codec task force dataset demonstrate that the model outperforms traditional separate communication schemes and existing JSCC methods. This study offers a new perspective on integrating LLMs with JSCC, advancing tactile communication technology for next-generation immersive experiences.
Persona Extraction and Integration with Large Language Models Towards Personalized Dialogues
Xiaoru Qin, Kaihui Mu, Jiaojiao Li
Abstract: Personalized dialogue systems are widely recognized for generating responses that reflect specific personas. However, existing approaches predominantly rely on predefined persona information, which not only requires substantial upfront manual annotation efforts but also struggles to adapt to dynamic changes in persona. To address these issues, we propose Persona Extraction and Integration (PEI), a two-stage framework based on Large Language Models (LLMs) and LoRA fine-tuning. The framework aims to dynamically capture and integrate personas from dialogue history without predefined persona information, thereby optimizing the effectiveness of personalized dialogue generation. Experimental results show that PEI outperforms baseline models on both Chinese and English personalized dialogue datasets, confirming its superiority in personalized generation tasks.
Multimedia Communication
Frontmatter
MSBA: Adaptive Multi-Stream Data Transmission Method with Bandwidth Awareness for End-Cloud Systems
Qi Guo, Zheming Yang, Chang Zhao, Wen Ji
Abstract: With the rapid growth of the Internet and smart cities, video data has become a primary contributor to total Internet traffic, making video transmission technology essential in modern information systems. However, large-scale video uploads to the cloud are hindered by dynamic network conditions and data redundancy, limiting transmission performance. This paper proposes an adaptive multi-stream transmission method with bandwidth awareness (MSBA) for End-Cloud systems. MSBA introduces feature-compressed streams and semantic video streams, categorizing network environments by available bandwidth and fluctuation levels to adaptively transmit the appropriate data streams. This approach enables the cloud to efficiently process video data under various network conditions by maximizing bandwidth utilization. Experimental results demonstrate that MSBA effectively maintains visual focus-area quality while achieving high compression rates, reducing video transmission delay by 57.22% to 84.98% compared to baseline methods. Overall, our solution can effectively reduce video transmission delay while maintaining the SSIM of the focus area in the video.
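A bandwidth-aware stream selection of this kind can be illustrated with a toy policy. The thresholds and the full-video fallback below are illustrative assumptions; only the feature-compressed and semantic streams are named in the abstract:

```python
def choose_stream(bandwidth_mbps: float, fluctuation: float) -> str:
    """Pick which data stream to upload based on measured bandwidth and
    its fluctuation level (0..1). Thresholds are illustrative assumptions,
    not the MSBA paper's actual parameters."""
    if bandwidth_mbps >= 20 and fluctuation < 0.2:
        return "full-video"           # ample, stable bandwidth: send everything
    if bandwidth_mbps >= 5:
        return "feature-compressed"   # moderate bandwidth: send compact features
    return "semantic"                 # poor or unstable link: semantic stream only

# A stable fast link, a jittery medium link, and a weak link each map
# to a different stream type.
print(choose_stream(50, 0.05), choose_stream(8, 0.5), choose_stream(2, 0.1))
```

The point of the sketch is the shape of the decision, not the numbers: the sender degrades gracefully from raw video to ever more compact representations as the channel worsens.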
Application of Deep Learning
Frontmatter
Fabric Defect Detection Method Based on Unlabeled Compact Deep Learning
Kezhen Lin, Hongwei Sun, Fengnong Chen
Abstract: Traditional fabric defect detection methods often rely on manual inspection or machine vision systems based on hand-crafted features. Deep learning models typically require large amounts of labeled data for training, and in practical applications, obtaining large quantities of high-quality labeled data is both challenging and costly. This study therefore combines deep learning and image processing techniques to propose a fabric defect detection method based on unlabeled compact deep learning. First, we train a compact Convolutional Neural Network (CNN) on a pre-processed fabric defect sample dataset containing 18 categories to obtain an unlabeled deep learning model for fabric defect detection. A fabric defect detection method is then developed on top of this model, capable of detecting defects in fabric image samples. The proposed method was subsequently validated on a real industrial fabric image dataset. The experimental results show that the method based on the unlabeled compact deep learning model improves detection accuracy and efficiency by approximately 80% compared to traditional machine learning. Moreover, the method does not rely on large amounts of labeled data, offering better adaptability and broad application prospects.
Causal Imitation Learning-Based Navigation Algorithm for Drones
Tao Sun, Jiaojiao Gu, Junjie Mou
Abstract: As quadcopters become essential in fields like power line inspections and aerial photography, the need for autonomous obstacle avoidance navigation grows, yet achieving human-level flexibility and safety remains a significant challenge. To solve this challenge, we propose a novel approach to autonomous obstacle avoidance for quadrotor drones in low-altitude environments using a causal imitation learning-based navigation algorithm. An improved A* algorithm is employed to generate expert trajectories that prioritize safety by optimizing the heuristic function to account for both the goal distance and proximity to obstacles. This modification reduces the aggressive path selection issue inherent in traditional A*, resulting in safer navigation and higher success rates during high-speed flight in complex environments. To further enhance generalization, a causal structure graph is constructed to address causal confounding in sequential image data. A causal structure search algorithm, based on the actor-critic method, effectively identifies hidden confounders, improving the apprentice network’s performance in both training and test environments. This approach significantly boosts the robustness and generalizability of the learned policy, making it suitable for real-world deployment.
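The safety-aware heuristic idea (goal distance plus an obstacle-proximity penalty) can be sketched in a grid-world A*. The penalty weight and its inverse-distance form are illustrative assumptions, not the paper's exact formulation:

```python
import heapq

def a_star(grid, start, goal, w_obs=2.0):
    """Grid A* whose heuristic adds an obstacle-proximity penalty to the
    Manhattan goal distance (sketch; `w_obs` and the 1/(1+d) penalty are
    assumptions). `grid[r][c]` is truthy for obstacle cells."""
    rows, cols = len(grid), len(grid[0])

    def nearest_obstacle(p):
        # Manhattan distance to the closest obstacle cell (brute force).
        return min((abs(p[0] - r) + abs(p[1] - c)
                    for r in range(rows) for c in range(cols) if grid[r][c]),
                   default=rows + cols)

    def h(p):
        d_goal = abs(p[0] - goal[0]) + abs(p[1] - goal[1])
        return d_goal + w_obs / (1 + nearest_obstacle(p))  # penalize hugging walls

    open_set = [(h(start), 0, start, [start])]
    seen = set()
    while open_set:
        _, g, cur, path = heapq.heappop(open_set)
        if cur == goal:
            return path
        if cur in seen:
            continue
        seen.add(cur)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and not grid[nxt[0]][nxt[1]]:
                heapq.heappush(open_set, (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None  # no route to goal

grid = [[0, 0, 0],
        [0, 1, 0],   # one obstacle in the middle
        [0, 0, 0]]
path = a_star(grid, (0, 0), (2, 2))
print(path)
```

Note that the obstacle term makes the heuristic inadmissible, so the returned path trades a little optimality for clearance, which matches the abstract's motivation of avoiding aggressive, wall-hugging trajectories.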
Memory-Guided Hierarchical Feature Reconstruction for Multi-class Unsupervised Anomaly Detection
Kai Huang, Shubo Zhou, Weiyu Hu, Yongbin Gao, Feng Pan, Xue-Qin Jiang
Abstract: Unsupervised anomaly detection methods have made significant advancements in addressing real-world industrial anomaly detection tasks. Among these, feature reconstruction-based approaches have shown exceptional performance, particularly in terms of accuracy and real-time processing capabilities. However, in the more practical multi-class anomaly detection scenarios, these methods may fall into an “identical shortcut”, where the model simply returns a copy of the input, resulting in the anomaly features being effectively reconstructed as well. To overcome this, we propose a Memory-guided Hierarchical Feature Reconstruction method for multi-class unsupervised anomaly detection. Firstly, we employ a Memory-guided Feature Alignment (MFA) module to align deep features of normal samples, preventing the “identical shortcut” problem and avoiding the reconstruction of anomalous features. Secondly, we introduce a Position-Aware Spatial Attention (PASA) mechanism to compensate for the loss of positional information in the shallow decoder, enabling improved hierarchical feature reconstruction. We validated the effectiveness of our approach on the MVTec and MVTec LOCO datasets, achieving AUROC scores of 98.6% and 84.2%, respectively, surpassing state-of-the-art methods.
A Method for Surface Defect Detection Based on Denoising and Self-supervised Reconstruction
An Xing, Shubo Zhou, Xue-Qin Jiang, Zhijun Fang, Huanchun Peng
Abstract: Given the rarity and diversity of anomalies, comprehensive anomaly type collection is infeasible. Researchers thus rely on unsupervised learning techniques trained solely on normal samples. Recently, S-T framework-based methods have shown promising results. However, they struggle with detecting structural anomalies resembling normal samples. To tackle this issue, we introduce a self-supervised reconstruction module in the student network’s final layer, which masks the central part of the receptive field and leverages contextual information to predict the masked values. This design encourages the model to learn and utilize surrounding contextual information, enabling a deeper understanding of the intrinsic structure of normal samples and, consequently, improving the detection of potential anomalies. We assess the effectiveness of our approach on the MVTec AD and MVTec LOCO AD datasets. Our experimental results demonstrate that our method achieves state-of-the-art average performance on both datasets.
Transmission Line Bolt Missing Detection Based on Improved YOLOv8 Network
Shounan Bao, Chaofeng Li
Abstract: Transmission lines are essential infrastructure for power supply in modern society. Bolts, as the main fasteners in transmission lines, are crucial for their proper functioning. This paper proposes an improved YOLOv8-based method for detecting bolt defects in transmission lines. Firstly, a dataset for transmission line bolt defects was constructed, and data augmentation techniques were applied to generate more training samples and increase data diversity. Secondly, because the high background complexity in the data samples weakens the model's feature representation capability, DynamicConv (Dynamic Convolution Layer) was used to better capture dynamic features in the input data, thereby enhancing the model's feature representation ability. Finally, the ShuffleAttention mechanism was employed, combining channel-shuffle and attention ideas to promote cross-learning between features and improve the model’s representational and generalization capabilities. This mechanism adaptively learns the importance of different features and adjusts feature weights accordingly, enabling more effective feature fusion and information transfer. Experimental results show that the improved YOLOv8 model performs excellently in detecting bolt defects in transmission lines. Specifically, the improved model achieved a mAP of 86.5% on the self-built dataset, a 6.8% increase over the original model.
A Lightweight Infrared and Visible Image Fusion Method for Object Detection
Chang Zhang, Wen Ji
Abstract: Infrared images provide stable visual information in complex environments where visible-light imaging performs poorly, such as low-light conditions and adverse weather. Many image fusion studies focus on fusing infrared and visible images to achieve complementary multi-source visual information, thereby enhancing overall image quality and information completeness. Meanwhile, with the fast-paced advancement of deep learning technology, computer vision algorithms are increasingly replacing traditional human vision due to their high accuracy and real-time capabilities, undertaking a large amount of visual data analysis. Against this background, this paper selects object detection as a representative computer vision task and proposes a lightweight fusion method for infrared and visible images aimed at object detection. The proposed approach is mainly composed of three components: feature extraction, feature fusion, and image reconstruction. During training, an object detection model is incorporated to optimize the fused image through an object detection loss, thereby improving its adaptability and effectiveness for object detection tasks. Experimental results demonstrate that the proposed infrared and visible-light fusion method is highly effective for object detection, achieving better detection performance than using only infrared or visible-light images. Additionally, the lightweight network model has low parameter complexity, making it well suited for edge devices and real-time applications.
MSO-YOLO: Real-Time Pedestrian Detection Algorithm on Multi-scale and Occlusion Situation
Tong Zhou, Fangfang Lu, Huiqun Yu, Sangyu Yao, Guxue Sun, Yijie Huang
Abstract: Pedestrian detection, as an essential part of object detection, has widespread applications such as automatic driving and construction safety monitoring. However, occlusion and multi-scale situations make pedestrians harder to detect. To tackle these difficulties, this paper proposes MSO-YOLO, an enhanced version of the You Only Look Once (YOLO) algorithm specifically tailored for Multi-Scale and Occlusion situations. The proposed model enhances pedestrian detection performance under occlusion and for multi-scale objects while maintaining real-time detection. We introduce a multi-scale block in the backbone for better multi-scale feature extraction. In the neck of the model, we propose a global and local feature fusion mechanism that improves the detection of multi-scale pedestrians by fusing global and local feature information. We replace the original loss function with an improved Repulsion loss function, which strengthens the model's performance in occlusion scenarios. In experiments on the WiderPerson dataset, the proposed model achieved an improvement of 7.3% in mean average precision and a reduction of 8.7% in miss rate compared to the baseline YOLOv5 model, and it also achieves a good balance between precision and speed in comparison with other classical models.
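The Repulsion loss referenced above penalizes predicted boxes that drift toward neighboring, non-target ground-truth pedestrians. A minimal sketch of one such term, following the published RepGT formulation (smooth-ln of intersection-over-ground-truth); this is not necessarily the paper's exact improved variant:

```python
import math

def iog(box, gt):
    """Intersection of `box` with `gt`, normalized by the area of `gt`.
    Boxes are (x1, y1, x2, y2)."""
    ix = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    iy = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return ix * iy / gt_area

def smooth_ln(x, sigma=0.5):
    # Smoothed -ln(1-x): logarithmic for small overlap, linear beyond sigma.
    if x <= sigma:
        return -math.log(1.0 - x)
    return (x - sigma) / (1.0 - sigma) - math.log(1.0 - sigma)

def rep_gt(pred, other_gts, sigma=0.5):
    """Repulsion term: penalize overlap with the most-overlapped
    NON-target ground-truth box (zero when there is no overlap)."""
    if not other_gts:
        return 0.0
    return smooth_ln(max(iog(pred, g) for g in other_gts), sigma)

# A prediction half-overlapping a neighboring pedestrian is penalized;
# a well-separated prediction is not.
print(rep_gt((0, 0, 2, 2), [(1, 0, 3, 2)]))   # positive penalty
print(rep_gt((0, 0, 1, 1), [(5, 5, 6, 6)]))   # 0.0
```

Adding this term to the usual attraction loss pushes each prediction toward its own target and away from crowded neighbors, which is exactly what helps in occlusion-heavy scenes.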
Implicit Online Saddle Point Optimization
Xia Lei, Qing-xin Meng
Abstract: As a natural generalization of Online Convex Optimization (OCO), Online Saddle Point Optimization (OSPO) involves a sequence of two-player time-varying convex-concave games. Instead of the duality gap used in most convex-concave optimizations, we choose dynamic Nash equilibrium regret (NE-regret) as the performance metric. We demonstrate that the implicit updates used to address OCO with dynamic regret can be extended to solve OSPO with NE-regret. To this end, we design two algorithms, the implicit online mirror descent-ascent and its optimistic variant. Analysis shows that their NE-regrets have the same expression form as the corresponding dynamic regrets of implicit updates in OCO. Empirical results further validate the effectiveness of our algorithms.
MAFNet: Multi-attention Fusion Network for Infrared Small Target Detection
Wangqi Shen, Xiaofei Zhou, Zhi Liu
Abstract: Infrared small target detection is crucial for various applications, including surveillance and remote sensing. A common issue in this field is the loss of target information during the downsampling process, which can significantly impact detection accuracy. To address this challenge, this paper proposes a multi-attention fusion network (MAFNet), which primarily consists of two core modules: the Patch-aware Parallel Reconstructive Attention (PPRA) module and the Frequency Non-local Sparse Dimension Perception (FNSDP) module. The PPRA module enhances the encoder’s feature extraction capability through multi-branch feature extraction and a multi-channel attention parallel mechanism. The FNSDP module improves the model’s detection performance by fusing features from different dimensions, effectively preserving target details during multiple downsampling processes. We evaluate MAFNet on several public datasets, including NUAA-SIRST, IRSTD-1K, and NUDT-SIRST, and the experimental results demonstrate that MAFNet outperforms the current state-of-the-art methods across various detection metrics, highlighting its effectiveness and feasibility.
Personalized Federated Meta-Learning Based on Gradient Clustering and Aggregation
Jiale Chen, Xiaoli Zhao, Hao Pan
Abstract: With the increasing demand for data privacy and security, Federated Learning (FL) has emerged as a critical distributed machine learning paradigm, safeguarding user privacy by exchanging model parameters between clients and servers rather than raw data. However, existing FL methods face two significant challenges when handling non-independent and identically distributed (Non-IID) data: poor model aggregation across users with varying data distributions and a lack of personalization in locally trained models. To address these issues, we propose a federated meta-learning method based on client gradient clustering and aggregation (GCA-FML), which enhances both model aggregation and personalization. In GCA-FML, clients are grouped according to the similarity of their gradients, thereby reducing interference among users with different data distributions through gradient clustering. Additionally, we employ attention mechanisms to dynamically adjust users’ meta-learning rates, facilitating more personalized local models. Our experiments across multiple datasets demonstrate that GCA-FML outperforms other state-of-the-art methods in terms of accuracy and personalization. Notably, in Non-IID environments, GCA-FML significantly improves model convergence speed and personalization performance.
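The gradient-clustering step can be sketched as a simple cosine-similarity grouping; the greedy first-member comparison and the 0.5 threshold below are illustrative assumptions, not the authors' exact GCA-FML procedure:

```python
import numpy as np

def cluster_clients(grads, threshold=0.5):
    """Group client indices whose gradient directions are similar
    (cosine similarity >= `threshold`). Simplified sketch: each new
    client is compared against the first member of each cluster."""
    unit = [g / (np.linalg.norm(g) + 1e-12) for g in grads]
    clusters = []
    for i in range(len(grads)):
        for cl in clusters:
            if float(unit[i] @ unit[cl[0]]) >= threshold:
                cl.append(i)   # similar direction: join this cluster
                break
        else:
            clusters.append([i])  # no similar cluster: start a new one
    return clusters

# Two clients pulling one way, two pulling the opposite way,
# end up in two separate aggregation groups.
g = [np.array([1.0, 0.0]), np.array([0.9, 0.1]),
     np.array([-1.0, 0.0]), np.array([-0.8, -0.2])]
groups = cluster_clients(g)
print(groups)
```

Aggregating only within each group (e.g. averaging the member gradients) is what limits the interference between users with conflicting data distributions.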
FMS-YOLO: Lightweight High-Altitude Work Safety Belt Detection
Sangyu Yao, Fangfang Lu, Tong Zhou, Guxue Sun, Yijie Huang
Abstract: On construction sites, the high fatality rate from falls due to not wearing safety belts underscores the need for real-time detection of safety belt usage at heights. Current detection methods for high-altitude work face challenges such as low accuracy, high computational demands, and large model sizes, which hinder real-time application. To address these issues, this paper introduces a lightweight model, FMS-YOLO, which utilizes a fused mixed local channel attention (MLCA) mechanism. Key strategies include integrating the lightweight FasterBlock module into all C3 (Cross Stage Partial Network) modules to accelerate inference and reduce parameter count; incorporating MLCA into the neck for enhanced feature extraction and improved detection accuracy; and replacing the CIoU loss function with Shape-IoU, which focuses on the shape and aspect ratio of bounding boxes to enhance model robustness. Experimental results show that the proposed model achieves an mAP@0.5 of 96%, with only 5.4M parameters and a weight file of only 11.3 MB. Compared to the original model, the proposed model achieves a 3% improvement in mAP@0.5 and a reduction of 1.2M parameters. The model meets the real-time requirements for monitoring the safety belt usage of workers operating at heights in construction site surveillance videos.
Ink Animation Creation via Human-AI Collaboration
Youchun LiuAbstractThis paper focuses on the application and exploration of Artificial Intelligence-Generated Content (AIGC) in the creation of ink animation. As a uniquely Chinese art form, ink animation has undergone significant developmental stages, including the traditional hand-drawn period, the digital exploration phase, and the AIGC-driven era. With its efficient generation capabilities and diverse expressive potential, AIGC technology now provides new creative pathways for ink animation. However, when closed-source AI video generation tools such as Runway and Vidu are used directly, the generated ink-painting style is unsatisfactory, and it is difficult to maintain consistent characters and predefined movements. To solve these problems, this paper proposes a novel deep Human-AI Collaborative framework (HAIC) for creating ink animation. Specifically, it first employs an open-source generation tool for initial creation, then incorporates a sophisticated adjusting mechanism to improve the art style. The adjusting module generates high-quality creations in three steps: model training, generation process control, and consistency optimization. Subjective evaluations comparing the results generated by Runway, Vidu, and the proposed method demonstrate that the latter significantly enhances the ink-painting stylistic features, improves character consistency across clips, and stabilizes character movements. -
A Meta-space Architecture and Methods for Mobile Robot Inspection Digital Twin System
Xingdong Sheng, Yunhui Liu, Shijie Mao, Xiaokang YangAbstractMobile robot inspection has significant application value in the power and industrial sectors. This paper identifies several issues in traditional robot inspection systems, including robot navigation, task deployment, and defect/anomaly detection. A unified spatial map architecture called “Meta-space” is proposed, based on a high-fidelity reconstruction algorithm. This architecture enables modeling of the features, structure, appearance, and semantics of inspection scenes, along with corresponding implementation methods. Using this map architecture, a digital twin-based mobile robot inspection system is developed, and reference methods that address the key issues are proposed, including robot localization and navigation, virtual deployment of inspection tasks, and scene-based fine-tuning of the detection model. Finally, experimental results validate the effectiveness of the unified spatial map architecture and the associated methodologies.
-
-
Video Analysis
-
Frontmatter
-
Predict Pedestrian Flow in Open Street Environment
Zhenyang He, Yiling Zhao, Xiaozhong Zhang, Zhengyang Shi, Pingrui Lai, Hua YangAbstractPredicting pedestrian flow in open street environments presents substantial challenges due to the complex and dynamic nature of human movement. This paper proposes a novel model that integrates recurrent neural networks (RNN) with matrix factorization techniques to enhance temporal sequence prediction based on historical pedestrian flow data. Additionally, the model incorporates the topological structure of streets, making it adaptable to various urban environments and conditions. A specially designed encoder effectively captures nuanced pedestrian flow information, improving the training process and the model’s predictive capabilities. The implementation of a sliding window RNN framework further supports the dynamic analysis of crowd flow properties, including movement direction and anticipated pedestrian counts, by enabling real-time adaptation to fluctuating conditions. The proposed methodology is thoroughly evaluated on real-world datasets, demonstrating significant improvements over four baseline models and showing substantial promise for urban planning and the effective management of pedestrian traffic. -
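The sliding-window setup the abstract describes can be sketched as follows; the `window` and `horizon` parameters are illustrative assumptions, not details from the paper:

```python
import numpy as np

def sliding_windows(series, window, horizon=1):
    """Turn a 1-D pedestrian-count series into (input window, target) pairs
    for a sequence model: X[i] holds `window` consecutive past counts and
    y[i] the count `horizon` steps after the end of that window."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        X.append(series[start:start + window])
        y.append(series[start + window + horizon - 1])
    return np.stack(X), np.array(y)
```

Sliding the window forward as new counts arrive is what lets an RNN built on these pairs adapt its predictions to fluctuating conditions in real time.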
VNNet: A Deep Learning-Based System for Video Visual Style Classification
Yaxin Bai, Qinglan Wei, Li YangAbstractA video’s brightness, color, shot composition, and camera information are key visual elements that reflect the creator’s photographic style. To address the challenge of systematically extracting such complex visual features, we study how to automatically extract five types of visual style: brightness, color, camera stability, editing rhythm, and shot composition, and propose a multi-level Visual Narrative Network (VNNet) system. To analyze the style of different categories of videos and compare the differences in visual expression techniques among creators, we construct a Cinematic Style Dataset covering a variety of video categories, allowing us to examine the visual differences between works and the stylistic differences of individual creators. Experimental results show that VNNet’s tags on the Cinematic Style Dataset reach an overall consistency rate of 75.29% with manual analysis, demonstrating its effectiveness and practicality in video visual style classification. These findings not only provide a new perspective for the automatic analysis and categorization of video content but also offer technical support for future video understanding and content creation.
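Of the five style dimensions, brightness is the simplest to illustrate. A minimal sketch, assuming 8-bit RGB frames and the standard Rec. 601 luma weights (neither of which the abstract specifies), might look like:

```python
import numpy as np

def mean_brightness(frames):
    """Average Rec. 601 luma per frame for a batch of RGB frames with
    shape [num_frames, height, width, 3] and values in [0, 255]."""
    frames = np.asarray(frames, dtype=float)
    # Weighted sum of R, G, B channels approximates perceived brightness.
    luma = 0.299 * frames[..., 0] + 0.587 * frames[..., 1] + 0.114 * frames[..., 2]
    return luma.mean(axis=(1, 2))
```

Per-frame statistics like this (and their variation over time) are the kind of low-level signal a system such as VNNet could aggregate into a brightness-style tag.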
-
-
Backmatter
- Title
- Digital Multimedia Communications
- Edited by
-
Guangtao Zhai
Jun Zhou
Long Ye
Hua Yang
Ping An
- Copyright Year
- 2025
- Publisher
- Springer Nature Singapore
- Electronic ISBN
- 978-981-9642-76-2
- Print ISBN
- 978-981-9642-75-5
- DOI
- https://doi.org/10.1007/978-981-96-4276-2
The PDF files of this book do not fully conform to PDF/UA standards, but they offer limited screen-reader support, described non-text content (images, graphics), bookmarks for easy navigation, and searchable, selectable text. Users of assistive technologies may have difficulty navigating or interpreting the content of this document. We recognize the importance of accessibility and welcome inquiries about the accessibility of our products. For questions or accessibility needs, please contact us at accessibilitysupport@springernature.com