
2025 | Book

Digital Multimedia Communications

21st International Forum on Digital TV and Wireless Multimedia Communications, IFTC 2024, Hainan, China, November 28–29, 2024, Revised Selected Papers, Part II

Edited by: Guangtao Zhai, Jun Zhou, Long Ye, Hua Yang, Ping An

Publisher: Springer Nature Singapore

Book series: Communications in Computer and Information Science


About this book

This volume contains 27 selected papers presented at IFTC 2024: 21st International Forum of Digital Multimedia Communication, held in Lingshui, Hainan, China, during November 28–29, 2024. The 55 full papers in this two-volume set were carefully reviewed and selected from 146 submissions. They are organized in the following topical sections: CCIS 2441: Affective Computing, Graphics & Image Processing for Virtual Reality, Large Language Models, Multimedia Communication, Application of Deep Learning and Video Analysis. CCIS 2442: Human and Interactive Media, Image Processing, Quality Assessment and Source Coding.

Table of Contents

Frontmatter

Human and Interactive Media

Frontmatter
Rotation-Equivariant Human Motion Prediction via Quaternion Graph Convolutional Network
Abstract
Human motion prediction (HMP) is a crucial task in fields like human-computer interaction and autonomous driving, requiring accurate prediction of future motion from historical action data. However, most existing methods focus primarily on improving prediction accuracy, with much less attention paid to the robustness of models under varying viewpoints, a critical challenge in real-world applications. In this paper, we present the Rotation-Equivariant Quaternion Graph Convolutional Network (RE-QGCN), a novel human pose prediction model that achieves enhanced generalization and robustness owing to its rotation-equivariant properties. Building upon the traditional Graph Convolutional Network (GCN), we propose the Quaternion Graph Convolutional Network (QGCN), with S-QGCN and T-QGCN as specialized variants for the spatial and temporal dimensions respectively, enabling efficient extraction of spatio-temporal quaternion features. Our approach offers a more streamlined and compact design than other rotation-equivariant models. Experiments demonstrate that RE-QGCN achieves complete rotation-equivariance while maintaining competitive accuracy, thereby establishing a strong benchmark for future research on rotation-equivariant HMP models.
Yifei Zhang, Yin Hu, Jun Zhou, Yi Xu
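As a toy illustration of the rotation-equivariance property this abstract builds on, the sketch below (not the authors' code) rotates 3D joint coordinates with quaternions and checks that a plain graph aggregation commutes with the rotation; the joint count, adjacency matrix, and rotation axis are arbitrary assumptions.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def rotate(points, q):
    """Rotate Nx3 points by the unit quaternion q via q p q*."""
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return np.array([quat_mul(quat_mul(q, np.concatenate(([0.0], p))), q_conj)[1:]
                     for p in points])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                # 5 skeleton joints (xyz), made up
A = rng.random((5, 5)); A = (A + A.T) / 2  # arbitrary symmetric joint graph
q = np.array([np.cos(np.pi/6), 0.0, 0.0, np.sin(np.pi/6)])  # 60 deg about z

# Linear graph aggregation commutes with rotation -- the equivariance
# property a quaternion GCN is designed to preserve end to end.
assert np.allclose(rotate(A @ X, q), A @ rotate(X, q))
```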
HomeArena: A Playground for Household Appliance Intelligence Development and Evaluation
Abstract
The Whole-Home Automation System (WHAS) faces several constraints as an intelligent experimental environment, including a complex physical environment, poor flexibility, high maintenance cost, limited resources, a lack of diversity in appliances and scenarios, poor generalization of intelligent algorithms, and an inability to provide heuristic ideas. We propose HomeArena, a conceptual multi-agent simulation framework for household appliance intelligence development and evaluation, together with a simulation process comprising modeling the indoor environment, loading household appliance agents, laying out sensors, building bionic agents, and performing automated simulation. We also discuss foundational design principles and implementation methods for the various agents. Additionally, we propose a demo architecture of HomeArena, suggesting that a microkernel communication architecture based on ZeroMQ may be a satisfactory solution.
Xu Wu, Ruixun Kong, Chuchu Dong, Jun Zhou
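The abstract above suggests a microkernel communication architecture based on ZeroMQ. The following sketch is only an assumption about how such a bus could look with pyzmq, using an illustrative topic name and port; it is not the HomeArena implementation.

```python
import zmq

def appliance_agent(port: int = 5556) -> None:
    """A toy household-appliance agent publishing its state on a PUB socket."""
    ctx = zmq.Context.instance()
    pub = ctx.socket(zmq.PUB)
    pub.bind(f"tcp://*:{port}")
    # Topic prefix + payload; e.g. an air conditioner reporting its state.
    pub.send_string("aircon/livingroom 23.5 cooling")

def simulation_kernel(port: int = 5556) -> None:
    """The microkernel side: subscribe to appliance topics and collect events."""
    ctx = zmq.Context.instance()
    sub = ctx.socket(zmq.SUB)
    sub.connect(f"tcp://localhost:{port}")
    sub.setsockopt_string(zmq.SUBSCRIBE, "aircon/")  # filter by topic prefix
    print("kernel received:", sub.recv_string())

# In practice the subscriber must be connected before agents publish
# (ZeroMQ's "slow joiner" behaviour), so the kernel process would start first.
```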
Virtual Digital Intelligence in Broadcasting Television and Online Audiovisual Fields: Applications and Risks
Abstract
This paper comprehensively analyzes the current application status and development trends of virtual digital intelligence in broadcasting, television, and online audio-visual fields. It first reviews the origin, development, and key technologies of virtual digital intelligence, including character construction, voice generation, perception recognition, and analytical decision-making. Furthermore, it systematically analyzes the application of digital intelligence in key areas such as news broadcasting, media hosting, and e-commerce live streaming, emphasizing their significant contribution to enhancing user experience and increasing industry efficiency. Finally, this paper points out the risks associated with technological applications, including copyright infringement, reputation disputes, criminal risks, and ethical issues, highlighting the importance of developing and improving relevant laws, regulations, and ethical standards to promote the sustainable and healthy development of virtual intelligent being technology.
Xiaoling Zhu, Ran Bai, Xiaoye Ouyang, Xiaocheng Hu, Liu Yuan
Research on Cross-Modal Recommendation System Based on Deep Neural Network
Abstract
With the advancement of global digitization, the Internet multimedia industry has maintained steady and rapid development. Mainstream media platforms such as music and video websites, social networking software, and major audio/video rating databases have become an inseparable part of people's daily life. For example, people upload homemade videos to video websites or watch platform-recommended content on audio/video rating websites. In these two scenarios, however, users face two problems: (1) it is hard to find suitable background music when editing videos; (2) the recommendation mode of audio/video database platforms is too homogeneous. This paper addresses these specific scenarios, taking the video-music matching problem as its entry point and proposing a corresponding algorithm model. To improve the recommendation mechanism of multimedia websites, a cross-domain audio-video recommendation model, KATLN, is proposed by integrating a joint attention mechanism with an adaptive adversarial network. Experimental tests demonstrate the effectiveness and flexibility of the model, and cross-domain recommendation experiments show that it outperforms current mainstream cross-domain recommendation algorithms.
Yan Gao, Xiaobing Li, Jingge Zhao, Yuan Zhang, Yun Tie
MIMCN: Multi-Interest Modeling with Capsule Network for News Recommendation
Abstract
In the era of information overload, personalized news recommendation methods are essential to help people find news they are interested in. Recognizing users' multiple interests in news reading is one of the main purposes of news recommender systems. However, most existing methods use a unified vector to represent a user's interests without extracting and analysing the user's multiple interests. This leads to incomplete modeling of user interests and impedes the performance of news recommendation. In this paper, we propose a model named MIMCN (Multi-Interest Modeling with Capsule Network) for news recommendation. MIMCN mainly contains a news encoder and a user encoder for multi-interest recommendation. For the news encoder, we introduce a multi-view design that extracts different aspects of features from each news article. For the user encoder, we introduce a multi-interest modeling method using a customized capsule network to learn users' multiple interests from their browsing history. We also design a multi-interest aggregation module to suppress noise such as misclicks. Experiments on two real-world news recommendation datasets show that our model achieves better performance than the compared models.
Guotong Di, Zhiye Chen, Yongkang Guo, Chuanzhen Li, George Wang
FastTalker: Co-Speech Gesture Generation via Fast-Order Diffusion ODE Solver
Abstract
Existing research on co-speech gesture generation is predominantly grounded in high-quality gestures produced by diffusion models. Although diffusion-based gesture models with numerous sampling steps can achieve high accuracy, designing a fast generation method for resource-constrained scenarios remains a significant challenge. To mitigate the sampling burden, in this paper we propose FastTalker, a high-speed generation method for co-speech gestures that reduces the number of sampling steps while preserving accuracy. The proposed FastTalker consists of two key components: a gesture-aware latent diffusion module and a fast-order diffusion ODE solver for guided sampling. Motivated by the consistency between speech and gesture, we design a transformer-based gesture-aware latent diffusion module that learns from speech and speech text and can efficiently generate the corresponding gestures from speech. To achieve high-speed generation, we introduce an effective sampling mechanism that takes sampling quality into account. Specifically, a fast-order diffusion ODE solver is employed for guided sampling, enabling the gesture diffusion module to use a large sampling step size and thus fewer steps. Experimental results indicate that FastTalker accelerates generation by a factor of 10 compared to the baseline without compromising motion fidelity. Ablation studies further show the significance of fast sampling in high-speed gesture generation.
Xiaoying Huang, Sanyi Zhang, Binjie Liu, Xiaoxuan Guo, Long Ye
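The abstract does not specify the exact fast-order ODE solver, so the sketch below uses a generic deterministic DDIM-style update merely to illustrate the idea of sampling with a few widely spaced steps instead of many small ones; the noise schedule and the stand-in `model` are placeholders, not the paper's components.

```python
import torch

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """One deterministic DDIM-style update: estimate x0 from the predicted
    noise, then re-noise it to the previous (less noisy) level."""
    x0 = (x_t - (1 - a_bar_t).sqrt() * eps_pred) / a_bar_t.sqrt()
    return a_bar_prev.sqrt() * x0 + (1 - a_bar_prev).sqrt() * eps_pred

model = lambda x, t: torch.zeros_like(x)        # stand-in for the gesture denoiser
alphas_bar = torch.linspace(0.999, 0.01, 1000)  # toy cumulative noise schedule
steps = torch.linspace(999, 0, 10).long()       # 10 widely spaced steps, not 1000

x = torch.randn(1, 64)                          # latent gesture code
for t, prev in zip(steps[:-1], steps[1:]):
    x = ddim_step(x, model(x, t), alphas_bar[t], alphas_bar[prev])
```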
Multiplayer Interaction Feature Extraction for Skeleton-Based Action Recognition
Abstract
Existing human action recognition methods seldom consider multiplayer interaction in realistic scenarios, which makes them insufficiently robust for recognizing actions that involve multiple interacting people. To mitigate this issue, we propose a robust skeleton-based action recognition method built on graph convolution and specifically designed to handle multiplayer interactive actions. Existing methods extract the action features of each person independently and then pool and fuse these independent features, which cannot effectively capture the interaction features among multiple players and also leads to information loss. In our approach, we construct multiplayer interaction adaptive graphs to effectively establish connections between different entities in multiplayer interactive actions. This enables the network to gather information about the interactions occurring among multiple entities. The experimental results show that our method achieves better performance on the NTU-RGBD dataset, with recognition accuracy improved by 4.1% and 5.5% on the CS and CV benchmarks, respectively, compared with the baseline method.
Wuzhen Shi, Youwei Pan, Dan Li, Yang Wen, Yutao Liu
Bidirectional Information Fusion Time Series Transformer for Telecom Fraud Detection
Abstract
Telecom fraud harms users, telecom operators, and even the stability of society. Detecting telecom fraud users with automated algorithms can prevent fraud incidents before they occur. However, the complexity, variability, and concealment of telecom fraud behaviors pose significant challenges to detection algorithms. In response to these challenges, we propose a new multivariate time series classification model, the Bidirectional Information Fusion Time Series Transformer (BIFTST), for high-performance telecom fraud detection. We utilize multiscale patches and the transformer to capture multiscale features, employ a segmentation module to divide the time series into segments, use a bidirectional cross-attention fusion module to effectively learn from dynamic and static data, and integrate multiscale features using a gating fusion network. Additionally, we collected a telecom fraud dataset that includes users' multivariate time series data and static basic information, on which we trained the model and conducted related experiments. The experimental results show that our method achieves excellent results on this dataset, and the effectiveness of each module is verified through ablation experiments.
Shanzhi Jiang, Junhao Liu, Bin Lin, Zhaoqiang Cui, Yusheng Gao, Jie Sun, Zhi Liu
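As a rough illustration of gating-style fusion of dynamic (time-series) and static (profile) features mentioned above, here is a minimal sigmoid-gated blend in PyTorch; the layer sizes and the gate design are assumptions for illustration, not the paper's BIFTST modules.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sigmoid-gated blend of two feature vectors (e.g. dynamic vs. static)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1 - g) * b

fusion = GatedFusion(dim=64)
dynamic = torch.randn(8, 64)     # e.g. features from the time-series branch
static = torch.randn(8, 64)      # e.g. features from static user attributes
fused = fusion(dynamic, static)  # shape (8, 64)
```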
SSSGT: Silent Spiral Sparse Graph Transformer for Social Bots
Abstract
The Spiral of Silence theory posits that individuals who hold controversial or unpopular views often hesitate to express their opinions publicly. In modern online social networks, social bots have emerged as influential agents, mimicking human behavior and exerting a profound impact on public opinion dynamics. Consequently, detecting social bots and studying their effects on opinion formation have become increasingly critical research areas. In this paper, we propose the Silent Spiral Sparse Graph Transformer (SSSGT), a novel and scalable method for detecting social bots and simulating opinion dynamics in complex social networks. Addressing key limitations of existing bot detection methods, SSSGT incorporates innovative graph sparsification techniques that reduce graph complexity while retaining essential structural properties. To further enhance bot detection and opinion dynamics simulation, we propose a Sparse Time Attention mechanism, enabling efficient tracking of temporal interactions in sparse graphs. Additionally, we present a Hyper Kernel Attention layer for graph transformers, which improves computational efficiency without compromising performance. Extensive experiments across multiple benchmark datasets demonstrate that SSSGT consistently achieves competitive results.
Shan Liu, Zheng He, Guoli Yang
USIAL-VC: A One-Shot Voice Conversion by U-Net-Based Encoder and Speaker Identity Adaptive Learning
Abstract
Voice conversion (VC) is an audio processing technology that converts the source voice into the target voice of another speaker without changing the linguistic content. There is still a significant gap between the target and converted voices in terms of voice quality and speaker similarity in one-shot VC, which remains a challenging issue to address. Feature disentanglement models have been previously used to separate speaker and audio content information. However, achieving effective disentanglement is challenging, which limits the practical applicability of these models. This paper proposes USIAL-VC, an improved U-net-based encoder and Speaker Identity Adaptive Learning model. The encoder of USIAL-VC employs a U-net architecture for down-sampling. The bottleneck layer of the model combines instance normalization and vector quantization to filter out speaker identity. Additionally, the residual information following the bottleneck layer is leveraged to adaptively learn the true speaker identity. Both objective and subjective results demonstrate that the proposed approach effectively captures the characteristics of the target speaker while preserving audio quality. Furthermore, in one-shot VC, the proposed method maintains strong performance in both audio quality and speaker similarity compared to other state-of-the-art VC models, even with only 30 s of target speech.
Yujiang Peng, Yutian Wang
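The bottleneck described above combines instance normalization with vector quantization. Below is a minimal sketch of that combination, assuming a simple nearest-neighbour codebook lookup and omitting training details such as the straight-through gradient and codebook losses; the dimensions and codebook size are illustrative.

```python
import torch
import torch.nn as nn

class INVQBottleneck(nn.Module):
    """Instance-normalize content features, then snap each frame vector to its
    nearest codebook entry; gradient handling (straight-through) is omitted."""
    def __init__(self, dim: int, codebook_size: int = 256):
        super().__init__()
        self.norm = nn.InstanceNorm1d(dim, affine=False)  # strips speaker statistics
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, time)
        x = self.norm(x)
        flat = x.transpose(1, 2).reshape(-1, x.size(1))    # (batch*time, dim)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        q = self.codebook(idx).view(x.size(0), x.size(2), x.size(1))
        return q.transpose(1, 2)                           # (batch, dim, time)

bottleneck = INVQBottleneck(dim=192)
quantized = bottleneck(torch.randn(4, 192, 100))   # 4 utterances, 100 frames each
```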
PolyMotion-7K: A Multimodal-Driven Polyglot Avatar Motion Dataset
Abstract
The potential of multimodal-driven multilingual avatar motion generation for cross-cultural communication and low-resource language processing is becoming increasingly prominent. However, the lack of high-quality datasets covering multiple languages and modalities restricts the adaptability of avatar motion generation to linguistic and cultural diversity. To address this limitation, we present PolyMotion-7K, a comprehensive multimodal-driven multilingual avatar motion dataset designed to enhance motion diversity and detail in avatars within multilingual contexts. It encompasses over 7,000 languages, each represented by more than five hours of upper-body speech video material, collected from diverse sources to ensure cross-cultural applicability. In constructing this dataset, we employed the SHOW model for skeletal parameter initialization, incorporated the MediaPipe module to optimize joint poses, used DeepLabV3 for foreground segmentation, and applied Dreamwaltz-G to enhance visual quality, achieving high fidelity in fine-grained motion and expression rendering. To validate the effectiveness of PolyMotion-7K, we conducted an application experiment in the multilingual dissemination of intangible cultural heritage along the Beijing Central Axis, producing multilingual avatar-based introductory videos that narrate the cultural background of the heritage sites. Experimental results demonstrate that PolyMotion-7K supports high-quality multilingual avatar generation, highlighting its broad potential for cross-cultural communication and digital heritage preservation.
Ming Meng, Xiaoping Hou, Yisheng Wang, Hanwen Liu, Yufei Zhao, Qin Yuan
MGTR-Avatar: Multi-scale Gaussian Triplane Representation for High-Fidelity 3D Facial Model Reconstruction from a Monocular Video
Abstract
Creating head avatar animations from monocular portrait videos is key to bridging virtual and real worlds. Existing methods like Explicit 3D Deformable Mesh (3DMM), neural implicit representations, and point clouds have advanced this field but face limitations: 3DMMs often produce overly smooth shapes due to fixed topologies, and neural implicit models require long training times with limited animation capacity. To address these limitations, we propose MGTR-Avatar, which combines animated 3D Gaussians with a parametric face model for photo-realistic avatars. Using FLAME as the initial point cloud, we convert model points into 3D Gaussian primitives and design a multi-scale triplane feature encoder with hybrid attention to capture detailed facial features. MGTR-Avatar captures diverse expressions and views, supporting real-time rendering and leveraging geometric priors for efficient training. Extensive experiments on the INSTA dataset show that MGTR-Avatar outperforms existing methods in both quality and speed.
Xinyuan Wen, Jiaxin Lin, Xiaochun Mai, Daquan Feng

Image Processing

Frontmatter
Learning Subimage-Adaptive Convolution Block for Real-Time Single Image Super-Resolution
Abstract
Most existing single image super-resolution (SR) methods apply the same fixed convolution kernels to widely varied image content, which is often suboptimal for different image regions. To address this problem, we propose a novel building block called the Subimage-Adaptive Convolution Block (SACB), which generates spatially variant convolution kernels for HR image reconstruction by leveraging prior knowledge extracted from diverse subimages. In the SACB, multiple base convolution kernels progressively capture distinct hints from different subimages through parallel optimization during training, and can be re-parameterized into a single spatially variant kernel at inference time to maintain decent efficiency. In essence, the block uses predicted, adaptive linear combination coefficients of the base kernels for SR reconstruction. Based on the SACB, a unified solution pipeline named SACBSR is proposed. It consists of a shallow prediction module that dynamically generates different fusing coefficients for varying subimages, and an SR module made of a series of SACBs for adaptive SR reconstruction of the input image. Furthermore, the proposed SACB is broadly applicable across various CNN-based SR methods, with enhanced image quality and promising computational efficiency. Extensive experiments on six benchmark datasets demonstrate the effectiveness and efficiency of SACB and SACBSR.
Taiheng Ye, Rui Zhang, Yi Xu
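The re-parameterization idea in the abstract, folding several base kernels into one kernel via predicted combination coefficients, can be illustrated as follows; the kernel count, channel sizes, and softmax coefficients are placeholders rather than the actual SACB design.

```python
import torch
import torch.nn.functional as F

def adaptive_conv(x, base_kernels, coeffs):
    """Fold K base kernels into one kernel with per-subimage coefficients,
    then run a single ordinary convolution.
    base_kernels: (K, out_c, in_c, kh, kw); coeffs: (K,)."""
    kernel = (coeffs.view(-1, 1, 1, 1, 1) * base_kernels).sum(dim=0)
    return F.conv2d(x, kernel, padding=kernel.shape[-1] // 2)

x = torch.randn(1, 3, 32, 32)                   # a low-resolution subimage
bases = torch.randn(4, 16, 3, 3, 3)             # 4 base 3x3 kernels, 3 -> 16 channels
coeffs = torch.softmax(torch.randn(4), dim=0)   # stand-in for predicted coefficients
y = adaptive_conv(x, bases, coeffs)             # (1, 16, 32, 32)
```

Because the weighted sum is computed before the convolution, inference costs the same as a single ordinary convolution regardless of how many base kernels were trained.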
Light Field Image Super-Resolution Network Based on Attention Mechanism
Abstract
The inherent trade-off between spatial and angular resolution in light field (LF) imaging makes LF image super-resolution (SR) necessary. While existing methods have validated that combining spatial and angular information can significantly improve LF image SR, they often fail to effectively merge these two components or to model global relationships among sub-aperture images, limiting reconstruction quality. In this paper, we propose LF-ATnet, a novel super-resolution network based on attention mechanisms. Our approach introduces an angular incorporation module and a spatial-angular locally-enhanced self-attention module to capture local angular and global spatial features, respectively. A dual-branch structure is employed to facilitate efficient feature fusion and interaction. In the reconstruction phase, channel attention blocks and stacked multi-distillation blocks are applied to hierarchically modulate features, ensuring the recovery of fine details. Experimental results on public datasets demonstrate that LF-ATnet achieves superior performance in both visual quality and quantitative metrics compared to existing methods. Our method effectively combines spatial and angular information and reconstructs high-resolution LF images with rich textures.
Chenhao Han, Shixu Ying, Shubo Zhou, Yi Yang, Xiaoming Ding, Xue-Qin Jiang
A Quantitative Method for Visual Appearance Recognition of Swollen Eyes Based on 3D Information of Eye-Related Key Points
Abstract
Background: Computed tomography and magnetic resonance imaging are the classic medical methods for describing the structure and morphology of the orbit, but they are impractical for visual appearance assessment in aesthetics, facial expression analysis, or psychophysical evaluation, where convenience, speed, and portability matter. Objective: To introduce a rapid, quantifiable, and reproducible method for visual appearance recognition of swollen eyes. Methods: About 2500 mesh vertices were extracted from 3D facial information collected by a 3D structured-light scanner to establish the soft-tissue key point set of the orbit (SKO). Data from 72 volunteers with or without swollen eyes (47 and 25 cases, respectively) were used to establish the SKO regression equation for visual appearance judgment (52 cases for modeling, 20 cases for external verification). Results: A linear regression to recognize swollen eyes based on the SKO was established. The model weights were w = [47.85693945, 2.64343121, 79.23895416, −0.37536196], with an accuracy of 90% on the test set. Conclusion: We propose a quantitative method for visual appearance recognition of swollen eyes based on the 3D information of eye-related key points. The quantitative evaluation based on the SKO provides a rapid, quantifiable, and reproducible way to recognize swollen eyes for cosmetic, physical, or psychophysical evaluation.
Min Zhou, Yiyan Yang, Guangtao Zhai, Xuefei Song
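To show only the mechanics implied by the reported weights, the sketch below scores a 4-dimensional SKO feature vector with w and thresholds the result; the feature values, the decision threshold, and the treatment of any intercept are assumptions, since the abstract does not define them.

```python
import numpy as np

# Weights reported in the abstract; the feature values and threshold below are
# invented purely to show the scoring step and are not from the paper.
w = np.array([47.85693945, 2.64343121, 79.23895416, -0.37536196])

def predict_swollen(features: np.ndarray, threshold: float = 0.5) -> bool:
    """Linear score of a 4-dimensional SKO feature vector, thresholded."""
    return float(features @ w) > threshold

sample = np.array([0.012, 0.034, 0.008, 1.0])   # placeholder SKO feature values
print(predict_swollen(sample))
```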
Siamese Dual-Stage Network with Hierarchical Fusion for Remote Sensing Image Dehazing
Abstract
Remote sensing image dehazing is a key technology for improving the quality and visibility of images acquired from aerial or satellite platforms. It is widely used in fields such as environmental monitoring, urban planning, disaster management, and military reconnaissance. However, existing single-stage and multi-stage dehazing methods have limitations in processing complex scenes, recovering detail, and removing haze. To this end, this paper proposes a novel Siamese Dual-stage Adaptive Dehazing Network (SDAD-Net). By introducing a Siamese sub-network with partially shared weights, the two sub-networks can learn dehazing knowledge from each other at different stages, enhancing the ability to constrain hazy areas. The two sub-networks perform different degrees of dehazing on the image in two stages; the output of the first stage provides a prior for the second stage and improves the image reconstruction. The network uses a Hierarchical Residual Fusion Module (HRFM) to perform multi-scale information fusion, providing richer information for image dehazing. Experimental results show that SDAD-Net outperforms existing dehazing methods on public remote sensing image datasets, especially in detail preservation and complex scene processing.
Jing Liu, Xin Lin, Junying Gao, Changhong He, Tao Yi, Xiangcheng Wan
CS2DMNet: Color Space Feature Interaction and Dual-Domain Multi-Scale Collaboration Network for Low-Light Image Enhancement
Abstract
Low-light images often exhibit low brightness, low contrast, high noise, and color distortion, which significantly impair visual perception. To improve image quality, existing low-light image enhancement networks typically consider feature extraction in only a single color space and overlook the powerful feature representation ability of the wavelet transform. To this end, we propose a Color Space Feature Interaction and Dual-Domain Multi-Scale Collaboration Network (CS2DMNet) to enhance low-light images so that the quality of the enhanced image is in line with human visual perception. Firstly, considering that an image has intrinsic characteristics in both the HSV and RGB color spaces, interactive guidance between them is proposed to improve color recovery in the RGB space while mitigating noise amplification in the HSV space. Secondly, unlike previous methods that use the simple Haar wavelet transform, we introduce the Dual-Tree Complex Wavelet Transform (DTCWT) to achieve simultaneous spatial-domain and frequency-domain feature enhancement for brightness magnification, color distortion correction, and texture restoration. Thirdly, an Adaptive Threshold Adjustment (ATA) block is proposed to reduce noise in the high-frequency components of the DTCWT decomposition. Extensive experiments on publicly available datasets show that CS2DMNet surpasses state-of-the-art methods, yielding excellent visual results in color recovery and dark detail enhancement.
Wei Zheng, Lijun Zhao, Anhong Wang, Jinzhu Guo
A Near-Infrared Vein Image Semantic Segmentation and Localization Method Based on Dual-Branch Information Fusion
Abstract
With the widespread application of near-infrared imaging technology in medical and industrial fields, accurate semantic segmentation has become crucial. However, existing methods still struggle with segmentation accuracy under complex backgrounds and low-contrast conditions. To address this issue, this paper proposes a near-infrared semantic segmentation method based on dual-branch structures for semantic and detail information extraction. First, the model’s feature extraction capability is enhanced by incorporating depthwise separable convolutions and establishing a dual-branch architecture to separately capture semantic and detail information. Second, a multi-scale information fusion strategy is employed to integrate features from different levels to improve segmentation accuracy. Finally, the proposed method was systematically evaluated on multiple publicly available datasets, achieving a 5.95% improvement in the Dice coefficient and an average 2.3% increase in Recall compared to existing methods, highlighting its robustness and effectiveness in practical applications.
Gangyi Tian, Wen Ji
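Depthwise separable convolution, mentioned above as the means of lightening the dual-branch encoder, factorizes a standard convolution into a per-channel depthwise step and a 1x1 pointwise step; a generic PyTorch version (not the paper's exact block) looks like this:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv (one filter per channel) followed by a pointwise
    1x1 conv -- far fewer parameters than a full 3x3 convolution."""
    def __init__(self, in_c: int, out_c: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_c, in_c, 3, padding=1, groups=in_c)
        self.pointwise = nn.Conv2d(in_c, out_c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

block = DepthwiseSeparableConv(32, 64)
out = block(torch.randn(1, 32, 128, 128))   # (1, 64, 128, 128)
```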
Remote Photoplethysmography Signal Measurement from Facial Videos Based on Enhanced Hybrid Convolutional Neural Network with Waveform Consistency Loss Function
Abstract
Recent advancements in remote photoplethysmography (rPPG) have enabled the extraction of blood volume pulses (BVP) from facial videos, facilitating the measurement of vital physiological indicators such as heart rate, blood oxygen, and blood pressure. However, accurate blood pressure measurement requires more than proximity to the true frequency; it also necessitates improving the precision of the rPPG waveform features predicted from facial videos. This paper introduces an enhanced hybrid convolutional neural network (CNN) model featuring a dual-branch architecture, augmented by a skin color difference amplification module. By leveraging a designed waveform consistency loss function during training, the proposed model substantially enhances the accuracy of rPPG signal predictions. Our research aims to furnish a high-precision rPPG signal as a dependable prerequisite for subsequent physiological measurements, notably blood pressure. Evaluation on the UBFC-RPPG and V4V datasets demonstrates the model's adeptness at capturing rPPG signals that closely resemble the labeled signals. Notably, the resulting signals exhibit mean absolute errors of 1.14 and 2.74, respectively, in heart rate measurement, underscoring the model's robust generalization capability.
Fang Meng, Xin Pan, Tingfeng Huang
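The abstract does not give the exact form of the waveform consistency loss; a common choice in rPPG work is a negative Pearson correlation between predicted and labeled BVP waveforms, sketched below as one plausible reading rather than the paper's definition.

```python
import torch

def neg_pearson_loss(pred: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """1 - Pearson correlation between predicted and labeled BVP waveforms,
    averaged over the batch; both tensors have shape (batch, time)."""
    pred = pred - pred.mean(dim=1, keepdim=True)
    label = label - label.mean(dim=1, keepdim=True)
    corr = (pred * label).sum(dim=1) / (pred.norm(dim=1) * label.norm(dim=1) + 1e-8)
    return (1.0 - corr).mean()

loss = neg_pearson_loss(torch.randn(4, 300), torch.randn(4, 300))
```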
Lightweight Spatio-Temporal Attention Network for Video Super-Resolution
Abstract
Efficiently utilizing the spatio-temporal information of consecutive frames is the core of lightweight video super-resolution. Many researchers employ alignment or hidden states to aggregate spatio-temporal information. However, these methods are somewhat crude in collecting and optimizing spatio-temporal features, which reduces the information utilization and reconstruction capabilities of the network. To alleviate these problems, we propose a lightweight spatio-temporal attention network. We design a frame selection and spatio-temporal attention module, which can effectively filter, collect, and fuse inter-frame information. Moreover, a backward spatial fusion module and a forward spatial fusion module are proposed to aggregate spatio-temporal information over long distances. Meanwhile, we design a spatial supplementation module to enhance the optimization and reconstruction capabilities of the network. Experiments indicate that our model achieves state-of-the-art performance on multiple datasets.
Guofang Li, Yonggui Zhu, Zijun Zhao

Quality Assessment

Frontmatter
Video QoE Modeling by Spatial-Temporal Resolutions
Abstract
Issues related to Quality of Experience (QoE) have been attracting increasing attention in video transmission, as QoE is considered capable of subjectively measuring the overall experience of video applications and services. At the user end, the most intuitive and easily measured effect is the intrinsic video quality. Conventional methods inherently neglect the influence of spatial resolution (i.e., frame size) and temporal resolution (i.e., frame rate) in their computation. To estimate this influence, researchers have developed several empirical models by summarizing the results of subjective tests. Although these studies have measured what the influence is, they have not yet addressed two other important issues: (1) why the influence exists from the perspective of human visual perception; and (2) how the influence relates to diverse natural video representation forms. In this research, we exploit human visual perception and natural video characteristics to provide a credible explanation of these issues. Through analysis and derivation, we propose an analytical model to illustrate the influence of spatial-temporal resolutions on video QoE modeling and characterization. This model is consistent with several existing empirical models and can also be verified against the scores obtained from a comprehensive subjective test. In addition, the proposed model lends itself to broader applications in video and image services without requiring a holistic QoE measure.
Nadir Mustafa A. Mohamed, RunJie Wang, Ying Fang
An End-to-End Full-Reference and No-Reference Quality Assessment Model for 360 VR Videos
Abstract
Evaluating the perceptual quality of omnidirectional videos is crucial for optimizing virtual reality (VR) experiences, as these experiences rely on immersing viewers in a 360° environment. Unlike conventional planar videos, 360° VR videos require users to perceive viewport-based content through head-mounted displays (HMDs), where they can look around freely, creating unique quality assessment challenges. Traditional video quality assessment methods are often insufficient as they do not account for the interactive and immersive nature of VR. To address this, we develop a unified model that supports both full-reference (FR) and no-reference (NR) quality assessment methods for 360° videos. Our model includes three main modules: a feature extraction module, a quality regression module, and a temporal pooling module. The feature extraction module utilizes a two-stream structure that examines the spatial degradation and motion characteristics of the video, capturing nuances specific to 360° content. The quality regression and temporal pooling modules are designed to simulate the human visual system, providing accurate predicted quality scores. When tested on the VQA-ODV dataset, our approach demonstrates superior performance compared to other state-of-the-art FR and NR methods, highlighting its potential to improve VR content quality and enhance user experiences.
Dun Pei, Ziheng Jia, Wei Sun, Huiyu Duan, Xiongkuo Min
Do High Metrics Equal Enhanced Cognitive Performance? Exploring Objective and Subjective Assessments in Digital Human Quality
Abstract
Recent advancements in computer vision have driven the development of speech-driven 2D digital human lip-sync animation technology, which is now widely applied in fields such as animation, virtual idols, and online education. However, despite significant investments in improving relevant technical metrics, one crucial question remains underexplored: whether these optimizations truly enhance user cognition, as humans are the ultimate audience. To address this, we conducted a series of experiments by presenting digital human videos with varying technical quality levels and recording participants’ performance in information acquisition and cognitive tasks. Our goal was to assess whether higher technical metrics lead to better user outcomes. The results indicate that while higher technical metrics do improve information retention and visual clarity, they do not necessarily enhance subjective experience or learning outcomes, exhibiting a “plateau effect.” This finding highlights the need to integrate both objective and subjective evaluation criteria when optimizing speech-driven lip synchronization technology, particularly for educational contexts, and offers valuable guidance for the future development and optimization of this technology.
Jie Wang, Di Zhang, Qi Wu, CaiWei Huang, Hong Tan
Quality Assessment Indicators and Method of High-Resolution Space-Borne SAR Systems
Abstract
Space-borne Synthetic Aperture Radar (SAR) has emerged as a pivotal technology in remote sensing, offering high-resolution radar images irrespective of weather conditions and daylight availability. The quality of SAR images is crucial for their application in various domains, including disaster monitoring, target identification, and military reconnaissance. This paper discusses the main quantitative indicators of SAR image quality, including their definitions, physical meanings, and practical measurement methods. It delves into the fundamental principles of SAR technology, outlines the key quality metrics, and discusses both subjective and objective assessment methods. The paper concludes with a discussion of emerging trends and future directions for SAR image quality assessment.
Xin Lin, Tao Yi, Xiangdong Li, Junying Gao, Xiangcheng Wan, Kaizhi Wang
Objective Evaluation of Ambisonics Recording Performance Using Arbitrary Microphone Arrays
Abstract
The recording technology for panoramic audio and video is gradually penetrating the consumer electronics market, and the technology for spatial audio recording using consumer electronics devices is receiving increasing attention. To accommodate different device form factors, there is a growing need for spatial audio recording techniques based on arbitrary geometric microphone arrays. This paper focuses on scene-based spatial audio, specifically Ambisonics technology, due to its significant advantages in emerging application scenarios such as panoramic audio and video, as well as extended reality. However, its high equipment requirements make arbitrary-array-based Ambisonics recording technology a challenging task. This study provides an overview of existing arbitrary-array-based Ambisonics recording techniques. Meanwhile, this study emphasizes the spatial characteristics of spatial audio, decoupling them and proposing corresponding objective metrics for evaluating the quality of Ambisonics audio. In a more realistic simulation environment, we implement the current mainstream methods and compare their algorithmic performance using the proposed metrics, offering meaningful references for research on arbitrary-array-based Ambisonics recording.
Congyu Huang, Zhiyu Li, Chuhan Qiu, Jing Wang, MaoShen Jia
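For readers unfamiliar with scene-based spatial audio, first-order Ambisonics encodes a mono source into four B-format channels from its direction of arrival. The sketch below uses the AmbiX (ACN/SN3D) convention and is background material under that assumption, not the evaluation method or arbitrary-array recording technique discussed above.

```python
import numpy as np

def foa_encode(signal, azimuth, elevation):
    """Encode a mono plane-wave source into first-order B-format channels
    (ACN order W, Y, Z, X with SN3D normalization); angles in radians."""
    w = signal * 1.0
    y = signal * np.sin(azimuth) * np.cos(elevation)
    z = signal * np.sin(elevation)
    x = signal * np.cos(azimuth) * np.cos(elevation)
    return np.stack([w, y, z, x])

sig = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)      # 1 s of 440 Hz
bformat = foa_encode(sig, azimuth=np.pi / 4, elevation=0.0)   # shape (4, 48000)
```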

Source Coding

Frontmatter
Frame-Level Complexity Control for Practical Encoder x265
Abstract
The rapid evolution of video coding technologies has led to a substantial increase in encoding complexity, posing significant challenges for the practical deployment of encoders in various applications. Accurate control over encoding time is crucial for maintaining performance across diverse scenarios, but there are limited solutions available to address this requirement. This paper presents a novel neural network-based approach for frame-level complexity control within practical video encoders, facilitating the adaptive distribution of encoding resources and the selection of appropriate coding presets for individual frames. We have developed a modified version of the x265 encoder that supports dynamic frame-level preset adjustments. The proposed complexity allocation and feedback mechanism effectively regulate the encoding budget for each frame, ensuring optimal utilization of resources. Additionally, a lightweight neural network model is introduced to predict and select the most suitable coding preset based on the target encoding time. Extensive experiments conducted on Class B and UVG datasets have demonstrated the effectiveness of our scheme, with the average control error maintained at a remarkably low level of 1.9%. This research offers a promising solution for achieving precise control over encoding time, thereby enhancing the efficiency and adaptability of video encoders in real-world applications.
Yan Wang, Jiangchuan Li, Guo Lu
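A simplified view of the frame-level budget feedback described above: divide the remaining time budget evenly over the remaining frames and pick the slowest preset whose predicted encoding time still fits. The `predict_time` and `encode` callables below are stand-ins for the paper's neural predictor and modified x265, so this is a sketch of the control loop rather than the proposed scheme itself.

```python
PRESETS = ["ultrafast", "veryfast", "medium", "slow"]   # x265 presets, fast -> slow

def encode_sequence(frames, total_budget, predict_time, encode):
    """Allocate the remaining budget evenly over the remaining frames and pick
    the slowest preset whose predicted time fits; `encode` returns seconds used."""
    spent = 0.0
    for i, frame in enumerate(frames):
        per_frame = (total_budget - spent) / (len(frames) - i)
        choice = PRESETS[0]
        for preset in PRESETS:
            if predict_time(preset, frame) <= per_frame:
                choice = preset              # keep the slowest preset that still fits
        spent += encode(frame, choice)       # feedback: actual cost updates the budget
    return spent
```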
A Human-Computer-Friendly Scalable Image Coding Scheme Based on the Canny Edge Detection Algorithm
Abstract
Image encoding technology has long been one of the most fundamental technologies for multimedia transmission and storage. In the past, image encoding techniques primarily evolved to cater to human visual perception. However, as the amount of image data has grown substantially, most images no longer need to be viewed by humans but are processed by computers and other intelligent devices. This shift has created a need for image encoding technologies that serve both human visual perception and computer vision. Consequently, in recent years many researchers have designed human-machine-friendly scalable image encoding schemes to meet these requirements, yet their number remains relatively limited. This paper presents a human-computer-friendly scalable image coding scheme based on the Canny edge detection algorithm. First, Canny edge detection is used to accurately and comprehensively extract image edges and details. Feature analysis is then applied, followed by a generative model that reconstructs the image from the extracted characteristics and added reference pixels. Under this scheme, the compact edge map establishes a scalable connection between human and machine vision: it serves as the base layer for machine-oriented vision tasks, while the reference pixels serve as an enhancement layer to ensure signal quality for human perception. By incorporating a sophisticated generative model, we trained a flexible network to rebuild the image from the compact feature representation and the reference pixels. Finally, extensive experiments demonstrate that the presented framework is superior to existing approaches in both human visual perception and computer vision tasks.
Yaqian Luo, Chao Yang, Ping An, Wenjing Ling, Xinpeng Huang
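The base layer of the described scheme is an edge map; extracting one with OpenCV's Canny detector is straightforward, though the file names and thresholds below are illustrative rather than the paper's settings.

```python
import cv2

# Extract a dense binary edge map to act as the machine-vision base layer.
img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, threshold1=100, threshold2=200)
cv2.imwrite("edge_layer.png", edges)
```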
Backmatter
Metadata
Title
Digital Multimedia Communications
Edited by
Guangtao Zhai
Jun Zhou
Long Ye
Hua Yang
Ping An
Copyright Year
2025
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-96-4279-3
Print ISBN
978-981-96-4278-6
DOI
https://doi.org/10.1007/978-981-96-4279-3