
Image and Graphics

13th International Conference, ICIG 2025, Xuzhou, China, October 31–November 2, 2025, Proceedings, Part I

  • 2026
  • Book

About this book

This three-volume set constitutes the proceedings of the 13th International Conference on Image and Graphics, ICIG 2025, held in Xuzhou, China, during October 31 – November 2, 2025. The 138 full papers in this book were carefully reviewed and selected from 420 submissions. The papers are organized in the following topical sections: artificial intelligence; machine learning; computer vision; pattern recognition; rendering; image manipulation; graphics systems and interfaces; image compression; shape modeling; biometrics; scene understanding; vision for robotics; scene anomaly detection; activity recognition and understanding; feature selection.

Table of Contents

Frontmatter

Artificial Intelligence

Frontmatter
SRG-Net: Semantic Relation-Guided Network for Commonsense Video Captioning

Commonsense video captioning requires the model not only to describe visible content but also to infer multiple types of commonsense captions, including “Intention”, “Effect”, and “Attribute”. Existing methods generally fall into two categories: one extracts commonsense directly from videos but struggles to bridge the semantic gap between visual content and implicit commonsense under limited knowledge; the other leverages language model-extended textual knowledge, which alleviates this gap but overlooks the semantic relationships among different types of commonsense information, limiting reasoning capability. To address these challenges, we propose a Semantic Relation-Guided Network (SRG-Net) for commonsense video captioning. Specifically, a Commonsense Semantic Relation Modeling (CSRM) module is designed to capture interrelations among different types of extended commonsense knowledge and enhance their representations. Furthermore, a Hierarchical Fusion Decoding (HFD) strategy is adopted. Multimodal video features are first fused, followed by the integration of enhanced commonsense representations, enabling the generation of accurate and fluent commonsense captions. Extensive experiments on the large-scale Video-to-Commonsense dataset demonstrate that SRG-Net achieves superior performance compared to existing methods across multiple metrics.

Zeyu Xi, Yijie Li, Haoying Sun, Haoran Zhang, Lifang Wu
EAANet: Edge-Aware Attention Network for Real-Time Road Scene Understanding

To address the accuracy-speed trade-off in semantic segmentation for resource-constrained driver assistance systems, this study proposes EAANet (Edge-Aware Attention Network). Building upon a Lightweight Progressive Scalable Network (LPSNet) as the baseline, we first embed Squeeze-and-Excitation (SE) channel attention into feature fusion layers to establish channel-wise adaptive feature selection. The Atrous Spatial Pyramid Pooling (ASPP) module is then enhanced through four parallel branches for improved multi-scale feature extraction. Further optimizations include reconstructing the decoder with depthwise separable convolutions and compressing the model size via parameter pruning. Experiments on the Cityscapes dataset demonstrate that EAANet achieves 76.7% mIoU, surpassing LPSNet by 3.2 percentage points while reducing parameters to 1.2M. When deployed on NVIDIA Jetson Xavier NX edge devices, it attains real-time inference at 33.5 frames per second, with specific accuracy improvements of 9.0% for vehicles and 14.7% for traffic signs. The proposed model significantly enhances critical object recognition while maintaining real-time performance, offering a cost-effective semantic segmentation solution for vehicular edge computing platforms.
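
As a rough illustration of how Squeeze-and-Excitation channel attention is typically embedded into a fusion layer (the abstract does not give implementation details; the channel count and reduction ratio below are assumptions, not the authors' code), a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard Squeeze-and-Excitation channel attention; the reduction ratio
    and its placement inside EAANet's fusion layers are assumptions."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # channel-wise adaptive feature selection

# Example: re-weight a fused feature map before it enters the decoder (shapes assumed).
fused = torch.randn(2, 128, 64, 128)
out = SEBlock(128)(fused)
```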

Chuyu Bai, Jianlin Yu, Xiaochun Lei, Zetao Jiang
Learning A Decomposition-Driven Two Stages Unfolding Artifact Removal Network for Compressed Images

In recent years, compression artifact removal techniques have garnered significant attention in the field of image compression. However, existing artifact removal methods typically adopt black-box networks for image enhancement. Although some explicable networks are more capable of reducing compression artifacts, they attach little importance to the role of image decomposition in network architecture design. To this end, we build a decomposition-driven two-stage artifact removal framework for compressed images, which is composed of an L1-norm and Low-Rank Constrained Structure Enhancement (LRC-SE) sub-optimization model and an L1-norm Constrained Structure-Texture Enhancement (LC-STE) sub-optimization model. These two models can be unfolded into two explicable networks, named the LRC-SE network and the LC-STE network. The LRC-SE network is proposed to enhance the structure map, which is used as the initialization in the LC-STE network. Additionally, we design a low-rank latent representation block to decompose high-dimension features for effective feature extraction and redundancy reduction. Extensive experimental results demonstrate that the proposed method achieves superior performance for compression artifact removal in terms of PSNR, PSNR-B, and SSIM.

Lijun Zhao, Jie Zhao, Jinjing Zhang, Hao Ren, Yong Zeng, Anhong Wang
Martingale-Based Skin Lesion Segmentation from Dermoscopic Images

Accurate skin lesion segmentation is of great significance for improving the quantitative analysis of skin cancer from dermoscopic images. However, lesion segmentation remains a challenging problem due to the large differences in color, location, size, shape, and boundary contrast of lesions. To address these difficulties, we construct a novel lesion segmentation algorithm based on martingales, which combines local and global information of the images. To incorporate the global information of the images into the segmentation process, we build a newly defined random power martingale (RPM) based on the statistical and structural features of the images. The unbiased nature of the martingale process optimizes the subtle boundaries and structural changes in the dermoscopic images. We compare our method with several outstanding algorithms and analyze them using commonly used evaluation metrics on the International Skin Imaging Collaboration 2016 (ISIC-16) skin lesion dataset. Visualization results and quantitative evaluation show that our method achieves superior performance.

Yao Lu, Yan Zhao, Shigang Wang, Jian Wei
Research on Adaptive Multi-layer Multi-pass Welding Technology for Medium-Thick Plates

The rapid development of industries such as heavy machinery, shipbuilding, and energy equipment has led to a sustained increase in the demand for medium-thick plate welding. However, conventional welding methods still heavily rely on manual operations or robotic teaching, resulting in low efficiency and poor consistency in weld quality. These limitations make it difficult to meet the high precision and productivity requirements of modern intelligent manufacturing. To address these challenges, this study proposes an intelligent multi-layer, multi-pass welding system for medium-thick plates, integrating deep learning and adaptive control. First, a deep convolutional neural network (CNN) based on the ResNet101 architecture was developed to automatically classify weld groove types, achieving a classification accuracy of 99.62%. Second, groove feature points were extracted with sub-millimeter precision using a Gaussian Mixture Model (GMM) clustering algorithm, enabling accurate analysis of geometric parameters. Finally, a multi-layer, multi-pass welding process system was designed with three key optimizations: adaptive adjustment of welding sequence and torch pose, dynamic compensation for wire stick-out, and compensation for thermal deformation and scanning errors. Experimental results demonstrate the system’s feasibility and effectiveness in industrial applications, significantly improving the level of welding automation. Manual intervention time was reduced by over 91.3%, and welding efficiency increased by 30%. This work offers a practical and engineering-ready solution and establishes a technical paradigm for the intelligent transformation of welding processes under the framework of smart manufacturing.
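
The abstract names a Gaussian Mixture Model clustering step for groove feature point extraction without further detail; the sketch below, on synthetic laser-scan data, is only one plausible way to realize such a step (the simulated profile, the number of components, and the edge/root heuristic are all assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Simulated laser-scanner profile of a V-groove: (y, z) points along one scan line.
y = np.linspace(-20.0, 20.0, 400)
z = np.where(np.abs(y) < 8.0, np.abs(y) * 0.6, 4.8) + np.random.normal(0, 0.05, y.size)
profile = np.column_stack([y, z])

# Fit a GMM; three components are assumed (left plate, groove, right plate).
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(profile)
labels = gmm.predict(profile)

# A simple (assumed) heuristic: the groove cluster has the lowest mean height,
# and its extreme points serve as the groove edge and root feature points.
groove = profile[labels == np.argmin(gmm.means_[:, 1])]
left_edge = groove[groove[:, 0].argmin()]
right_edge = groove[groove[:, 0].argmax()]
root = groove[groove[:, 1].argmin()]
print("groove edges:", left_edge, right_edge, "root:", root)
```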

Ruiyun Zhong, Xiaoteng Wang, Xili Dai
M3E: Mixture of Multi-scale Multi-modal Experts for Time Series Forecasting

Recent research has shown that large language models (LLMs) can be effectively used for real-world time series forecasting due to their strong natural language understanding capabilities. However, these approaches face a fundamental challenge: LLMs operate on discrete tokens, while time series data is continuous. Meanwhile, recent works leverage pre-trained visual masked autoencoders (visual MAE) to construct time series forecasting foundation models. Nevertheless, converting time series into images disrupts the critical temporal dependencies of time series. Additionally, it is intuitive that time series sampled at different scales exhibit distinct patterns: microscopic information is reflected at fine scales, while macroscopic information is reflected at coarse scales. Based on these observations, we first generate multi-scale time series through downsampling to capture diverse temporal patterns, then we design a novel dual-modality encoding framework for long-term time series forecasting, consisting of an LLM encoding branch for discrete semantic reasoning and a visual MAE encoding branch for continuous representation learning. To effectively leverage the complementary strengths of both the LLM encoding branch and the visual MAE encoding branch, we propose a mixture of multi-scale multi-modal experts (M3E) to fuse features from the LLM with features from the visual MAE. Furthermore, M3E adaptively selects key-scale features from the multi-scale features of the LLM and visual MAE branches to reduce computational costs, and facilitates multi-scale feature interaction, which enables the capture of both short-term details and long-term patterns. Extensive experiments on six real-world datasets demonstrate that M3E is a powerful time series model, outperforming state-of-the-art methods.

Shaobo Xie, Han Jiang, Chenlin Zhao, Xiaoshan Yang
PoseCLR: Bridging 2D and 3D Pose Representations via Contrastive Learning for Action Recognition

Graph Convolutional Networks have emerged as a pivotal method in skeleton-based action recognition, demonstrating exceptional performance across multiple benchmarks. However, the inherent measurement errors in skeleton data—particularly the joint position estimation errors caused by occlusion—severely limit the recognition accuracy of existing models. To address this issue, this paper proposes a multimodal contrastive learning framework, PoseCLR, which effectively mitigates the impact of joint errors by leveraging the complementary characteristics of 2D and 3D skeleton data. Specifically, 3D skeleton data provides rich spatial information in three dimensions, while 2D skeleton data preserves more precise two-dimensional joint coordinates. The feature interaction between these two modalities achieves error compensation and information enhancement. Furthermore, this work supplements complete information on joint motion and bone motion, significantly improving the feature representation capability of individual modalities and thereby enhancing the performance of multi-stream score fusion. Experimental results demonstrate that the proposed method achieves breakthrough performance improvements on three mainstream datasets—NTU RGB+D, NTU RGB+D 120, and NW-UCLA—reaching the level of current state-of-the-art methods.
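
The paper's exact contrastive objective is not reproduced here; the following is a minimal sketch of a symmetric InfoNCE-style loss between 2D and 3D skeleton embeddings, which is one common way to realize such cross-modal alignment (the embedding dimension, batch size, and temperature are assumptions):

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(z2d: torch.Tensor, z3d: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matching 2D/3D clips of the same sample are positives,
    all other pairs in the batch act as negatives."""
    z2d = F.normalize(z2d, dim=1)
    z3d = F.normalize(z3d, dim=1)
    logits = z2d @ z3d.t() / tau                  # (B, B) cosine-similarity logits
    targets = torch.arange(z2d.size(0), device=z2d.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example with assumed 256-dimensional embeddings from the 2D and 3D GCN branches.
loss = cross_modal_infonce(torch.randn(32, 256), torch.randn(32, 256))
```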

Jianlong Lu, Weiyu Yu, Jinquan Chen, Dandan Qi, Shuang Gong
Art3D-Fusion: A Hybrid Framework for Visual Synthesis with Artistic Control

Traditional visual synthesis methods suffer from blurring, ghosting, and significant artistic deprivation due to their sole reliance on physical disparity reconstruction. These methods fail to capture the artistic elements that directors carefully design in classic 3D films. To address these issues, we propose Art3D-Fusion, a hybrid framework combining geometric processing with advanced diffusion models. This framework integrates physical depth information with global artistic control, particularly in 0-plane selection. Through depth estimation and optical flow-based disparity matching, we create a pseudo-realistic disparity map that reflects the director’s artistic adjustments. This approach enables precise depth-to-disparity transformation and ensures natural detail restoration and effective occlusion handling. Art3D-Fusion generates high-quality right-view images by driving a diffusion model under an enhanced ControlNet architecture using the artistic disparity map and original left-view image as dual-conditional inputs. This approach accurately represents the director’s artistic vision and allows for the transfer of artistic styles from classic 3D films to new scenes, ensuring the consistent reproduction of different directors’ artistic styles through the estimation and transfer of camera parameters.

Kohou Wang, Ping Chen, Zhaoxiang Liu, Xiang Liu, Zezhou Chen, Huan Hu, Xin Wang, Kai Wang, Shiguo Lian
Lesion Localization Prior-Driven Few-Shot Learning for Branch Atheromatous Disease Diagnosis

Early diagnosis of Branch Atheromatous Disease is critical for reducing associated disability rates. Recent advancements underscore the potential of deep learning methods in developing automated diagnostic tools. However, the limited availability of clinical data poses a significant challenge to their practical application. Traditional deep learning models, which rely on large-scale datasets, perform poorly in few-shot scenarios, limiting their effectiveness in this context. To address these challenges, we propose a few-shot learning framework for Branch Atheromatous Disease diagnosis that leverages Lesion Localization prior knowledge. This approach incorporates domain expertise to guide data augmentation, effectively mitigating the issue of limited training data. Specifically, our framework is developed based on 251 BAD slices, 51 Non-BAD slices, and 400 lesion-free slices obtained from preprocessed clinical DWI images. Furthermore, we introduce a two-stage training strategy and an adapter module for parameter-efficient fine-tuning, enabling effective model optimization even with constrained data. Our method was evaluated on clinical cases from multiple medical centers, demonstrating superior diagnostic accuracy and robustness compared to various baseline models. These results highlight the potential of our approach to enhance the efficiency and accuracy of early BAD diagnosis and alleviate clinical diagnostic workload.

Kaijun Zhang, Shengde Li, Shengpei Wang, Jie Peng, Shangyi Shi, Bin Peng, Jun Ni, Huiguang He
Deep Multi-sentence Aligned Cross-Modal Retrieval

Current cross-modal retrieval models predominantly rely on one-to-one image-text pairs for training, with most approaches projecting each sample into a single embedding vector. However, recent studies indicate that this single-vector representation fails to capture the inherent complexity of images and texts, resulting in the loss of critical semantic information during the embedding process. Consequently, these models often struggle to effectively learn the nuanced features of both modalities. In this paper, we propose a novel training strategy that transitions from traditional one-to-one image-text pairing to a one-to-many framework and introduce an innovative Set Prediction Module with Weight to better capture the diverse semantics of the input data. By incorporating a more diverse set of textual representations during training, our method significantly enhances the performance of cross-modal retrieval models. Extensive evaluations across various backbone networks on the COCO and Flickr30K datasets demonstrate that our approach consistently outperforms most existing methods.

Zhijian Lin, Sihan Gong, Xueliang Liu
Single-Layer Denoising Taylorformer for UAV Nighttime Tracking

The automation of unmanned aerial vehicles (UAVs) has been driven in large part by vision object tracking methods with onboard cameras. However, random and complex real-world noise in captured imagery severely degrades the performance of state-of-the-art (SOTA) UAV trackers, especially under low-illumination conditions. To address this challenge, we propose a prompt-guided Taylorformer and design a plug-and-play, single-layer denoising network (SDT) aimed at suppressing heterogeneous noise and thereby improving UAV tracking performance. Specifically, our lightweight single-layer architecture employs minimal network depth to reduce computational overhead. We introduce prompt-guided Taylor self-attention (PTSA) and prompt-guided Taylor cross-attention (PTCA) to form a raw feature extraction (RFE) encoder and a multi-feature fusion (MFF) decoder, respectively, enhancing both feature extraction and fusion capabilities. In addition, we develop a multi-scale feed-forward network (MSFN) that more effectively leverages noise information across multiple receptive fields to further optimize network performance. Extensive experiments demonstrate that our proposed SDT achieves significant denoising efficacy and substantially enhances UAV nighttime object tracking accuracy.

Zihao Su, Haijun Wang, Lihua Qi
Position-Aware Text-to-Image Generation with Efficient Controllability

In recent years, Stable Diffusion has significantly advanced the quality of text-to-image generation. However, accurately interpreting and representing spatial layouts specified by text prompts remains challenging. Existing approaches typically rely either on extra grounding information or utilize LLMs (large language models) combined with layout-controllable models, which suffer from high computational costs. To address these limitations, we propose a novel method, SamLayGe, a lightweight and efficient layout generation model designed for seamless integration into existing text-to-image pipelines. SamLayGe autonomously generates comprehensive layouts without requiring explicit user inputs, thus surpassing current LLM-based and layout-controllable approaches in terms of versatility and efficiency. Furthermore, we propose LayGeBench, a benchmark dataset addressing ambiguities in spatial descriptions of prior datasets. Extensive evaluations demonstrate that SamLayGe consistently produces images that accurately adhere to textual layout descriptions, achieving superior performance in terms of both accuracy and computational efficiency. Code is available at https://github.com/shenlanzhuanshu/caption-to-positional-layout.

Junchao Gu, Xiangyu Wang, Yuchen Du, Hao Chen
Introducing DINOv2 for Medical Image Boundary Tracking

Medical image segmentation plays an important role in areas such as clinical care and medical atlas construction, but existing 2D segmentation models often face challenges in 3D organ or tissue segmentation tasks, while 3D models consume a large share of computing resources. Therefore, we treat 3D tissues and organs as video sequences and use the given first layer of tissue or organ boundaries as a sequence of query points to predict matching points on subsequent slices. Based on this idea, we propose a point tracking architecture optimised on the basis of the joint point tracking model CoTracker, which compensates for the shortcomings of CNN-based architectures in global feature information extraction and improves the robustness of the overall image features by incorporating the robust, strongly generalisable DINOv2 encoder. The model uses semantic features extracted by the self-supervised learning foundational model DINOv2-ViT for feature fusion with ResNet, and optimises them with a fine-tuning strategy based on the CoTracker weights. Specifically, we introduce the channel attention mechanism to make full use of the Vision Transformer's feature recognition capability, and achieve feature optimisation by filtering high-weighted channels, thus improving the accuracy of several evaluation metrics in the point tracking domain. Extensive evaluation results show that our approach not only greatly saves training resources, but also efficiently improves tracking accuracy and has good generalisation in the field of medical image segmentation.

Jing Chen, Gangming Zhao, Jun Wu, Chong Tian, Chao Liu, Lei Qu
Adaptive Pruning and Cross-Domain Feature Fusion for Robust Object Tracking

Although Transformer-based methods have become the mainstream solution for single object tracking tasks, their lack of spatial inductive bias makes it difficult for them to effectively distinguish between targets and background in complex scenes. Moreover, relying on positional correlations between patches often causes the model to attend to local regions instead of the entire target, resulting in tracking box drift. To address these issues, we propose a coarse-to-fine target localization approach for object tracking, named Adaptive Pruning Cross-domain Fusion (AdaCF) Tracking. Firstly, the last-rank elimination attention selection strategy, called Routing Attention, is employed to adaptively prune background information that is dissimilar to the target features. This mechanism not only enhances the model's ability to distinguish the target from the background by reducing background noise interference, but also significantly improves inference efficiency by eliminating redundant computations. Secondly, in the cross-domain fusion module, we first apply the Discrete Cosine Transform (DCT) to map spatial information into the frequency domain. Then, we implement joint spatial-frequency enhancement. This enables the model to further distinguish between the target and interfering information, such as accompanying objects, similar backgrounds, or look-alike objects. Thirdly, we introduce the Efficient Intersection over Union Loss (EIOU), which significantly accelerates model convergence and improves the localization accuracy of anchor boxes. The experimental results on the GOT-10k dataset demonstrate that our method significantly improves tracking performance, with AO increased by 3.3% and SR0.5 by 4.1%, validating its enhanced stability and adaptability in complex scenes.
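
To make the frequency-domain half of the described cross-domain fusion concrete, the sketch below maps a feature map into the DCT domain, re-weights its coefficients, and maps back; the band split and gains are illustrative assumptions and not the paper's actual enhancement rule:

```python
import numpy as np
from scipy.fft import dctn, idctn

def frequency_enhance(feat: np.ndarray, low_gain: float = 1.0, high_gain: float = 1.5) -> np.ndarray:
    """Per-channel 2-D DCT, boost of higher-frequency coefficients, inverse DCT.
    The radial low/high split threshold and the gains are assumptions."""
    c, h, w = feat.shape
    uu, vv = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    high = (uu / h + vv / w) > 0.25              # crude low/high frequency mask
    out = np.empty_like(feat)
    for i in range(c):
        coeff = dctn(feat[i], type=2, norm="ortho")
        coeff = np.where(high, coeff * high_gain, coeff * low_gain)
        out[i] = idctn(coeff, type=2, norm="ortho")
    return out

enhanced = frequency_enhance(np.random.randn(8, 32, 32).astype(np.float32))
```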

Jing Wen, Xufeng Li, Songsong Zhang, Yujun Wu
Data Leakage Detection in Large Vision-Language Models via Multimodal Perturbation

The training data for large vision-language models (LVLMs) is typically sourced from large-scale corpora, which may inadvertently include copyrighted or sensitive content, raising concerns about private data leakage. However, detecting such leakage remains challenging due to the opaque internal mechanisms of LVLMs. We observe that memorization in large language models (LLMs) can cause LVLMs to generate distinct responses to seen versus unseen inputs. This discrepancy offers a viable signal for privacy leakage detection. In this paper, we propose DLD-MP, a novel framework for Data Leakage Detection in LVLMs through Multimodal Perturbation. DLD-MP comprises three components: Multi-Level Image Perturbation (MLIP), Key Semantic Mask-based Text Perturbation (KSMTP), and a Leakage Evaluator (LE). Given an image-text pair, MLIP applies perturbations to the image at multiple semantic levels, while KSMTP selectively masks key semantic tokens within the corresponding text. The perturbed inputs are then fed into the target LVLM to perform masked text prediction and vision-language understanding tasks. LE assesses the model’s responses against predefined rules to determine whether the input is used during training. Extensive experiments on multiple benchmarks demonstrate the effectiveness of DLD-MP.

Xin Wang, Zhaoxiang Liu, Yue Zhan, Kaikai Zhao, Kai Wang, Shiguo Lian
A Novel Dual-Branch Cross-Attention Transformer Network for Low-Dose CT Denoising

Low-dose computed tomography (LDCT) has attracted widespread attention in medical imaging due to its significant reduction in radiation exposure. However, compared to normal-dose CT (NDCT) images, LDCT images often contain considerable noise and artifacts, which severely affect diagnostic accuracy. In this paper, a novel dual-branch cross-attention transformer denoising network (DCANet) is proposed for LDCT denoising, which decouples and models the input images in different feature spaces through two complementary branching structures, and achieves feature complementarity and synergistic optimization through the fusion mechanism to improve overall characterization capability. The proposed DCANet includes Residual Triple Attention Blocks (RTAB) and a Cross-Attention Transformer module (CAformer), applied to the upper and lower branches respectively, effectively enabling the synergistic fusion of local detail enhancement and global structural modeling. Additionally, a joint optimization strategy using perceptual loss and Charbonnier loss is adopted, enabling the method to efficiently suppress noise while accurately preserving key structural and textural information in the images. Experimental results on the Mayo LDCT dataset demonstrate that the proposed DCANet achieves significant improvements in both quantitative metrics and perceptual quality.
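
The joint perceptual plus Charbonnier objective mentioned above can be sketched as follows; this is not the authors' code, the loss weight and the toy feature extractor standing in for their (unspecified) pretrained network are assumptions:

```python
import torch
import torch.nn as nn

def charbonnier(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier loss: a differentiable, outlier-robust variant of L1."""
    return torch.mean(torch.sqrt((pred - target) ** 2 + eps ** 2))

def joint_loss(pred, target, feat_extractor: nn.Module, lam: float = 0.1) -> torch.Tensor:
    """Charbonnier pixel term plus a perceptual term computed in a fixed feature space.
    `feat_extractor` stands in for a pretrained network; lambda is an assumed weight."""
    pixel = charbonnier(pred, target)
    with torch.no_grad():
        feat_t = feat_extractor(target)
    perceptual = nn.functional.l1_loss(feat_extractor(pred), feat_t)
    return pixel + lam * perceptual

# Example with a toy feature extractor in place of, e.g., a pretrained VGG.
toy_feat = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
loss = joint_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64), toy_feat)
```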

Yuqin Li, Mengcheng Huang, Xu Wang, Fei He, Zhengang Jiang
TCGFNet: Multi-scale Transformer-Convolution with Geometry-Guided Feedback for Robust Point Cloud Denoising

Point-cloud denoising should suppress noise without harming geometry, yet classical approaches often oversmooth complex object surfaces, erasing fine details. Although recent learning-based techniques alleviate these shortcomings, they still face two critical limitations: (i) single-scale feature extractors without explicit positional encodings struggle to capture long-range dependencies, and (ii) stage-wise pipelines that update normals and coordinates separately accumulate errors. We propose an end-to-end framework that fuses multi-scale convolutions with a position-encoded Transformer and a geometry-guided feedback loop. Convolutions capture local detail, the Transformer models global relations, and residual fusion unifies their features. A key-point selector, driven by normal orthogonality and cross-scale agreement, retains only high-confidence points, while a bidirectional module jointly refines coordinates and normals under a geometric-consistency loss. On synthetic CAD and non-CAD datasets corrupted with 0.6–2.0% Gaussian noise, the method achieves lower Chamfer and point-to-surface distances than five state-of-the-art baselines. Ablations confirm each component's value, and qualitative results preserve sharp edges and high curvature, providing an efficient, practical solution for point-cloud denoising.
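
For reference, the Chamfer distance used as one of the evaluation metrics can be computed as in this plain NumPy sketch (the paper's exact normalization and whether squared or root distances are averaged may differ):

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between two point sets of shape (N, 3) and (M, 3),
    using squared nearest-neighbour distances averaged in both directions."""
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)   # (N, M) pairwise squared distances
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

denoised = np.random.rand(1024, 3)
clean = denoised + 0.01 * np.random.randn(1024, 3)
print(chamfer_distance(denoised, clean))
```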

Kekun Jin, Xiwu Shang, Guoping Li, Guozhong Wang
Adversarial Iterative Pre-enactment Framework for Air Combat Based on Mental Simulation Theory

Current air combat decision methods mostly lack structured adversarial validation and adaptive reflection mechanisms, limiting their robustness and reliability when facing deceptive or unfamiliar enemy strategies. To address this, we propose the Adversarial Iterative Pre-Enactment (AIP) framework, which integrates cognitive mental simulation with a large language model. First, we design a self-adversarial pre-enactment–feedback loop that enables the model to simulate and evaluate both friendly and enemy actions before execution. Second, we introduce a multiscale segmented evaluation mechanism that analyzes tactical effectiveness across different time horizons and perspectives. Third, we build a high-fidelity simulation environment with scenario rewind capability, enabling real-time dual-strategy execution and supporting structured adversarial evaluation. Experimental results demonstrate the superiority of AIP over standard LLM baselines in terms of both tactical score and stability under complex adversarial conditions.

Songde Han, Xiao Zhang, Tianyu Hu, Huimin Ma
SA-Pillar: Structure-Aware Feature Learning for Real-Time 3D Object Detection

Single-stage 3D object detection based on pillar structures has gained attention for its high inference efficiency in autonomous driving. However, the quantization of point clouds into pillars often leads to the loss of fine-grained structural details, limiting performance in detecting small objects under sparse scenarios. To overcome this challenge, we propose an efficient LiDAR-based detection algorithm that incorporates multi-source feature encoding and structure-aware enhancement. Specifically, we design a Structure-Aware Feature Encoding (SAFE) module that integrates intra-pillar attention, inter-pillar structural modeling, and height histogram encoding to improve the geometric representation of pillar features. The backbone further incorporates a large kernel attention mechanism to capture long-range dependencies, along with an atrous spatial pyramid pooling module and a weighted dual-scale feature fusion module to strengthen semantic expressiveness and detection accuracy. The proposed method is evaluated on the KITTI dataset, with extensive visualizations and ablation studies. Results show an improvement of 3.9% in mean 3D average precision (mAP) over the baseline, including 3.66% and 4.59% gains for pedestrian and cyclist categories. The model runs at 34.2 FPS, demonstrating both accuracy and efficiency for real-time applications.

Shanshan Huang, Wenkang Chen, Zhengrong Xu, Xuejun Zhang
Knowledge-Aware Intent Subgraph Learning for Recommendation

Knowledge Graph (KG)-augmented recommendation mitigates the cold-start issue by exploiting complex semantic clues in users' behaviors. However, existing methods focus on bringing this auxiliary knowledge into the user-item interaction space, ignoring the gap between different sources. In this study, we propose knowledge-aware intent subgraph learning (KISL), which mines users' intents with the KG to promote fine-grained interest learning for personalized recommendation. We model each intent representation as an attentive combination of KG relations on the knowledge graph. Guided by the intents, KISL devises a dimensional disentanglement to divide the interaction graph into several augmented intent-aware subgraphs. Fine-grained personalized embedding is learned during subgraph message propagation to predict users' interactions. Extensive experiments on two public datasets demonstrate the effectiveness of KISL's knowledge-aware intent modeling over baselines.

Langchen Lang, Meng Jian, Wei Zhou
PF-DETR: Enhanced DETR with Pre-encoded Feature Fusion for Small and Multi-scale Object Detection in UAV Imagery

Object detection in UAV imagery presents a significant challenge due to low image resolution, complex visual backgrounds, and substantial inter-object scale variation. Detection Transformer (DETR)-based models often underperform in such scenarios. This paper proposes Pre-Fusion Guided DETR (PF-DETR), a refined and enhanced extension of the D-FINE model, specifically designed for robust small-object detection in UAV imagery. PF-DETR introduces a novel Pre-Encoded Feature Fusion (PEFF) module, which employs a unidirectional fusion strategy to integrate low-level spatial features into high-level semantic features, thereby improving the model’s ability to capture fine-grained details of small objects. Additionally, a Feature Enhancement Attention (FEA) mechanism is incorporated to strengthen feature representation and reduce background interference. To address information loss during multi-scale processing, a Haar Wavelet Downsampling (HWD) module is proposed, combining Haar wavelet transforms with convolutional downsampling to better preserve critical features. Furthermore, the original GIoU loss is replaced with an Enhanced IoU (EIoU) loss function to improve localization accuracy and training efficiency. Experimental results on the challenging VisDrone2019 benchmark demonstrate that PF-DETR outperforms state-of-the-art detectors, including DEIM, D-FINE, YOLOv10, and YOLOv11, in both detection accuracy and robustness. Specifically, PF-DETR achieves an AP50 of 40.6% and an AP of 23.6%, representing improvements of 2.0% and 1.5% over D-FINE, 2.7% and 1.8% over DEIM, and notable gains of 7.6% and 8.3% in AP50 compared to YOLOv11n and YOLOv10s, respectively. These results underscore the strong potential of PF-DETR for real-world UAV applications requiring precise small-object detection under challenging conditions.
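
The Haar Wavelet Downsampling module described above can be illustrated with one level of the 2-D Haar transform whose four sub-bands replace strided pooling; the channel mixing by a 1x1 convolution and the channel counts below are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class HaarDownsample(nn.Module):
    """One level of the 2-D Haar transform: LL/LH/HL/HH sub-bands halve the spatial
    resolution and are concatenated on the channel axis, then mixed by a 1x1 convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.mix = nn.Conv2d(4 * in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = x[:, :, 0::2, 0::2]   # top-left of each 2x2 block
        b = x[:, :, 1::2, 0::2]   # bottom-left
        c = x[:, :, 0::2, 1::2]   # top-right
        d = x[:, :, 1::2, 1::2]   # bottom-right
        ll = (a + b + c + d) / 2  # low-frequency approximation
        lh = (-a - b + c + d) / 2
        hl = (-a + b - c + d) / 2
        hh = (a - b - c + d) / 2  # detail sub-bands preserve edges of small objects
        return self.mix(torch.cat([ll, lh, hl, hh], dim=1))

y = HaarDownsample(64, 128)(torch.randn(1, 64, 80, 80))   # -> (1, 128, 40, 40)
```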

Lina Huo, Jianing Qian, Wei Wang, Benchao Guo
Selective Labeling for 3D Shape Label Transfer Based on Local-Global Features

For deep learning models, the selection of samples for annotation has a profound impact on the training results, particularly when working with limited annotation budgets. This problem is challenging and underexplored. Our goal is to identify the most important data for annotation that can be used for transferring shape labels for 3D point clouds. Intuitively, the selected 3D shapes should be representative and cover local and global shape structure variations. Towards this goal, we introduce a stochastic optimization strategy that uses both local and global features to select the most relevant 3D shapes for annotation. Our method is effective: extensive experiments demonstrate that it achieves superior performance compared with state-of-the-art methods on two 3D shape label transfer tasks, namely 3D shape segmentation label transfer and key point transfer.

Zhigeng Pan, Xin Zheng, Nenglun Chen, Rongjin Zou

Biological and Medical Image Processing

Frontmatter
MAA-Net: A Multi-attention Aggregation Network for Segmentation of Key Structures in Microvascular Decompression

Microvascular decompression (MVD) plays a critical role in the treatment of neurovascular compression-related diseases, with its success heavily dependent on the precise preoperative identification of key anatomical structures, especially small-volume and densely distributed tissues like nerves and vessels. To address this challenge, we propose a multi-attention aggregation network (MAA-Net), for the segmentation of MVD-related structures in MRI images. The method is based on the U-Net architecture and incorporates two attention mechanisms. A spatial-channel parallel attention module at the bottleneck jointly models spatial and channel dependencies to better capture complex interwoven anatomical structures, particularly in regions where nerves and cerebral vessels are intertwined. In addition, a lightweight gated attention module is inserted between the encoder and decoder to improve the perception of small-volume nerve targets by promoting feature selectivity and suppressing background noise, especially when processing fine nerve structures. We evaluated the proposed method on a private clinical dataset covering the brainstem, nerves, cerebral vessels, and cerebellum. The results demonstrate that our method achieves superior performance in volume overlap metrics (e.g., Dice score), delivers precise boundary delineation in distance-based metrics (HD95 and ASD), and exhibits strong recognition ability for elongated structures as reflected in the clDice score, confirming its practical value for preoperative MVD localization.

Jinghua Yue, Fugen Zhou, Qinglong Yao, Bo Liu, Yulian Zhang, Xueke Zhen, Yanbing Yu
Contrastive Hierarchical Graph Based Multiple Instance Learning for Fundus Screening

Fundus imaging techniques, such as Fundus Fluorescein Angiography (FFA) and Optical Coherence Tomography (OCT), serve as pivotal diagnostic tools for retinal disease detection. These modalities reveal intricate variations in retinal tissue structure, thereby aiding clinicians in accurate diagnosis and treatment planning. Recent advancements in Multiple Instance Learning (MIL) have transformed the analysis of fundus image datasets by effectively extracting discriminative features from image-level key instances. However, a critical limitation of conventional MIL methods lies in their neglect of region-level key instances (e.g., localized lesions) within the 2-D spatial domain, particularly in fundus imaging where pathological regions are often sparse. To address this gap, we propose a Contrastive Hierarchical Graph based Multiple Instance Learning (CHG-MIL) framework, which integrates three novel components: (1) a Spatial Instance Graph (SIG) module that preserves 2-D spatial topology and mines contextual relationships among neighboring region-level instances; (2) a Hierarchical Instance Interaction (HII) module that utilizes deeper semantic representations to refine shallow-layer features through cross-hierarchy guidance; and (3) a Hierarchical Contrastive Loss (HCL) designed to suppress redundant, non-disease-related features in nodes and instances. Extensive experiments on the APTOS2023 and GAMMA datasets demonstrate the superiority of our proposed method over existing state-of-the-art MIL methods.

Yubo Tan, Shiye Wang, Wenda Shen, Yong-Jie Li
Polyp Segmentation Based on Edge Guidance

Accurate segmentation of polyps in colonoscopy images is of vital importance for the early diagnosis and treatment of colorectal cancer (CRC). However, accurate segmentation is a challenge due to the diversity of polyps in size and shape as well as unclear boundaries. To this end, we propose a novel edge-guided network (EGNet) for polyp segmentation, which achieves cross-level feature fusion by effectively utilizing edge information and enhances the focus on polyp edges. Specifically, we first propose a multi-scale enhancement module (MSEM), which strengthens the feature representation capability through the interaction between convolutional kernels of different sizes. Next, we propose an edge-aware guided module (EAGM), which extracts more discriminative edge features and enhances the model’s sensitivity to edge details. Finally, we propose a cross-level fusion module (CLFM) that integrates contextual cues from different levels to enhance the semantic perception of target regions in complex backgrounds. Our experimental results on four benchmark datasets show that EGNet outperforms other state-of-the-art methods.

Yulong Bai, Xiuhong Li, Kuan Wang, Boyuan Li, Haodong Zeng, Wenjing Guo
A Deep Unfolding Based on U-Net Graph-Guided Hybrid Regularization Method for Bioluminescence Tomography

Due to the strong scattering and low absorption of light in biological tissues, the inverse problem of bioluminescence tomography (BLT) is ill-posed. Hybrid regularization constraints can effectively alleviate the inherent ill-posedness of the BLT inverse problem. However, the hybrid regularization of traditional algorithms involves multiple parameter determinations, and it is difficult to select these parameters manually. This paper proposes a deep unfolding method based on U-Net graph-guided hybrid regularization (DUnet-GGHR) for BLT. The gradient update step in the traditional GGHR algorithm is reformulated as a flexible gradient descent update module for automatic learning. At the same time, the soft-threshold calculation step in the traditional GGHR solution process is expanded into a proximal mapping module for training. A series of reconstruction experiments verify that the proposed DUnet-GGHR model performs better than other end-to-end deep unrolling comparison methods in tumor localization, morphology restoration, and energy recovery. Moreover, combining the mathematical process with the deep neural network improves the interpretability and generalization of the network, reduces the amount of training data required, and speeds up computation.

Wei Lyu, Mengxiang Chu
CMambaR: Cardiac Phase Embedded Vision Mamba for Accelerating Cardiac MRI Reconstruction

Cardiac Magnetic Resonance Imaging (CMR) is a crucial clinical imaging modality for assessing cardiac morphology and function, and it has become the gold standard for diagnosing cardiovascular diseases. However, its widespread clinical adoption is hindered by long acquisition times and high costs—challenges that are particularly acute in dynamic imaging, where both high spatial and temporal resolutions must be achieved within a limited timeframe. In this paper, we propose a novel dynamic deep unrolling method, CMambaR, a cardiac phase-embedded Vision Mamba architecture designed to accelerate cardiac MRI reconstruction. The proposed method integrates the strengths of unfolded iterative optimization with a spatiotemporal dynamic reconstruction network, enabling it to effectively capture complementary information embedded in dynamic sequences while leveraging physics-based priors to deliver high-quality reconstruction. Inspired by structured state space models, we design a local enhanced vision Mamba module as the core building block of our network, capable of capturing both local details and long-range dependencies. Furthermore, we introduce a cardiac phase fusion mechanism that incorporates cardiac phase prior into the reconstruction process, further enhancing reconstruction performance. Extensive experiments on two cardiac datasets demonstrate that our method achieves high-fidelity image reconstruction and consistently outperforms existing approaches.

Bangjun Li, Jingchuan Wang, Mengli Xue, Yujun Li, Zhi Liu
SC-DSE-nnUNet: An Efficient Hippocampus MRI Segmentation Method

To address the issues of poor accuracy caused by complex structures and noise interference in hippocampus MRI segmentation, an improved nnU-Net segmentation algorithm is proposed. First, the introduction of Self-Calibrated Convolutions enhances the ability to capture irregular targets and edge details. Second, an adaptive threshold-constrained dynamic channel attention mechanism is proposed to optimize the allocation of channel weights, strengthening the features of the target region while effectively suppressing noise interference. Experimental results show that the proposed algorithm significantly outperforms nnU-Net in terms of Dice coefficient, IoU, and sensitivity. On the MSD Hippocampus and LPBA40 datasets, the DSC improved by 1.72% and 4.52%, the IoU improved by 2.89% and 5.4%, and the recall rate improved by 1.55% and 4.21%, respectively. Further visualization analysis of the segmentation results confirms that the proposed algorithm demonstrates excellent performance in segmenting complex structures such as hippocampal boundaries, vascular bifurcations, and lung infection regions.
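
The reported gains are stated in Dice, IoU, and recall; for reference, the two overlap metrics are computed as in the following NumPy sketch (binary masks assumed, multi-class evaluation would average per class):

```python
import numpy as np

def dice_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """Dice coefficient and IoU for binary segmentation masks of identical shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = (2 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, gt).sum() + eps)
    return float(dice), float(iou)

pred = np.random.rand(64, 64, 64) > 0.5
gt = np.random.rand(64, 64, 64) > 0.5
print(dice_iou(pred, gt))
```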

Bowen Xiao, Yu Ma
Spatiotemporal Feature Fusion for Glioblastoma Recurrence Prediction Using Mamba-Based Dual-Stream Framework

Glioblastoma (GBM) is the most aggressive glioma with a 5-year survival rate of only 6.8% and a median overall survival of 8 months. Accurate prediction of recurrence is essential for personalized treatment. Current imaging-based approaches face two major limitations: 1) reliance on single-time point images fails to capture tumor dynamics, and 2) the limited receptive fields of CNN and the quadratic complexity of Transformer hinder effective 3D MRI processing. To address these limitations, we leverage Mamba’s linear computational complexity and global modeling capabilities, and our model employs the following techniques: 1) a dual-stream framework for prediction of recurrence; 2) the BiSMamba module for multi-scale feature extraction; 3) the multi-stage fusion module for capturing dynamic changes. Extensive experiments on two public datasets (RHUH-GBM and LUMEIERE) have shown that our approach achieves impressive results in prediction of glioblastoma recurrence.

Chengwei Chen, Dong Huang, Yao Zheng, Jie Wei, Yuefei Feng, Tianci Liu, Junmei Feng, Yang Liu
Automatic and Fast Segmentation of Cochlear Implant-Induced Artifacts in MR Images Using Deep Learning

Magnetic Resonance Imaging (MRI) plays a vital role in medical and biological applications. However, for patients with MRI-compatible implantable devices such as cochlear implants, the presence of integrated magnets often leads to large signal voids and severe artifacts, significantly compromising diagnostic accuracy. Although recent advances in deep learning have shown promise in artifact reduction and image enhancement, the quantitative assessment of artifact regions still heavily relies on manual annotation, which is labor-intensive and inconsistent. In particular, boundary distortions and tissue loss near cranial regions pose significant challenges for accurate artifact delineation, limiting the effectiveness of existing segmentation methods. To address these issues, we propose a novel 3D artifact segmentation framework that integrates reflective registration into a deep neural network combining U-Net and Transformer architectures. We conducted experiments on MRI data from 5 real-world patients with cochlear implants. Experimental results demonstrate that our method achieves state-of-the-art performance in implant-induced artifact segmentation, offering an efficient and reliable solution for automatic artifact evaluation in clinical settings.

Longtao Ma, Kaiyu Zhao, Siqi Gao, Lanyin Hu, Jintao Wei, Sui Huang, Yuan Li, Jiehua Ma, Hongjian He

Color and Multispectral Processing

Frontmatter
End-to-End Diffusion Models with Physics Priors for Enhanced Spectral Super-Resolution

Spectral super-resolution aims to reconstruct high-dimensional hyperspectral images from low-dimensional multispectral or RGB inputs, enabling rich spectral information recovery for downstream vision tasks. In this paper, we propose EDSSR, a diffusion-based framework that reconstructs high-quality hyperspectral images by modeling complex spectral variations through an end-to-end trained sampling process. EDSSR incorporates a pretrained diffusion model as a prior to guide the reconstruction and improve spectral consistency. Additionally, a Physics-Guided Module is introduced to inject physical constraints into the U-Net backbone, enhancing high-frequency detail recovery in the reconstructed spectrum. Extensive experiments on two benchmark datasets demonstrate the effectiveness of our method in enhancing reconstruction accuracy and spectral fidelity.

Xinxin Li, Jianjun Liu
Asymmetric Dual-Teacher Guided Knowledge Distillation for HSI-SR with Reconstructed Features

Hyperspectral image super-resolution (HSI-SR) aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts while preserving spectral integrity. Existing knowledge distillation (KD) methods predominantly transfer knowledge from a single super-resolution network, which limits the student model's ability to learn multi-stage hierarchical features. To overcome this limitation, we propose an asymmetric dual-teacher KD framework where two specialized teachers guide the student network: the super-resolution teacher network provides the knowledge of feature extraction, and the reconstruction teacher network provides the knowledge of feature reconstruction. Furthermore, we design a Dual Aggregation Transformer U-net (DATU-Net) that is applicable to this framework and to hyperspectral super-resolution. The designed loss function enables the student network to focus on the knowledge of the two teacher networks. We verified the proposed network on two datasets and showed that our knowledge distillation framework is superior to the latest methods, and its effectiveness was further confirmed through ablation experiments.

Ziqi Zhang, Jianjun Liu
Gradient-Based Multi-focus Image Fusion with Focus-Aware Saliency Enhancement

Multi-focus image fusion (MFIF) aims to yield an all-focused image from multiple partially focused inputs, which is crucial in applications covering surveillance, microscopy, and computational photography. However, existing methods struggle to preserve sharp focus-defocus boundaries, often resulting in blurred transitions and loss of focused detail. To solve this problem, we propose an MFIF method based on significant boundary enhancement, which generates high-quality fused boundaries while effectively detecting focus information. In particular, we propose a gradient-domain-based model that can obtain initial fusion results with complete boundaries and effectively preserve the boundary details. Additionally, we introduce Tenengrad gradient detection to extract salient features from both the source images and the initial fused image, generating the corresponding saliency maps. For boundary refinement, we develop a focus metric based on gradient and complementary information, integrating the salient features with the complementary information across images to emphasize focused regions and produce a high-quality initial decision result. Extensive experiments on four public datasets demonstrate that our method consistently outperforms 12 state-of-the-art methods in both subjective and objective evaluations. The source code is available at https://github.com/Lihyua/GICI.
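
The Tenengrad focus measure named above is the squared Sobel gradient magnitude; a small NumPy/SciPy sketch is shown below (the local averaging window and the simple per-pixel decision rule are assumptions, not the paper's full focus metric):

```python
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def tenengrad_map(gray: np.ndarray, window: int = 7) -> np.ndarray:
    """Per-pixel Tenengrad focus measure: squared Sobel gradient magnitude,
    averaged over a local window to produce a focus/saliency map."""
    gx = sobel(gray.astype(np.float64), axis=1)
    gy = sobel(gray.astype(np.float64), axis=0)
    return uniform_filter(gx ** 2 + gy ** 2, size=window)

# A simple (assumed) initial decision map: pick, per pixel, the sharper source image.
img_a = np.random.rand(256, 256)
img_b = np.random.rand(256, 256)
decision = tenengrad_map(img_a) >= tenengrad_map(img_b)   # True where image A is in focus
```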

Haoyu Li, Xiaosong Li
OME-Net: Optimization-Inspired Multi-domain Enhanced Network for Image Compressed Sensing Reconstruction

Traditional Compressive Sensing (CS) image reconstruction methods suffer from high computational costs and low reconstruction quality, so deep learning models are widely used to achieve non-linear projection for better reconstruction. Recent CS unfolding networks combine the advantages of deep learning and traditional optimization methods, but existing methods are still limited by single-domain information flow within the unfolding network, leading to information loss during image-to-image mapping. This paper proposes an Optimization-inspired Multi-domain Enhanced Network (OME-Net) based on multi-domain collaboration and frequency-domain enhancement. Each stage of OME-Net has two parts: a gradient descent module and a proximal operator. Rather than adopting a pure gradient descent formula, the proposed OME-Net uses a multi-domain gradient descent module to synchronously extract multi-domain information for feature complementarity. The proximal operator is approximated by a frequency-domain guided multi-resolution reconstruction architecture that enhances features in the Fourier domain and fuses features to retain high-frequency details. Experimental results show that the proposed OME-Net significantly outperforms several traditional methods and deep learning methods on the image CS reconstruction task.

Ying Ma, Lijun Zhao

Compression, Transmission, Retrieval

Frontmatter
MARSNet: Scalable Deep Coding of LiDAR Point Clouds via Multimodal and Residual Learning

The rapid adoption of LiDAR sensors in autonomous driving has led to an explosion of LiDAR point cloud (LPC) data, posing substantial challenges for storage and transmission. To address the complexity of large-scale and spatially non-uniform LiDAR point clouds, we introduce a new multimodal and residual-driven scalable framework (MARSNet) for LiDAR point cloud compression (LPCC). Our MARSNet integrates an end-to-end deep network with a residual-aware compression module that leverages multiple modalities. By aligning and jointly encoding point cloud, depth, segmentation, and residual information, the proposed approach generates ultra-low-bit latent representations while preserving fine-grained geometric details. Extensive experimental validations show that MARSNet consistently outperforms 16 advanced LPCC models, achieving superior reconstruction quality at ultra-low bitrates.

Yanji Huang, Runnan Huang, Jianlong Zhou, Yingqi Zhuo, Yanshan Li, Miaohui Wang
Accelerating Learned Video Compression via Low-Resolution Representation Learning

Learned video compression achieves high compression ratios but often suffers from low speeds due to model complexity and high-resolution spatial operations. In this work, we propose an efficiency-optimized framework that emphasizes low-resolution representation learning to accelerate inference. Specifically, we reduce the resolution of reused inter-frame propagated features (including those from I-frames) and employ joint I/P-frame training to enhance feature interaction. Our method efficiently exploits multi-frame priors for parameter prediction with minimal additional decoding computation. Furthermore, we revisit the Online Encoder Update (OEU) strategy to boost compression performance without sacrificing decoding efficiency. Overall, our framework significantly improves the trade-off between compression efficiency and inference speed, achieving performance comparable to VTM-LDP. Compared to DCVC-HEM, it delivers a similar compression ratio while offering 3× faster encoding and 7× faster decoding, decoding 1080p frames in under 100 ms on an RTX 2080Ti.

Zidian Qiu, Zongyao He, Zhi Jin
Optical Flow-Driven Fast CU Partition for Inter Prediction in Versatile Video Coding

The new generation video coding standard, H.266/VVC, introduces the quad-tree nested multi-type (QTMT) block partitioning structure and multiple inter coding modes, significantly improving coding efficiency but also increasing encoding time. To address this, we propose a coarse-to-fine fast partition decision (FPD) algorithm that collects both temporal and spatial information for inter CU partitioning. FPD first leverages co-located similarity between the current CU and its counterpart in the reference frame to capture global motion. High similarity indicates static regions, allowing early pruning of partition candidates. For CUs with low similarity, indicating complex local motions, we introduce a machine learning-based approach. Specifically, we extract temporal optical flow and spatial features (e.g., edges and gradients) to train a LightGBM classifier to predict the partition direction and skip the horizontal/vertical directions in advance. Experiments conducted under the common test condition of H.266/VVC demonstrate that our proposed FPD achieves a 37% runtime saving with only 0.99% coding performance loss, significantly surpassing state-of-the-art methods.
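
The abstract describes training a LightGBM classifier on temporal optical-flow and spatial edge/gradient features to predict the partition direction; a minimal sketch on synthetic stand-in features is given below (the feature names, labels, and confidence thresholds are assumptions):

```python
import numpy as np
from lightgbm import LGBMClassifier

# Synthetic stand-ins for per-CU features: optical-flow statistics plus edge/gradient statistics.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 6))          # e.g. [flow_mean, flow_var, grad_h, grad_v, edge_h, edge_v]
y = (X[:, 2] > X[:, 3]).astype(int)      # 0 = favour horizontal splits, 1 = favour vertical splits

clf = LGBMClassifier(n_estimators=100, max_depth=6, learning_rate=0.1)
clf.fit(X[:4000], y[:4000])

# At encoding time, skip the less likely direction only when the classifier is confident.
proba = clf.predict_proba(X[4000:])[:, 1]
skip_vertical = proba < 0.2
skip_horizontal = proba > 0.8
print("accuracy:", clf.score(X[4000:], y[4000:]))
```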

Junhao Jiang, Shuangxing Tian, Dandan Ding, Weiwei Xu
Semantic Maintained Video Compression by Background Blurring in Surveillance Scenarios

This paper proposes a novel surveillance video compression framework that does not modify the encoder and decoder. By introducing background blurring before encoding, the method significantly enhances compression efficiency and the signal quality of semantic regions. Background blurring is adopted as a preprocessing step to remove background texture and preserve the signal of semantic regions (the foreground). This preprocessing yields higher compression gains and improves the foreground signal simultaneously. This paper also presents the first quantitative model relating compression gain to ROI area ratio and blurring degree, providing a theoretical basis for our approach's compression capability. Additionally, a video caching scheme is proposed to temporarily store original videos at the camera end, enabling lossless video retrieval in emergencies as a supplementary measure. Extensive experimental results demonstrate our method's effectiveness and efficiency. At equivalent bitrates, the average PSNR of semantic regions increases by about 1.22 dB. Our approach presents a simple but highly efficient solution for surveillance video compression without changing the encoder and decoder.

Wenpeng Cui, Xinwei Zheng, Hongming Zhang, Wei Zeng
Learning Based Fast Coding Unit Decision for Video-Based Point Cloud Compression

The Moving Picture Experts Group (MPEG) standardized Video-based Point Cloud Compression (V-PCC) is an emerging coding standard for 3D dynamic point clouds. V-PCC projects point clouds into geometry and attribute videos and uses Versatile Video Coding (VVC) as a video encoder to improve compression efficiency, but this also results in huge coding complexity. To reduce the coding complexity, we propose a fast Coding Unit (CU) decision algorithm based on a Support Vector Machine (SVM) for VVC coding of geometry and attribute videos in V-PCC. First, we extract different features based on CU types, including texture features, coding features, inter features, and geometric features. Second, we train discriminators for various sizes of CUs and select a different weight factor for each discriminator to achieve a trade-off between coding complexity and Rate-Distortion (RD) performance. The experimental results show that the proposed fast decision method reduces the complexity by 27.49% with Bjøntegaard Delta Bit Rate (BDBR) increases of 1.23% and 1.36% compared to the anchor VVC-based V-PCC.
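
One per-CU-size SVM discriminator of the kind described can be sketched as follows; the synthetic features, the early-termination label, and the class-weight trade-off are assumptions used only for illustration:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic stand-ins for per-CU features (texture variance, coding cost, inter residual, geometry occupancy).
rng = np.random.default_rng(1)
X = rng.normal(size=(3000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # 1 = split the CU further, 0 = terminate early

# One discriminator per CU size; the class weights bias errors toward "split"
# to limit rate-distortion loss at the cost of skipping fewer partitions.
svm = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 2.0}))
svm.fit(X[:2500], y[:2500])
print("terminate early on %d of %d CUs" % ((svm.predict(X[2500:]) == 0).sum(), 500))
```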

Lewen Fan, Yun Zhang

Computational Imaging

Frontmatter
Leveraging a Dual-Learning Methodology Based on Degradation Modeling and Fractional Fourier Image Transformer for Light Field Image Super-Resolution

Light Field Super-Resolution (LFSR) aims to reconstruct high-resolution (HR) light field images from their low-resolution (LR) counterparts by exploiting multi-view image information, restoring high-frequency details more effectively while preserving the geometric structure of the scene. Nevertheless, prevailing methods struggle to capture the long-range spatial and angular dependencies inherent in light field data, as well as their high-frequency spectral characteristics; moreover, high-quality paired training data for real-world scenarios remain scarce. To address these challenges, this paper presents DFFIT (Dual learning and Fractional Fourier Image Transformer), a novel LFSR framework that integrates frequency-domain analysis with a dual-learning strategy grounded in degradation modeling. We introduce the Fractional Fourier Image Transformer (FrIT), which combines the fractional Fourier transform (FrFT) with Transformer-based long-range dependency modeling, capturing frequency-specific features while ensuring cross-view consistency. Additionally, our dual-learning framework generates diverse LR training samples by emulating real-world degradation processes, thereby narrowing the domain gap between synthetic and real-world data. Experimental results verify that the proposed method achieves remarkable performance in enhancing the resolution of light field images.
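
The following toy PyTorch block hints at the FrIT idea of pairing a frequency branch with attention-based long-range modeling; for simplicity it substitutes an ordinary 2-D FFT with a learnable complex gain for the fractional Fourier transform and ignores the angular (multi-view) dimension entirely.

```python
# Sketch: frequency-aware Transformer block (loosely in the spirit of FrIT).
# The FrFT is approximated by a plain FFT with a learnable complex filter;
# this is an assumption for illustration, not the paper's module.
import torch
import torch.nn as nn

class FreqTransformerBlock(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        # Learnable per-channel complex gain applied in the frequency domain.
        self.freq_filter = nn.Parameter(torch.randn(channels, 2) * 0.02)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                             # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Frequency branch: filter the spectrum, return to the spatial domain.
        spec = torch.fft.fft2(x, norm="ortho")
        gain = torch.view_as_complex(self.freq_filter.contiguous()).view(1, c, 1, 1)
        freq_feat = torch.fft.ifft2(spec * gain, norm="ortho").real
        # Spatial branch: long-range dependencies via self-attention over pixels.
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        attn_feat = self.norm(attn_out).transpose(1, 2).reshape(b, c, h, w)
        return x + freq_feat + attn_feat

block = FreqTransformerBlock(channels=32)
print(block(torch.randn(2, 32, 16, 16)).shape)         # torch.Size([2, 32, 16, 16])
```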

Haiyang Liu, Jian Ma, Sheng Chen, Dong Liang, Linsheng Huang, Rui Xu
Video Stabilization Based on MeshFlow Motion Model in Dynamic and Complex Scenes

Due to both internal and external factors, fixed surveillance cameras often suffer from shakiness. In dynamic and complex scenes, frequent discontinuous depth variations and large moving foreground objects produce multi-plane motion, which can cause video stabilization algorithms to misjudge local planar motion as global camera shakiness, resulting in stabilization failure or degraded performance. To address this problem, we propose a video stabilization algorithm based on the MeshFlow motion model. First, we propose a shakiness detection method and corresponding rules, which enable the stabilization algorithm to process only shaky frames and thus improve computational efficiency. Then, during motion estimation, we divide each frame into mesh cells and construct a sparse motion field from the motion vectors at the mesh vertices to extract the camera’s shakiness trajectory. Finally, we apply Kalman filtering for trajectory smoothing and use motion compensation to generate the stabilized video. Experimental results show that the stabilized video achieves a PSNR improvement of 30% over the original video, only 0.23 dB lower than the SOFT algorithm, while the processing speed reaches 32 frames per second, 78% faster than SOFT, thereby meeting the requirements of practical applications.
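
A minimal sketch of the trajectory-smoothing step: a constant-position Kalman filter applied to a 1-D accumulated shakiness trajectory, whose difference from the raw trajectory gives the per-frame compensation. The per-vertex MeshFlow field and the shakiness-detection rules are omitted here.

```python
# Sketch: smoothing an accumulated shakiness trajectory with a simple 1-D
# Kalman filter, then deriving per-frame compensation. The trajectory here is
# a single global translation per axis, not the paper's per-vertex mesh field.
import numpy as np

def kalman_smooth(traj, q=1e-3, r=0.25):
    """Constant-position Kalman filter over a 1-D trajectory (q/r are assumed noise levels)."""
    x, p, out = traj[0], 1.0, []
    for z in traj:
        p += q                       # predict step: grow the state uncertainty
        k = p / (p + r)              # Kalman gain
        x += k * (z - x)             # update with measurement z
        p *= (1 - k)
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(0)
raw_x = np.cumsum(rng.normal(0, 1.5, size=120))    # jittery accumulated x-trajectory
smooth_x = kalman_smooth(raw_x)
compensation = smooth_x - raw_x                    # per-frame warp offset along x
print(compensation[:5])
# Each frame would then be warped by its (dx, dy) compensation,
# e.g. with cv2.warpAffine, to produce the stabilized output.
```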

Jun Liu, Hao Ning, Jing Huang, Yingjie Xia, Qun Xie, Jun Zhou, Jinping Li
Dual-Edge Consistency Constrained Unfolding Network for Depth Map Super-Resolution

Recently, several Depth Map Super-Resolution (DMSR) methods have incorporated depth edge prediction as auxiliary guidance in the optimization model, yielding dual-task-driven unfolding networks for refined depth edges. Nevertheless, these approaches either omit an explicit dual-edge consistency constraint or fuse color edges with depth edges only once, which limits their generalization ability and performance. To this end, we reformulate DMSR as a triple-task optimization model explicitly constrained by dual-edge consistency. Following the Alternating Direction Method of Multipliers (ADMM), the proposed model is cast as iterative sub-optimizations for the color-edge update, depth-edge update, depth-map update, and augmented Lagrange multiplier update, which are then unfolded into an interpretable ADMM network. Within this network, we integrate learnable modules into the initial purely formula-driven expansion, enabling high-throughput information transmission and thereby enhancing the network’s representational power. Extensive experiments demonstrate that the proposed method achieves better reconstruction results than several existing DMSR methods.
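
The sketch below unrolls a heavily simplified ADMM-style stage in PyTorch, with learned proximal modules for the depth and edge updates and a learnable penalty weight; the paper's triple-task, dual-edge-consistency formulation is far richer, so treat this only as a shape-level illustration.

```python
# Sketch: one stage of an unfolded ADMM-style network for depth map
# super-resolution. This toy version keeps a depth variable, one auxiliary
# edge variable, and a scaled multiplier; everything here is an assumption.
import torch
import torch.nn as nn

class ADMMStage(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.prox_depth = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                        nn.Conv2d(ch, 1, 3, padding=1))
        self.prox_edge = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                       nn.Conv2d(ch, 1, 3, padding=1))
        self.rho = nn.Parameter(torch.tensor(0.5))      # learnable penalty weight

    def edge_op(self, d):
        # Horizontal finite difference as a stand-in edge operator.
        return d - torch.roll(d, shifts=1, dims=-1)

    def forward(self, depth, edge, mult, lr_up):
        # Depth update: gradient step on the data term + learned proximal refinement.
        depth = depth - 0.1 * (depth - lr_up) + self.prox_depth(depth)
        # Edge update: learned proximal step on the constrained variable.
        edge = self.prox_edge(self.edge_op(depth) + mult)
        # Scaled multiplier update enforcing edge = edge_op(depth).
        mult = mult + self.rho * (self.edge_op(depth) - edge)
        return depth, edge, mult

lr_up = torch.rand(1, 1, 64, 64)                 # bicubic-upsampled LR depth (dummy)
depth, edge, mult = lr_up.clone(), torch.zeros_like(lr_up), torch.zeros_like(lr_up)
stages = nn.ModuleList([ADMMStage() for _ in range(4)])
for stage in stages:
    depth, edge, mult = stage(depth, edge, mult, lr_up)
print(depth.shape)                               # torch.Size([1, 1, 64, 64])
```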

Hao Ren, Lijun Zhao, Jinjing Zhang, Huihui Bai, Anhong Wang

Computer Graphics and Visualization

Frontmatter
Isotropic Remeshing with Inter-angle Optimization

As an important metric for mesh quality evaluation, isotropy is valuable for applications such as texture UV-mapping, physical simulation, and discrete geometric analysis. Classical isotropic remeshing methods adjust vertices and edge lengths, but exhibit limitations in input-data sensitivity, geometric consistency control, and convergence speed. In this paper, we propose an improved isotropic remeshing solution that optimizes inter-angles during mesh editing to enhance shape control and accelerate convergence. Its advantage lies in predicting the impact of edge-length adjustments on subsequent optimization by monitoring angle transformations, avoiding inefficient edits that may cause performance fluctuations and thereby improving efficiency. Experiments demonstrate that the proposed method effectively improves the overall efficiency of mesh optimization. (The code has been released at Isotropic-Remeshing-InterAngle.)
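
A small sketch of an inter-angle criterion: before splitting an edge, predict the resulting interior angles and only apply the edit if the worst angle moves toward a target band. The band and the benefit rule are illustrative; the released Isotropic-Remeshing-InterAngle code defines the actual criteria.

```python
# Sketch: an inter-angle check before an edge split in isotropic remeshing.
# Target band and "predicted benefit" rule are assumptions for illustration.
import numpy as np

def triangle_angles(a, b, c):
    """Interior angles (radians) of triangle abc."""
    def ang(p, q, r):                      # angle at vertex p
        u, v = q - p, r - p
        cosv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(cosv, -1.0, 1.0))
    return np.array([ang(a, b, c), ang(b, c, a), ang(c, a, b)])

def split_improves_angles(a, b, c, lo=np.deg2rad(35), hi=np.deg2rad(85)):
    """Predict whether splitting edge ab at its midpoint keeps angles closer to [lo, hi]."""
    m = 0.5 * (a + b)
    before = triangle_angles(a, b, c)
    after = np.concatenate([triangle_angles(a, m, c), triangle_angles(m, b, c)])
    worst = lambda angs: np.minimum(angs - lo, hi - angs).min()
    return worst(after) > worst(before)    # only edit if the worst angle improves

a, b, c = np.array([0.0, 0, 0]), np.array([2.0, 0, 0]), np.array([1.0, 0.3, 0])
print(split_improves_angles(a, b, c))      # True for this thin, elongated triangle
```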

Hanbing Zheng, Chenlei Lv
AlignMR: Design of a Home Yoga Self Learning System Based on MR Technology

Yoga practice doesn’t require costly equipment, and learners can teach themselves by watching online video tutorials. However, to keep the screen in view during practice, learners often disrupt their movement balance, which affects execution and hinders learning. Moreover, improper yoga postures may cause negative effects. This paper introduces a home-based self-practice yoga system using Mixed Reality (MR) technology and the OpenPose framework to address disrupted movement postures during yoga learning. By providing an instructor video interface that tracks the user’s head movements, the system helps users focus on both the screen and their physical movements without distraction. It overlays real-time captured full-body user postures with instructional videos, allowing users to visually compare their movements with standard postures and correct errors, thus addressing the lack of feedback during self-practice. Research shows that this design enables learners to perform technical movements more smoothly and accurately, demonstrating its wide applicability and high practicality.
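
One way the posture comparison could work, as a hedged sketch: compute joint angles from 2-D keypoints (an OpenPose-like layout is assumed) for both the user and the instructor video, and flag the joints whose angles deviate beyond a tolerance.

```python
# Sketch: joint-angle comparison between a user's pose and an instructor's.
# Keypoint indices, joints, and tolerance are illustrative assumptions.
import numpy as np

JOINTS = {                       # joint name -> (parent, joint, child) keypoint indices
    "right_elbow": (2, 3, 4),
    "left_elbow":  (5, 6, 7),
    "right_knee":  (9, 10, 11),
    "left_knee":   (12, 13, 14),
}

def joint_angle(kps, a, b, c):
    u, v = kps[a] - kps[b], kps[c] - kps[b]
    cosv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.degrees(np.arccos(np.clip(cosv, -1.0, 1.0)))

def pose_feedback(user_kps, ref_kps, tol_deg=15):
    """Return the joints whose angle deviates from the reference beyond tol_deg."""
    issues = {}
    for name, (a, b, c) in JOINTS.items():
        diff = joint_angle(user_kps, a, b, c) - joint_angle(ref_kps, a, b, c)
        if abs(diff) > tol_deg:
            issues[name] = round(diff, 1)
    return issues

rng = np.random.default_rng(0)
ref = rng.uniform(0, 1, size=(25, 2))           # dummy instructor keypoints (25 x 2)
user = ref + rng.normal(0, 0.05, size=(25, 2))  # dummy user keypoints
print(pose_feedback(user, ref))                 # joints to highlight in the MR overlay
```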

Huangyiming Wu, Tuotuo Yang, Xiaona Ma
Bi-IRNet: A Transformer-Based Binaural Impulse Response Generation Guidance Model

In the field of acoustic simulation, widely applied methods rely on the impulse response (IR) and its convolution relationships. However, most deep learning-based approaches for generating IRs are limited to monaural IRs. Some methods for generating binaural IRs require specialized binaural IR datasets, which are costly to collect and difficult to obtain under extreme conditions, such as underwater environments. Therefore, this paper introduces a low-cost and practical technique, Bi-IRNet, which guides various IR generation models to produce corresponding binaural IRs using positional information as input. Our method leverages transformer networks and the Head-Related Transfer Function (HRTF) database to train a binaural IR generation guidance module. This module can be easily embedded into other IR generation models, enabling end-to-end generation of spatially aware binaural IRs. With this module, IR generation models can produce spatial binaural IRs without the need for a binaural IR dataset, significantly reducing the cost of deep learning-based binaural IR generation.
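
At its core, producing a binaural IR from a monaural one amounts to convolving it with direction-dependent left/right head-related impulse responses; the sketch below uses synthetic placeholder HRIRs, whereas Bi-IRNet learns this mapping from an HRTF database and positional input.

```python
# Sketch: mono-to-binaural IR conversion via HRIR convolution. The HRIRs here
# are synthetic placeholders; in practice they would come from an HRTF
# database indexed by source direction, which is what the guidance module learns.
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono_ir, hrir_left, hrir_right):
    """Convolve a mono IR with the two HRIRs for the source direction."""
    return np.stack([fftconvolve(mono_ir, hrir_left),
                     fftconvolve(mono_ir, hrir_right)], axis=0)

fs = 16000
mono_ir = np.exp(-np.linspace(0, 8, fs)) * np.random.default_rng(0).normal(size=fs)

# Placeholder HRIRs: a small interaural delay and level difference.
hrir_l = np.zeros(64); hrir_l[0] = 1.0
hrir_r = np.zeros(64); hrir_r[8] = 0.7         # ~0.5 ms later and quieter on the right

binaural_ir = binauralize(mono_ir, hrir_l, hrir_r)
print(binaural_ir.shape)                        # (2, fs + 63)
```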

Yisheng Zhang, Shiguang Liu
Backmatter
Title
Image and Graphics
Edited by
Zhouchen Lin
Liang Wang
Yugang Jiang
Xuesong Wang
Shengcai Liao
Shiguang Shan
Risheng Liu
Jing Dong
Xin Yu
Copyright Year
2026
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-9533-98-5
Print ISBN
978-981-9533-97-8
DOI
https://doi.org/10.1007/978-981-95-3398-5

The PDF files of this book have been created in accordance with the PDF/UA-1 standard to improve accessibility. This includes support for screen readers, described non-text content (images, graphs), bookmarks for easy navigation, keyboard-accessible links and forms, and searchable and selectable text. We recognize the importance of accessibility and welcome inquiries about the accessibility of our products. For questions or accessibility needs, please contact us at accessibilitysupport@springernature.com.
