
Image and Graphics

13th International Conference, ICIG 2025, Xuzhou, China, October 31 – November 2, 2025, Proceedings, Part II

  • 2026
  • Book

About this book

This three-volume set constitutes the proceedings of the 13th International Conference on Image and Graphics, ICIG 2025, held in Xuzhou, China, from October 31 to November 2, 2025. The 138 full papers in this book were carefully reviewed and selected from 420 submissions. They are organized in the following topical sections: artificial intelligence; machine learning; computer vision; pattern recognition; rendering; image manipulation; graphics systems and interfaces; image compression; shape modeling; biometrics; scene understanding; vision for robotics; scene anomaly detection; activity recognition and understanding; feature selection.

Table of Contents

Frontmatter

Computer Vision and Pattern Recognition

Frontmatter
Unsupervised Image Restoration Using Domain Discriminator with Feature Disentangle

Due to differences in the physical characteristics of various weather conditions, weather features and high-frequency signals remain in reconstructed images obtained by single-weather restoration methods, and such traditional single-weather recovery methods are not suitable for images degraded by multiple weather types. To address this limitation, this paper proposes an unsupervised image restoration network combined with feature disentanglement technology. A multiple encoder-decoder network is first introduced to extract background information or weather features from degraded images. Then, a domain discriminator and a latent feature space consistency loss are used to constrain the extracted background information and weather features, assisting feature disentanglement and reducing residual weather features and high-frequency signals. Subjective and objective results show that the designed restoration network is suitable for images degraded by various weather conditions while maintaining a competitive background image restoration effect.

Chengfang Zhang, Xusong Ran, Ziliang Feng
Cross-Domain Adaptation for Few-Shot 3D Shape Generation

Modern generative models learn from large-scale datasets and generate new samples following similar distributions. However, when training data is limited, deep neural generative networks overfit and tend to replicate training samples. Prior works focus on few-shot image generation to produce reasonable and diverse results using a few target images. Unfortunately, abundant 3D shape data is typically hard to obtain as well. In this work, we make the first attempt to realize few-shot 3D shape adaptation by adapting generative models pre-trained on large source domains to target domains. To relieve overfitting and keep considerable diversity, we propose to maintain the probability distributions of the pairwise relative distances between adapted samples at the feature level and shape level during domain adaptation. Our approach only needs the silhouettes of few-shot target samples as training data to learn target geometry distributions and generate shapes with diverse topology and textures. The effectiveness of our approach is demonstrated qualitatively and quantitatively under a series of few-shot 3D shape adaptation setups.

Jingyuan Zhu, Huimin Ma
SaliencyCLIP-SAM: Bridging Text and Image Towards Text-Driven Salient Object Detection

Many unsupervised salient object detection methods rely heavily on handcrafted visual priors. Existing deep learning-based models require task-specific training and high annotation costs, limiting their generalization to complex scenes. In this paper, we propose a text-driven salient object detection framework that innovatively integrates the rich visual-language semantic information of Contrastive Language-Image Pre-training (CLIP) with the segmentation power of the Segment Anything Model (SAM), requiring no explicit task-specific training or manual labeling. Specifically, to mitigate the negative impact of some ‘global’ patches in the final visual feature from the CLIP visual encoder, we propose a Multi-level Self Cosine-Similarity Correction model (MSCC), which calculates the cosine similarities of multi-level visual features to enhance the local semantic correlation in saliency regions. With the modified final visual feature, we derive coarse salient regions. Then, we introduce a Multi-level Saliency Mask Refinement model, where coarse saliency maps from CLIP generate diverse prompt constraints (points, boxes, masks) for SAM, resulting in multi-level fine-grained saliency masks without manual intervention. Experimental results on public salient object benchmarks demonstrate the effectiveness of the proposed text-driven framework in segmenting salient objects, which provides empirical insights and key breakthroughs for leveraging foundation models in perceptual tasks through text prompt-based methods.

Ying Yuan, Yingying Zhang, Shuai Zhang, Hongjuan Wang
Efficient RGBT Tracking via Early Fusion and Hierarchical Knowledge Distillation

RGBT tracking aims to achieve robust tracking by comprehensively utilizing the features of the visible and thermal infrared modalities. Existing methods achieve multimodal interaction and fusion by designing complex fusion modules. However, due to their adoption of intermediate or late fusion strategies, these methods result in inefficient tracking, which may limit their application in real-time tracking scenarios. In this paper, we propose an efficient single-stream RGBT tracking framework based on Early Fusion and Adaptive hierarchical knowledge Distillation, termed EF-AD. We design an early fusion strategy to improve tracking efficiency and reduce model complexity. Additionally, an adaptive hierarchical knowledge distillation strategy is devised to ensure the tracking performance of the student model. In particular, we design an early fusion module that performs early fusion of multimodal features after the encoding layer. In addition, we record the intermediate layer features and response map features of the teacher network, and compute the tracking losses for the two modal branches separately as indicators for adaptive fusion. Subsequently, the teacher features are adaptively fused to guide the feature learning of the student network in various scenarios. Extensive experiments on two popular RGBT tracking datasets demonstrate that our method significantly reduces computational complexity and the number of parameters while only incurring a slight decrease in accuracy, achieving an inference speed of 72.6 FPS.
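
The adaptive fusion of teacher features guided by per-branch tracking losses can be pictured with a small sketch; the softmax weighting and temperature below are illustrative assumptions, not the authors' exact scheme.

```python
import torch
import torch.nn.functional as F

def adaptive_teacher_distill(student_feat, teacher_rgb, teacher_tir,
                             loss_rgb, loss_tir, temperature=1.0):
    """Fuse two teacher branches with loss-aware weights, then distill.

    loss_rgb / loss_tir: 0-dim tensors holding the tracking losses of the two
    modal branches, used as indicators: lower loss gets a larger weight.
    """
    weights = F.softmax(-torch.stack([loss_rgb, loss_tir]) / temperature, dim=0)
    fused_teacher = weights[0] * teacher_rgb + weights[1] * teacher_tir
    # the student imitates the adaptively fused teacher representation
    return F.mse_loss(student_feat, fused_teacher.detach())
```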

Jinhu Wang, Mai Wen, Zhang Zhang, Liang Wang, Chenglong Li
Human Pose Estimation Method Based on Top-Down View Fisheye Images

Fisheye cameras, with their ultra-wide field-of-view (FOV) characteristics, are highly valuable for applications such as intelligent surveillance requiring large-scale human monitoring. However, human pose estimation in top-down fisheye images faces two critical challenges: first, the radial distribution of human targets caused by wide-angle imaging demands multi-directional annotated data due to the lack of rotational invariance in traditional algorithms; second, the nonlinear distortion of fisheye cameras dynamically varies with spatial positions, creating strong correlations between geometric deformation and imaging locations that existing methods fail to model effectively. To address these limitations, we propose FishPoseNet, a distortion-adaptive pose estimation framework based on YOLOv8Pose. Our framework includes three core innovations: (i) a Geometric Anchor Alignment Module (GAAM) that standardizes human orientations via affine transformations, enabling full-scene directional coverage with single-direction annotated samples; (ii) a position-sensitive Dynamic Distortion Coefficient Estimator (DDCE) establishing continuous mapping from pixel coordinates to distortion coefficients; (iii) an enhanced Distortion-Aware YOLOPose (DA-YOLOPose) network that leverages distortion coefficients to guide feature fusion, improving adaptability to nonlinear deformations. Experimental results on a self-built top-down fisheye surveillance dataset demonstrate significant improvements, offering an innovative solution for high-precision human pose estimation in complex distortion scenarios.

Quanyuan Chen, Mingzhe Zhang, Bin Li, Yingjie Xia, Qun Xie, Jinping Li
On Leveraging Unlabeled Data for Concurrent Positive-Unlabeled Classification and Robust Generation

The scarcity of class-labeled data is a ubiquitous bottleneck in many machine learning problems. While abundant unlabeled data typically exist and provide a potential solution, it is highly challenging to exploit them. In this paper, we address this problem by simultaneously leveraging Positive-Unlabeled (PU) classification and conditional generation with extra unlabeled data. We present a novel training framework to jointly target both PU classification and conditional generation when exposed to extra data, especially out-of-distribution unlabeled data, by exploring the interplay between them: 1) enhancing the performance of PU classifiers with the assistance of a novel Classifier-Noise-Invariant Conditional GAN (CNI-CGAN) that is robust to noisy labels, and 2) leveraging extra data with predicted labels from a PU classifier to help the generation. Theoretically, we prove the optimal condition of CNI-CGAN; experimentally, we conduct extensive evaluations on diverse datasets.
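
For readers unfamiliar with PU learning, a minimal sketch of a standard non-negative PU risk estimator follows (not the paper's CNI-CGAN framework); the sigmoid surrogate loss and the assumption of a known class prior are illustrative choices.

```python
import torch

def nn_pu_risk(logits_p, logits_u, prior=0.5):
    """Non-negative PU risk estimate for a binary classifier.

    logits_p: classifier outputs on labeled positive samples
    logits_u: classifier outputs on unlabeled samples
    prior:    assumed class prior P(y = +1), supplied or estimated separately
    """
    # sigmoid surrogate: l(z, +1) = sigmoid(-z), l(z, -1) = sigmoid(z)
    loss_p_pos = torch.sigmoid(-logits_p).mean()   # positives treated as positive
    loss_p_neg = torch.sigmoid(logits_p).mean()    # positives treated as negative
    loss_u_neg = torch.sigmoid(logits_u).mean()    # unlabeled treated as negative

    # empirical negative risk, clamped at zero to curb overfitting
    neg_risk = loss_u_neg - prior * loss_p_neg
    return prior * loss_p_pos + torch.clamp(neg_risk, min=0.0)
```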

Bing Yu, Ke Sun, He Wang, Zhouchen Lin, Zhanxing Zhu
Multi-view Captioning with Semantic Delta Re-ranking for Zero-Shot Composed Video Retrieval

Composed Video Retrieval (CVR) aims to retrieve videos relevant to a query video while incorporating specific changes described in modification text. For Zero-Shot Composed Video Retrieval (ZS-CVR), current methods utilize vision-language models to convert the query video into a single caption, subsequently merged with the modification text to generate an edited caption for retrieval. However, the modification text does not clearly specify which elements to preserve from the query video, leading to possible misalignment between the edited caption and the target video. Additionally, the final retrieval result should not be determined solely by the similarity between the edited caption and candidate videos but should also incorporate the semantic delta arising from the modification text. To address these issues, we propose the Multi-View Captioning with Semantic Delta Re-Ranking (MCSD) method for ZS-CVR. Specifically, the Multi-View Captioning Module generates captions covering potential semantics of the target video, and the Semantic Delta Re-Ranking Module computes the semantic delta between the original and edited captions to adjust similarity scores and re-rank the retrieval results. Extensive experiments on two benchmarks demonstrate that the proposed MCSD method achieves state-of-the-art performance in ZS-CVR.

Zhixiang Ding, Lilong Liu, Zhenyu Yang, Shengsheng Qian
Feature Decoupling with Modality Modulation for Multimodal Sentiment Analysis

Multimodal sentiment analysis aims to extract and integrate meaningful information from diverse modalities to infer a speaker’s emotional state. Due to the inherent heterogeneity among modalities, most existing approaches decouple modalities into specific and invariant features, which partially capture cross-modal representations. However, in multimodal tasks, certain modalities often dominate the optimization process, leading to the under-optimization of weaker modalities. To address this imbalance, we propose the Modal Modulation Adaptive Fusion Network (MMAFNet), which enhances the learning of valuable information from each modality. Specifically, for modality-specific features, we introduce a gradient modulation strategy that dynamically adjusts learning rates to prioritize weaker modalities. For modality-invariant features, we employ a parameter reset strategy based on inter-modal distances to mitigate overfitting and strengthen feature extraction in underperforming modalities. Additionally, an adaptive fusion module combines modality-specific and invariant features according to their learned weights. Our comprehensive analysis of feature characteristics and tailored modulation strategies effectively alleviates modality imbalance. Extensive experiments on two benchmark datasets demonstrate the superiority of our approach.

Yongbo Wang, Jiaxiang Wang, Aihua Zheng, Wenjuan Cheng, Xiaofei Sheng
A Two-Stage Multimodal Remote Sensing Image Registration Network with Deformation-Refined Affine Transformation

Although remote sensing image registration has been widely studied, accurate registration of multimodal remote sensing images is still challenging due to the geometric deformation and radiometric differences between images of different modalities. We propose a two-stage coarse-to-fine registration network based on the U-Net architecture. The network considers both the registration of large-scale affine transformations and local registration via flow field prediction, using the predicted flow field to fine-tune the affine registration. A Swin-U-KAN network is proposed for affine deformation, which embeds KAN convolution and a Swin Transformer into a U-Net with an encoder-decoder structure. A U-Net with parallel convolution blocks is proposed for the fine registration of flow field prediction. The proposed network is evaluated on SAR-optical and infrared-optical image pairs with large-scale affine deformation and compared with current state-of-the-art registration methods. Experimental results show that the proposed network achieves good registration performance.

Wenqing Wang, Kunpeng Mu, Wenhao Sun, Han Liu
FFTA-Net: A Frequency-Domain Fusion and Temporal Alignment Network for Transmission Line Defect Detection

Transmission line defect detection plays a critical role in ensuring the safe and stable operation of power systems. However, this task faces significant challenges due to the randomness of aerial image capture angles, large variations in object scales, complex backgrounds, and weak features of small objects. To address these issues, we propose FFTA-Net, a novel defect detection network based on Frequency-domain Feature Fusion and Temporal-domain Feature Alignment. In the frequency domain, we perform multi-scale fusion on low-level features from the FPN outputs to enhance the representation of small object features. Meanwhile, a specially designed align-loss function is introduced to align multi-level feature representations in the temporal domain, improving multi-scale consistency and detection accuracy. Additionally, a foreground-aware mechanism is integrated to strengthen foreground features while suppressing background interference, enhancing robustness under complex environments. Extensive experiments on a custom transmission line defect dataset demonstrate that FFTA-Net significantly improves detection accuracy and generalizes well across different object detection models.

Hao Tang, Jianxu Mao, Yaonan Wang, Junlong Yu, Junfei Yi, Zhenyu He, Ziming Tao, Hui Zhang
Hierarchical Transformer for Panoramic Image Inpainting with Comprehensive Attention Module

Recent studies have shown notable progress in panoramic image inpainting using deep learning-based methods, particularly Transformer-based approaches. However, conventional Transformer models mostly establish global relationships between features, which may overlook crucial local details such as texture, edges, and structures, essential for realistic image restoration in inpainting tasks. To address this limitation, we propose a novel Transformer architecture called the Hierarchical Transformer (Hi-former). Unlike conventional Transformers, the Hi-former is designed to help the model capture local and global relationships between features, enhancing the model’s overall structural consistency. In addition to the Hi-former, we introduce a Comprehensive Attention Module (CAM) to adaptively learn the significance of different channels and regions during the inpainting process. By dynamically allocating attention to relevant features, CAM enhances the model’s comprehension of the image, leading to improved accuracy and fidelity in inpainting results. Experimental results on the SUN360 dataset demonstrate the effectiveness of our approach, with an average increase of 3.04 dB in PSNR, a 0.0533 improvement in SSIM, and an average decrease of 1.52 in FID score compared to the SOTA method.

Li Yu, Yanjun Gao, Yihang Yin, Farhad Pakdaman, Moncef Gabbouj
Time-Frequency Domain-Based No-Reference Algorithm for Image Blurriness Evaluation

In high-definition imaging, solving the problem of out-of-focus blurring is one of the core challenges. For example, focus drift in telephoto lenses for distant targets affects subsequent analysis, necessitating a quantitative assessment of blur. Existing methods have drawbacks: subjective assessment lacks accuracy; reference-based objective methods rely on original, clear images, which are often unavailable in practice; and feature-based no-reference methods have limitations and may misjudge complex images. For instance, EMBM is less sensitive to weak edges. Thus, a time-frequency domain no-reference assessment algorithm is proposed, with three core innovations: first, a multi-scale feature extraction model integrating time-domain and frequency-domain features to comprehensively capture edge information across dimensions; second, a principal component analysis feature optimization module for dimensionality reduction and redundancy removal, enhancing key feature representation; finally, a dynamic weight allocation mechanism that specifically increases the weights of weak-edge features, addressing EMBM's neglect of weak edges. Tests on the TID2013 dataset show that its SROCC is 1.39% and 2.44% higher than that of EMBM, respectively.
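
The general recipe described here (time-domain and frequency-domain features, then PCA reduction) can be sketched roughly as follows; the specific features and PCA settings are illustrative assumptions, not the paper's exact design.

```python
import numpy as np
from numpy.fft import fft2, fftshift
from sklearn.decomposition import PCA

def blur_features(gray):
    """Illustrative time- and frequency-domain sharpness features for one image."""
    gy, gx = np.gradient(gray.astype(np.float64))
    grad_mag = np.hypot(gx, gy)                      # time-domain edge strength
    spec = np.abs(fftshift(fft2(gray)))              # magnitude spectrum
    h, w = gray.shape
    cy, cx = h // 2, w // 2
    low = spec[cy - h // 8: cy + h // 8, cx - w // 8: cx + w // 8].sum()
    high = spec.sum() - low
    return np.array([
        grad_mag.mean(),            # average edge strength
        grad_mag.var(),             # edge-strength spread
        high / (low + 1e-8),        # high- to low-frequency energy ratio
    ])

def reduce_features(feature_matrix, n_components=2):
    """PCA step: keep the leading components of stacked per-image feature vectors."""
    return PCA(n_components=n_components).fit_transform(feature_matrix)
```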

Jing Huang, Hao Ning, Yingjie Xia, Qun Xie, Jinping Li
Boosting Portrait Matting with Spatial-Frequency Harmony

Automatic portrait matting methods aim to accurately separate foreground portraits by predicting alpha mattes without auxiliary guidance. Recent portrait matting methods only rely on spatial features for feature extraction, which leads to detail loss and background noise interference. To address these issues, we propose an automatic portrait matting method called HarmonyMatting, based on the harmony of frequency-spatial domain information. Firstly, we employ a Spatial-Frequency Pyramid module (SFP) with a multi-scale frequency band decomposition strategy to enhance the model’s multi-granularity perception of the foreground target. Secondly, we design a Harmony Transformer Block (HTB) to model semantic features in the spatial domain while realizing background noise suppression in the frequency domain. Thirdly, we introduce a Cross-domain Feature Reconstruction module (CFR) to facilitate bidirectional information transfer between the frequency and spatial domains, reducing the detail loss during upsampling. Extensive experiments on Human-2K and P3M-10k datasets demonstrate that HarmonyMatting can effectively suppress background noise interference while preserving rich fine details.

Rongsheng Luo, Changxin Gao, Nong Sang
w+: Extending Classifier-Free Guidance in Diffusion Models for Real Image Inversion

The latest advancements in text-guided diffusion models have revealed powerful image processing capabilities. However, applying these methods to real images requires inverting the images into the domain of diffusion models. The accuracy of this inversion process significantly impacts the final editing results and the preservation of the core content of the source image. Achieving faithful inversion while preserving the inherent suppression capability of diffusion models remains a challenge, especially when the image contains intricate details. Existing reconstruction methods have made strides, but they still fail to capture the precise spatial context and preserve the inherent suppression capability of diffusion models. In this paper, we introduce a more accurate inversion technique, w+, that enables faithful reconstruction of real images. Moreover, w+ intuitively extends the inherent ability of diffusion models to perform suppression on real images using negative prompts, a capability not achieved by existing reconstruction methods. Compared to state-of-the-art inversion techniques, our w+ inversion, based on the publicly available Stable Diffusion (SD) model, is extensively evaluated for image inversion and extends the inherent suppression capability of SD to real images. Our code will be publicly released.

Kaihua Li, Yang Cheng, Senmao Li, Yuhan Liu, Yaxing Wang, Boqian Li, Gen Xu, Wanming Hao
Self-supervised Surface Defect Inspection Method via Alignment of Content and Style

Deep learning surface defect inspection has demonstrated remarkable potential in industrial quality control. Most existing works transfer knowledge to defect inspection tasks, where this knowledge is acquired from natural datasets by self-supervised learning (SSL). However, these methods face dual challenges: (1) data scarcity in defect datasets leading to unstable training and overfitting, and (2) the minimal intermediate domain between natural and defect domains, which results in a significant domain gap in content and style. To overcome these challenges, this paper proposes AlCS, a self-supervised surface defect inspection method via alignment of content and style. AlCS achieves content and style alignment between defect and natural domains through style transfer data augmentation, enhancing model generalization and stabilizing the training process. Our approach consists of two parts. First, the Style Transfer Module (STM) constructs samples that exhibit intermediate-domain characteristics; these samples enlarge the intermediate domain to bridge the domain gap. Second, a content-style domain adaptation architecture, comprising the Content Alignment Module (CAM) and the Style Alignment Module (SAM), decouples and hierarchically aligns the content and style features of the constructed samples. Experiments on several industrial defect datasets demonstrate that our approach outperforms existing SSL methods in surface defect inspection tasks.

Guojian Ye, Zeyu Zhang, Biaohua Ye, Jianhuang Lai
Improved UAV Aerial Vehicle Detection Algorithm Based on YOLOv11n

To solve the problems of low detection accuracy and poor real-time performance caused by the small size, dense distribution, and complex backgrounds of targets in UAV aerial vehicle detection scenes, this paper proposes an improved UAV aerial vehicle detection algorithm based on YOLOv11n. Firstly, a detection head is added to the high-resolution feature layer of the YOLOv11n backbone network to reduce the loss of small-target information caused by the lower resolution after downsampling and to improve the detection accuracy of small targets. Secondly, the Inner-IoU loss is used to replace the traditional IoU loss, with an auxiliary box used to accelerate the convergence of the regression process. In addition, the Multi-scale Attention Aggregation Module (MSAA) is introduced to fuse multi-scale feature information using spatial and channel attention mechanisms, which improves multi-scale spatial and channel fusion while reducing background interference. Experiments show that the improved algorithm achieves 38.302% on the VisDrone dataset, which is 5.735% higher than the benchmark model.
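
The Inner-IoU replacement mentioned above computes IoU on auxiliary boxes scaled around the box centers; a minimal sketch of that idea, assuming (x1, y1, x2, y2) box format and an illustrative scaling ratio.

```python
import torch

def inner_iou(pred, target, ratio=0.75, eps=1e-7):
    """Inner-IoU sketch: IoU between auxiliary boxes scaled around each box center.

    pred, target: tensors of shape (N, 4) in (x1, y1, x2, y2) format (assumed).
    ratio < 1 shrinks the auxiliary boxes, which can speed up regression convergence.
    """
    def scaled(box):
        cx, cy = (box[:, 0] + box[:, 2]) / 2, (box[:, 1] + box[:, 3]) / 2
        w, h = (box[:, 2] - box[:, 0]) * ratio, (box[:, 3] - box[:, 1]) * ratio
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    px1, py1, px2, py2 = scaled(pred)
    tx1, ty1, tx2, ty2 = scaled(target)
    iw = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    ih = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    return inter / (union + eps)
```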

Ke Zeng, Wangsheng Yu, Xianxiang Qin, Jinling Han, Zhiqiang Hou, Sugang Ma
NSFM: Normalization-Based Style Feature Elimination Method for Remote Sensing Scene Classification

In remote sensing scene images, the complexity of details amplifies intra-class diversity and inter-class similarity. Cross-scene classification enhances the model's discriminative capability by learning cross-domain data correlations, thereby effectively mitigating the impacts caused by inter-class similarity and intra-class diversity. Domain generalization (DG), as a pivotal technique in cross-scene classification, has been widely studied by scholars due to its strong applicability. In this paper, a normalization-based style feature elimination method is proposed. The original features and normalized features are decomposed into their respective style and content components. Content features typically play a decisive role in classification tasks, while style features are slightly redundant, so we set the elimination of style features as our goal. However, style removal inevitably incurs content alteration due to the ill-defined boundary between content and style. We address this challenge through frequency-domain analysis based on the Discrete Fourier Transform, where amplitude and phase are explicitly decoupled into style and content features through mathematical analysis. The adaptable network eliminates style features to adjust the style-to-content ratio in images for learning domain-invariant features. Experimental results and systematic analyses validate the effectiveness of the proposed method in DG for scene classification tasks.

Lifan Ji, Jianjun Liu
Hyperspectral Image Super-Resolution via Degradation-Aware Learning and Frequency-Domain Feature Enhancement

The hyperspectral image (HSI) super-resolution task aims to reconstruct high-quality HSI from low-resolution hyperspectral images and high-resolution multispectral or RGB images. However, existing methods typically assume fixed spatial degradation, overlooking the complex degradation processes in real-world scenarios, which limits model generalization. To address this, we propose a novel semi-blind HSI fusion framework based on spatial degradation awareness and Spatial-Frequency enhancement. Specifically, contrastive learning is integrated into the network to extract degradation representation from low-resolution HSI with unknown spatial degradation, enabling accurate estimation and compensation of spatial distortions introduced during imaging. Moreover, a cross-domain feature enhancement strategy is employed to combine spatial details from high-resolution images with those recovered via degradation-aware processing in both spatial and frequency domains, enhancing the representation of fine-grained structures. Experimental results on multiple benchmark datasets demonstrate that the proposed method achieves superior performance in handling unknown spatial degradation.

Lian Cheng, Jianjun Liu
Nighttime Object Detection with Contextual Auxiliary Learning

Nighttime object detection is challenging because the low-light illumination and strong noise usually degrade the performance of feature learning. Some studies attempt to improve object detection at nighttime by enhancing the visual quality of input images, but this does not always improve the performance. In this work, we propose a simple strategy to improve the performance of nighttime object detection by directly strengthening feature representations of nighttime scenes with multiple context-related auxiliary learning tasks. Firstly, separate object localization and categorization tasks are used to refine the scene’s spatial information and object features. Meanwhile, a co-occurrence prediction task is designed to capture context relationships among objects. Finally, we employ a low-pass noise filter module to alleviate noise interference in feature learning. Experimental results, evaluated on nighttime and rainy-night scenes, demonstrate that the proposed method significantly improves the performance of nighttime object detection when used with typical object detection frameworks.

Xiangrui Hu, Xian-Shi Zhang, Yong-Jie Li, Kai-Fu Yang
Learning Supplementary Information for First-Person Perception Referring Expression Comprehension

Referring expression comprehension (REC) from a first-person perspective plays a crucial role in embodied intelligence applications such as assistive vision systems and mobile robotics. This task seeks to accurately localize target objects in egocentric visual scenes based on natural language descriptions. However, when confronted with complex visual scenes and lengthy descriptions, previous REC methods tend to focus exclusively on the primary target. They overlook the rich auxiliary semantic cues in the expression, resulting in suboptimal localization performance. To address this limitation, we propose a novel weakly supervised framework, termed Learning with Supplemental Information (LSI), that leverages unlabeled auxiliary object information in textual descriptions to enhance localization accuracy. Specifically, we utilize a pre-trained visual grounding model to generate pseudo-labels for auxiliary objects and model their visual and semantic relationships with the main referent. These pseudo-supervised signals help the model build a richer contextual understanding by aligning auxiliary and primary cues. Additionally, contrastive learning is introduced to enhance the discriminability of auxiliary information. Experimental results on the RefEgo dataset validate the effectiveness of our approach, with improvements of 2.7% in mAP@50 and 3.0% in mIoU.

Zetao Du, Jianhua Yang, Yan Huang, Liang Wang, Feng Chen, Zhepeng Wang
MFMamba: Multiple Fusing Mamba Network for Hyperspectral Image Pansharpening

Hyperspectral image pansharpening aims to restore high-spatial-resolution hyperspectral data by integrating the fine spatial details of a panchromatic image with the rich spectral information from a low-resolution hyperspectral cube. Existing methods typically perform feature fusion either at low resolution or during upsampling, which often leads to incomplete multi-level feature integration. Moreover, the high dimensionality of hyperspectral data—with its numerous spectral bands—demands abundant feature channels, yet most prior works fail to fully exploit these channels, resulting in underutilized spectral information and suboptimal fusion. To address these challenges, we propose MFMamba, a novel network employing optimized multi-stage feature fusion. At low resolution, MFMamba extracts and fuses features across multiple depths; during reconstruction, it further integrates high-resolution features, ensuring thorough utilization of both spectral and spatial cues. At the core of our architecture is the HyperMamba Block, a tri-branch module designed to disentangle and refine spectral and spatial features. Additionally, we incorporate a channel attention mechanism within each block to uniformly enhance informative responses across all channels, mitigating incomplete channel activation. Extensive experiments on three benchmark hyperspectral datasets demonstrate that MFMamba outperforms state-of-the-art pansharpening methods in both quantitative metrics and visual fidelity.
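
The channel attention mentioned here is not detailed in the abstract; a generic squeeze-and-excitation style block of the kind commonly used for this purpose might look as follows (the reduction ratio is an assumption).

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Generic squeeze-and-excitation style channel attention (illustrative only)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze: global average per channel
            nn.Conv2d(channels, channels // reduction, 1),  # excitation bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # per-channel gate in (0, 1)
        )

    def forward(self, x):
        return x * self.fc(x)                               # reweight each channel response
```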

Zihao Wang, Xiuyi Jia
Point Density Fusion for Multimodal 3D Object Detection

In recent years, with the rapid development of autonomous driving and intelligent transportation, 3D object detection technology has gradually become a research hotspot. However, when target objects are small or occluded, the point cloud data obtained by radar is often insufficient, resulting in incomplete structural and semantic information and leading to missed and false detections. To address this challenge, this paper proposes a multimodal 3D object detection method based on point density fusion (PDFusion). The algorithm introduces a Point Density Information Fusion Module (PDIF) and an Adaptive Gating Circuit Fusion Module (AGCF), utilizing point cloud density features and an adaptive gating fusion module to effectively leverage the geometric features of point clouds and the semantic information from camera images. This approach improves detection accuracy for small and occluded objects. The AGCF module uses a self-attention mechanism to achieve efficient fusion of multimodal features, while the PDIF module further optimizes candidate boxes based on point cloud density information. Experimental results show that the proposed model achieves excellent performance on the KITTI validation set, particularly on Hard samples, where PDFusion reaches 3D detection accuracies of 85.95%, 60.11%, and 71.44% for the car, pedestrian, and bicycle categories, respectively. Additionally, the model demonstrates significant advantages on the Easy and Moderate samples of the KITTI test and validation sets, confirming its effectiveness and generalization in enhancing multimodal 3D object detection accuracy.

Ziyang Peng, Jianxu Mao, Wei He, Caiping Liu, Zhenyu He, Ziming Tao, Yaonan Wang
Temporal Reordering for Video Person Re-identification Based on Feature Reappearance Score

This paper proposes a novel model, named Temporal Reordering Vision Transformer (TRViT), based on a feature reappearance score for video person re-identification. Current methods do not assess whether redundancy or interference is present in a video and treat all videos in the same way, which handles one issue but ignores the other. To address this problem, this paper proposes a novel metric, the Feature Reappearance Score (FRS), to quantitatively evaluate the redundancy degree of sample videos and determine whether a sample should be treated with a redundancy-focused or a disturbance-focused measure. Further, to provide a unified solution to both issues, we propose a temporal reordering method. By reordering the sequence according to different criteria, we can emphasize the extraction of distinctive or common features in the video. Our method is evaluated on two widely used and challenging datasets. The experimental results show that it outperforms the state-of-the-art methods.

Jing Wang, Bingpeng Ma
Multimodal Consistency-Driven Deepfake Detection

The emergence of multimodal deepfake videos demands detection systems that address synchronized audio-visual manipulations. We present AVMCD, a novel framework combining a dual-stream transformer architecture with cross-modal consistency verification. Our solution introduces three key innovations: (1) joint spatiotemporal modeling using Video Vision Transformers (ViViT) for facial dynamics and Audio Spectrogram Transformers (AST) for spectral speech patterns; (2) a synchronization analysis module employing cross-module features to detect audio-visual temporal mismatches; (3) hybrid learning integrating one-class classification with multi-task consistency constraints. The framework overcomes critical limitations in existing approaches by explicitly modeling cross-modal interactions while preventing overfitting to single-modality artifacts. Comprehensive evaluations on FakeAVCeleb demonstrate state-of-the-art performance with 96.1% ACC, surpassing leading audio-visual methods by 12.4% in ACC. This work establishes a new paradigm for multimodal deepfake detection through systematic integration of transformer-based feature fusion and physiological consistency verification.

Li Zhang, Bin Liu, Qi Chu, Nenghai Yu
Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation

Recently, large-scale pre-trained models such as the Segment Anything Model (SAM) and Contrastive Language-Image Pre-training (CLIP) have demonstrated remarkable success and revolutionized the field of computer vision. These vision foundation models effectively capture knowledge from large-scale, broad data with their vast model parameters, enabling them to perform zero-shot segmentation on previously unseen data without additional training. While they showcase competence in 2D tasks, their potential for enhancing 3D scene understanding remains relatively unexplored. To this end, we present a novel framework that adapts various foundation models to the 3D point cloud segmentation task. Our approach involves making initial predictions of 2D semantic masks using different large vision models. We then project these mask predictions from various frames of RGB-D video sequences into 3D space. To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting. We examine diverse scenarios, such as zero-shot learning and limited guidance from sparse 2D point labels, to assess the pros and cons of different vision foundation models. Our approach is evaluated on the ScanNet dataset for 3D indoor scenes, and the results demonstrate the effectiveness of adopting general 2D foundation models for solving 3D point cloud segmentation tasks.
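
The voting-based label fusion step can be illustrated with a short sketch: per-point predictions projected from multiple frames are fused by majority vote (the array layout and ignore value are assumptions).

```python
import numpy as np

def fuse_point_labels(votes, ignore_label=-1):
    """Majority-vote fusion of projected 2D predictions into 3D pseudo labels.

    votes: (num_points, num_frames) integer array; ignore_label marks points
           not visible in a given frame.
    Returns one fused label per point, or ignore_label if no frame saw it.
    """
    fused = np.full(votes.shape[0], ignore_label, dtype=np.int64)
    for i, row in enumerate(votes):
        valid = row[row != ignore_label]
        if valid.size:
            labels, counts = np.unique(valid, return_counts=True)
            fused[i] = labels[np.argmax(counts)]   # most frequent label wins
    return fused
```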

Shichao Dong, Fayao Liu, Rui Yao, Guosheng Lin
Video Domain Incremental Learning for Human Action Recognition in Home Environments

It is significantly challenging to recognize daily human actions in domestic settings due to the diversity and dynamic changes of unconstrained home environments. This spurs the need to continually adapt to various users and scenes. Fine-tuning current video understanding models on newly encountered domains often leads to catastrophic forgetting, where the models lose their ability to perform well on previously learned scenarios. To address this issue, we formalize the task of Video Domain Incremental Learning (VDIL), which enables continual learning across domain shifts while maintaining a fixed set of action classes. While most continual learning research focuses on class-incremental settings, domain-incremental learning remains underexplored in video understanding. In this work, we introduce a benchmark for domain-incremental human action recognition in dynamic home settings, comprising three domain splits: user-based, scene-based, and hybrid. We also propose a simple yet effective baseline that combines replay and reservoir sampling without access to domain labels, designed for memory-constrained, task-agnostic scenarios. Extensive experiments show that our approach consistently outperforms existing continual learning methods across all benchmark splits.

Yuanda Hu, Jiani Hou, Xing Liu, Xiaohua Sun, Weiwei Guo
Wavelet-Based Distillation with Structured Frequency Alignment

Feature-level knowledge distillation is a promising approach for compressing object detectors. It transfers intermediate representations from high-capacity, computationally intensive teacher models to lightweight students with lower inference costs. However, existing methods predominantly emphasize spatial and channel-wise alignment, while largely overlooking the frequency characteristics that can be derived from deep features via spectral decomposition. In this work, we present a frequency-aware knowledge distillation approach that enhances existing frameworks through the integration of wavelet-based feature alignment. Specifically, teacher features are decomposed into multi-level subbands using two-dimensional Haar wavelet transforms, enabling the student to perform subband-wise alignment that captures both high-frequency details and low-frequency semantic cues. Additionally, a relational distillation mechanism operating on the principal subband is employed to model global dependencies and enhance semantic consistency. To evaluate the effectiveness of the proposed approach, we conduct experiments on the Inspection of Power Line Assets Dataset (InsPLAD), a real-world UAV dataset for transmission infrastructure inspection, using various detectors including both two-stage and one-stage models. Building upon the FGD framework, our method yields distilled lightweight student models that consistently achieve 1%–3% gains in average precision (AP) across various detectors. Furthermore, it demonstrates superior detection performance, particularly in handling elongated and structurally complex objects that typically pose challenges for compact models.
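
To make the subband-wise alignment concrete, a hedged sketch of a single-level 2D Haar decomposition with a per-subband MSE distillation term; the multi-level decomposition and the relational term on the principal subband are omitted, and features are assumed to be (B, C, H, W) with even H and W.

```python
import torch.nn.functional as F

def haar_subbands(feat):
    """Single-level 2D Haar transform of a (B, C, H, W) feature map (H, W even)."""
    a = feat[:, :, 0::2, 0::2]
    b = feat[:, :, 0::2, 1::2]
    c = feat[:, :, 1::2, 0::2]
    d = feat[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2          # low-frequency approximation
    lh = (a - b + c - d) / 2          # horizontal details
    hl = (a + b - c - d) / 2          # vertical details
    hh = (a - b - c + d) / 2          # diagonal details
    return ll, lh, hl, hh

def subband_distill_loss(student_feat, teacher_feat):
    """Align student and teacher features subband by subband."""
    return sum(
        F.mse_loss(s, t)
        for s, t in zip(haar_subbands(student_feat), haar_subbands(teacher_feat))
    )
```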

Pengyu Lu, Junfei Yi, Jianxu Mao, Junlong Yu, Shuohao Xiao, Zhenyu He, Yaonan Wang
A Vessel Extraction Method Based on Bilinear Factor Matrix Norm RPCA and TSRG

In X-ray coronary angiography (XCA), accurate vessel extraction is critically important for the diagnosis of coronary artery disease. However, this task remains highly challenging due to the complexity of background structures and the presence of varying motion patterns with different intensities. In this work, a novel method is proposed for vessel layer extraction based on bilinear factor matrix norm robust principal component analysis, and a Laplacian regularization is introduced to enhance the separation effect. For vessel layer images with uneven contrast distribution, we use a two-stage region growing (TSRG) method for vessel enhancement and segmentation. A region growing method is first applied to extract the main branches. Subsequently, an RLF filter is utilized to enhance and reconnect small fragmented segments. These two intermediate outputs are then combined to construct the final binary vessel mask. The proposed method has been validated on both clinical XCA image sequences and a publicly available third-party dataset. Qualitative and quantitative results demonstrate that the proposed approach outperforms several state-of-the-art methods in terms of both accuracy and robustness.
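
The first stage of TSRG (region growing from seeds under an intensity criterion) follows the classic recipe; a minimal sketch for a 2D grayscale image with a fixed intensity tolerance, not the authors' exact implementation.

```python
from collections import deque
import numpy as np

def region_grow(image, seed, tol=15):
    """Grow a region from `seed` by adding 4-connected pixels whose intensity
    stays within `tol` of the seed intensity. Illustrative, not the paper's TSRG."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_val = float(image[seed])
    queue = deque([seed])
    mask[seed] = True
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                if abs(float(image[ny, nx]) - seed_val) <= tol:
                    mask[ny, nx] = True
                    queue.append((ny, nx))
    return mask
```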

Qingwen He, Nannan Zhai, Tongwei Lu, Jinjia Wang
A Progressive Approach to Learn Global and Local Multi-view Features for 3D Visual Grounding

The 3D visual grounding task aims to localize objects in point clouds based on natural language descriptions, playing a significant role in various domains such as autonomous driving and augmented reality. In this task, the view inconsistency between textual observation perspectives and point cloud viewpoints causes view confusion problems that hinder the model’s ability to accurately localize target objects. To address this issue, this paper proposes a progressive multi-view feature approach to supplement point cloud information from different perspectives, which includes sequential-form global multi-view features and vector-form local multi-view features. This method progressively learns multi-view point cloud features within the model while designing explicit interaction between object relative positions and textual descriptions to enhance the model’s comprehension of multimodal information. Furthermore, we introduce a selective state space model as the learning module for sequential global multi-view features, which improves model accuracy while reducing memory consumption and training time. Experimental results demonstrate that the proposed method achieves superior performance over existing state-of-the-art approaches on public datasets.

Ken Yang, Sanyuan Zhao
HG-Ghost: A Lightweight and Accurate Pose Estimation Network for Biometric Recognition

As a key technology capable of extracting structural and behavioral features of the human body, pose estimation is increasingly becoming an important addition and development direction in the field of biometrics. Current human pose estimation networks face two major problems in practical applications: first, the number of parameters is large and the computational overhead is high; second, the detection accuracy is insufficient, especially in complex scenarios such as multi-person scenes, where occlusion can easily lead to keypoint localization bias. Aiming at these two aspects, in this paper we design a high-precision and lightweight deep learning network, HG-Ghost. To solve the occlusion problem and enhance the network's ability to capture regions of interest, we introduce CA, and an online convolutional re-parameterization module is integrated into the Neck structure of the network to reduce the model's memory occupation and training cost. Experimental evaluations on the COCO2017 and Human3.6M datasets show that, compared with the baseline model YOLOv8-Pose, our model reduces the number of parameters by 25%, reduces floating-point operations by 18%, and improves accuracy by 4.5 percentage points, fully demonstrating the significant advantages of the improved model in the field of pose estimation.

Yue Hu, Yongji Liu
AMFlow: Efficient Optical Flow Estimation via Attentional Cost Volume and Matching Initialization

Optical flow, the process of predicting motion fields from image sequences, is a fundamental problem in computer vision. While recent methods (represented by RAFT) leveraging 4D cost volume and iterative refinement have achieved state-of-the-art performance, two critical limitations persist: (i) the simplistic construction method for 4D cost volume fails to preserve semantic information in images, and (ii) the zero-initialized flow estimation necessitates excessive iterations (typically 32 iterations). To address these challenges, we propose AMFlow, a novel framework that integrates an attention-based cost volume and matching initialization. Specifically, AMFlow improves accuracy and efficiency by refining the initial cost volume and leveraging high-quality initial flow fields, achieving competitive or superior results with only 8 iterations, compared to RAFT-style models requiring 32 iterations. Additionally, extensive experimental results show our method maintains a balance between accuracy and computational efficiency regarding both inference time and GPU memory usage. On the Sintel benchmark, our method achieves competitive performance, with 1.19px and 2.67px average end-point error (AEPE) on the clean and final passes.

Ankang Sun, Jinglun Shi, Jiaxuan Lin, Beibei Liu, Guangjun Liao
Fighting Detection Based on Individual Keypoints’ Motion Trajectories and Motion Direction Entropies

Fighting detection helps maintain order and safeguard the lives of people in public. Using video surveillance to detect fighting in public places is a common and effective solution. Conventional detection methods focus on analyzing the angles between joints, striking and kicking actions, etc., of individuals engaged in fighting. These methods predominantly capture the action features of individuals without considering the variation patterns of the motion trajectories during fighting. Moreover, this research faces a significant challenge: many non-fighting behaviors such as hugging, running, and dancing may visually present action features similar to fighting. This similarity makes it difficult to accurately differentiate between fighting and normal behavior in practical applications. In contrast, during fighting, individuals exhibit intense motions and their keypoints' motion trajectories lack obvious regularity, which provides higher discriminability compared to other behaviors. Based on this, we propose an effective and practical detection method that analyzes the motion trajectory patterns of different individuals' keypoints over a period of time to identify the distinct patterns between fighting and normal behaviors. The steps are as follows: firstly, YOLO-Pose is employed to estimate the human pose, obtaining bounding boxes and individuals' keypoints, with the centroid of the bounding box treated as one of the keypoints; secondly, DeepSORT is utilized to track individuals and acquire the motion trajectories of the keypoints; thirdly, variance is used to identify individuals with abnormal speed variations in keypoint motion; finally, the entropy of the motion directions of these abnormal individuals is used to detect fighting behavior. The algorithm achieved 91.1% accuracy and 91.4% specificity on the RWF-2000 dataset and 95.9% accuracy and 95.7% specificity on our custom dataset, with a much lower false detection rate.
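
The final step, the entropy of motion directions, can be written down directly; a small sketch assuming a keypoint trajectory given as per-frame (x, y) positions and a fixed number of direction bins.

```python
import numpy as np

def motion_direction_entropy(trajectory, n_bins=8):
    """Shannon entropy of frame-to-frame motion directions for one keypoint.

    trajectory: (T, 2) array of (x, y) positions over T frames.
    Higher entropy means less regular motion directions, as during fighting.
    """
    deltas = np.diff(trajectory, axis=0)
    deltas = deltas[np.linalg.norm(deltas, axis=1) > 1e-6]   # drop static frames
    if len(deltas) == 0:
        return 0.0
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])          # direction in (-pi, pi]
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```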

Guohua Xin, Hongbao Shi, Bin Li, Quanyuan Chen, Qun Xie, Jinping Li
Design and Experiment of Online Non-destructive Quality Testing and Grading Device for Chicory Sprouts

In response to the low efficiency of manual quality grading of chicory sprouts, high labor costs, and the difficulty of meeting the requirements of industrialized chicory sprout production, an online non-destructive quality testing and grading device for chicory sprouts has been designed, which includes a visual detection module, a weighing detection module, and a grading module. YOLOv8n is selected as the detection network and optimized by 1) replacing the backbone with MobileNetV3 to achieve lightweighting, 2) introducing the CBAM attention mechanism, and 3) optimizing the loss function. Experimental verification of the optimized network shows that the accuracy of detecting surface defects on chicory sprouts reaches 92%. At a sorting speed of 5 per second, the whole machine was tested and verified, and the average grading accuracy reaches 93.5%. The whole machine has the advantages of simple structure, low cost, and high grading efficiency, and can be used for factory production of chicory sprouts.

Chunrui Ren, Qiang Chen, Wei Li, Yongjia Yang, Rongsheng Wei, Zhihuan Zhao, Shenghui Fu, Shuangxi Liu
Exploring Implicit Relations for Fine-Grained Generalized Category Discovery

In practical application scenarios of generalized category discovery (GCD), the data is often fine-grained, e.g., assisting biologists in classifying newly captured insect photographs. However, most existing GCD methods do not focus on the subtle variance among fine-grained data or design models around this characteristic. Motivated by this, we propose a new negative relation steering (NegReS) method for fine-grained GCD, which can better capture the subtle variance by exploring implicit relations among fine-grained data. Considering that generic positive information is not available or reliable, we turn to excavating negative but effective relations, i.e., semantic-wise and instance-wise negative relations, and take full advantage of them to enhance the discriminative ability of the parametric classification in the GCD model. Specifically, semantic-wise negative relations among spatiality-related attention maps are employed to help capture more discriminative regions. Meanwhile, instance-wise negative relations among unlabeled data are excavated to guide the classifier optimization. Extensive experiments on several benchmarks demonstrate the effectiveness of our method.

Jiexi Yan, Xinyi Cheng, Chenghao Xu, Cheng Deng
SPViT-FER: A Sparse Pruning Based Vision Transformer for Facial Expression Recognition

Vision Transformer (ViT) achieves advanced results in facial expression recognition (FER) and consistently surpasses convolutional neural network based methods. However, most ViT based methods calculate self-attention among all patches within the entire facial image, resulting in expensive computational cost and a lack of task specificity. Moreover, existing ViTs are implemented at a single scale and a single dimension, which has intrinsic defects in multi-scale and cross-dimension feature extraction. To address these issues, this paper proposes a sparse pruning ViT based FER method (SPViT-FER), which contains two core components: a Landmarks Guided Token Pruning (LGTP) block and an Inter-Channel Cross-scale Self-Attention (ICCSA) block. Specifically, the LGTP block uses facial landmarks as guidance to activate the informative patches and discard the uninformative ones, which saves computational cost dramatically without performance reduction. The ICCSA block is carried out through inter-channel self-attention and a Query Cycle Shift Operation (QCSO), which can capture multi-scale long-distance dependencies among channels. The experimental results on several benchmarks demonstrate the superiority of SPViT-FER.

Hao Qin, Haitao Yin
Deciphering the Visual Style of China’s Hit Short Videos Through Computer Vision

The rise of short video platforms underscores the pivotal role of visual style in communication. This study addresses three key challenges: (1) the absence of standardized quantitative frameworks for visual style, (2) limitations of single-platform studies, and (3) imprecise metrics for body language analysis. Integrating film theory with computer vision, we establish a quantitative framework analyzing three dimensions: filming style, body language, and emotional expression. We apply this framework to analyze the impact of visual style on communication effectiveness across 783 popular short videos from China's three largest platforms: Douyin, Kuaishou, and Bilibili. Results show that jittery frames (CP = 0.1424) and surprise elements (CP = 1.0067) enhance popularity on Bilibili, quick editing (CP = 0.7060) is essential for Douyin, and close-ups (CP = 0.2549) significantly impact Kuaishou. However, physical activity in short videos has little effect on popularity on these platforms (|r| ≤ 0.135). This study extends film theory to digital media and offers content creation and communication strategies for Chinese propagandists and short video creators.

Qinglan Wei, Shenlian Xiang, Chen Zhang, Xiaohui Yang, Long Ye
A Lightweight and Real-Time Asymmetric Multi-output Thermal Radiation Effects Correction in Infrared Images

Infrared imaging systems are highly susceptible to thermal radiation effects in complex environments, degrading optical imaging quality and reducing target contrast. Traditional correction methods, including iterative optimization based on physical models and deep learning approaches, face challenges in real-time applications due to high computational complexity and increased network depth. This paper proposes a lightweight and real-time deep learning model for thermal radiation correction. The model adopts an asymmetric encoder-decoder structure with a multi-output design to balance efficiency and real-time performance. A lightweight residual block (LRB) enhances feature learning while minimizing computation, and inter-stage feature fusion and channel-point attention modules improve multi-scale feature integration. Experimental results on PSNR, SSIM, MAE, and inference speed demonstrate that the proposed method outperforms existing approaches in computational efficiency and correction accuracy, making it highly suitable for real-time infrared imaging applications.

Bo Fu, Dongming Xie, Yuanxin Li, Yifan Guo, Xinyuan Deng, Yaozong Zhang, Yu Shi
SNN-PAR: Energy Efficient Pedestrian Attribute Recognition via Spiking Neural Networks

Artificial neural network based Pedestrian Attribute Recognition (PAR) has been widely studied in recent years; however, despite much progress, the energy consumption is still high. To address this issue, in this paper, we propose a Spiking Neural Network (SNN) based framework for energy-efficient attribute recognition. Specifically, we first adopt a spiking tokenizer module to transform the given pedestrian image into spiking feature representations. Then, the output is fed into the spiking Transformer backbone network for energy-efficient feature extraction. We feed the enhanced spiking features into a set of feed-forward networks for pedestrian attribute recognition. In addition to the widely used binary cross-entropy loss function, we also exploit knowledge distillation from an artificial neural network to the spiking Transformer network for more accurate attribute recognition. Extensive experiments on three widely used PAR benchmark datasets fully validate the effectiveness of our proposed SNN-PAR framework. The source code of this paper will be released at https://github.com/Event-AHU/OpenPAR.

Haiyang Wang, Qian Zhu, Mowen She, Yabo Li, Haoyu Song, Minghe Xu, Jin Tang, Xiao Wang
GramFormer-Based Crowd Counting with Learnable Fourier Encoding and Attention Mechanism

In the problem of crowd counting, accurately estimating the number of people in complex scenarios remains a challenging task. In this paper, we propose a GramFormer-based architecture with learnable Fourier encoding and a sliding-window attention mechanism for crowd counting. It aims to improve the flexibility of spatial position encoding and the diversity of input features. Specifically, we integrate learnable Fourier features into GramFormer's embedding layer for multi-dimensional spatial position encoding. This approach allows the model to learn optimal frequency parameters in a data-driven manner, reducing the need for manual frequency and scale adjustments. Additionally, the input feature map passes through a sliding-window attention module, which captures local variations in crowd density, retains local details, and integrates global context. Finally, local features are fused with the global learnable Fourier features to enhance the input to the GramFormer. To demonstrate the superiority of the proposed method, performance comparisons between our method and state-of-the-art methods are conducted on four crowd counting databases. The results demonstrate that our method outperforms competing methods in terms of MSE and MAE.
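
Learnable Fourier positional encoding amounts to a learned linear projection of coordinates followed by sine/cosine features; a hedged sketch in that spirit (dimensions and the final projection are assumptions, not GramFormer's actual embedding layer).

```python
import torch
import torch.nn as nn

class LearnableFourierPE(nn.Module):
    """Learnable Fourier positional encoding for 2D positions (illustrative)."""

    def __init__(self, pos_dim=2, n_freqs=32, out_dim=256):
        super().__init__()
        # learned frequency matrix replaces hand-picked sinusoid frequencies
        self.freqs = nn.Linear(pos_dim, n_freqs, bias=False)
        self.proj = nn.Linear(2 * n_freqs, out_dim)

    def forward(self, positions):
        # positions: (..., 2) normalized (x, y) coordinates
        phases = self.freqs(positions)
        fourier = torch.cat([torch.sin(phases), torch.cos(phases)], dim=-1)
        return self.proj(fourier)
```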

Wenjie Xia, Yehao Gu, Wenqian Jiang, Xiaohua Huang
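
Learnable Fourier features for positional encoding can be sketched as below; the frequency count, normalization, and projection head are illustrative assumptions rather than the GramFormer configuration used in the paper.

```python
import math
import torch
import torch.nn as nn

class LearnableFourierPE(nn.Module):
    """Map 2-D patch coordinates to a positional code through a learnable
    frequency matrix, so spatial frequencies are tuned from data."""
    def __init__(self, dim: int, n_freq: int = 64):
        super().__init__()
        self.freq = nn.Linear(2, n_freq, bias=False)   # learnable frequencies
        self.proj = nn.Linear(2 * n_freq, dim)

    def forward(self, coords):                         # coords: (N, 2), normalized to [0, 1]
        angles = 2 * math.pi * self.freq(coords)
        feats = torch.cat([angles.cos(), angles.sin()], dim=-1) / math.sqrt(self.freq.out_features)
        return self.proj(feats)
```
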
TEI-Face: A Temporal Expression and Identity Stability Oriented Face Swapping

Person-agnostic face swapping has gained significant attention in recent years and is expected to handle diverse real-world scenarios effectively and to broaden its applicability across various domains. However, expression and pose information in facial images is difficult to decouple from identity information and is susceptible to interference from the background, and contemporary algorithms struggle to maintain the temporal consistency of expressions and identities. Analysis of this issue reveals that, in face swapping scenarios, background information in dynamic videos exhibits less variation than facial regions and can be more readily replaced. In this paper, we propose a Temporal Expression and Identity stability oriented Face swapping method (TEI-Face), which reformulates face swapping into two subtasks: motion transfer and background replacement. Employing a face reenactment model as the backbone, we design a background correction module to perform background feature alignment and warping and integrate it with the driven source face. In addition, a cycle-consistency verification network implements a self-supervised procedure that ensures identity consistency. Experiments on the mainstream benchmarks FF++ and VFHQ demonstrate that TEI-Face achieves state-of-the-art face swapping results in terms of both identity and expression consistency.

Biying Li, Zhiwei Liu, Jinqiao Wang
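
One simple way to realize a self-supervised identity-consistency check of the kind mentioned above is a cycle loss on identity embeddings: swap, swap back, and compare. In the sketch below, `swap` and `id_encoder` are hypothetical callables standing in for the face-swapping pipeline and an identity encoder; the loss form is an illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

def cycle_identity_loss(swap, id_encoder, source, target):
    """swap(src, tgt) is assumed to place src's identity into tgt's frame."""
    swapped = swap(source, target)        # source identity on the target frame
    cycled = swap(target, swapped)        # swap back: target identity restored
    e_ref = F.normalize(id_encoder(target), dim=-1)
    e_cyc = F.normalize(id_encoder(cycled), dim=-1)
    # penalize any identity drift accumulated over the full cycle
    return 1.0 - (e_ref * e_cyc).sum(dim=-1).mean()
```
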
COS-SLAM: Coordinate Attention Semantic SLAM with Pixel-to-Line Transformer

Semantic SLAM is essential for high-level scene understanding in robotics and augmented reality. Recent NeRF-based implicit SLAM methods provide efficient scene representations, but often suffer from insufficient semantic integration, high memory consumption, and suboptimal feature alignment. In this paper, we present COS-SLAM, a novel framework that enhances both spatial and semantic expressiveness through a Tri-Plane representation with coordinate attention. Furthermore, we introduce a lightweight Pixel-To-Line Transformer module for direct extraction of line features from RGB pixels, and propose a depth-guided truncated Gaussian sampling strategy to improve sampling efficiency. Extensive experiments on the Replica and ScanNet benchmarks demonstrate that COS-SLAM achieves geometry and appearance reconstruction comparable to or surpassing state-of-the-art SLAM methods. Notably, for semantic reconstruction, COS-SLAM achieves an mIoU exceeding 93% on the Replica dataset, more than 10% higher than the baseline.

Handong Shen, Lingyu Liang, Beibei Liu, Xiaohao Liu, Xinchao Li, Guoxi Sun, Shuangping Huang
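
The depth-guided truncated Gaussian sampling strategy can be approximated as follows: draw per-ray samples from a Gaussian centred on the observed depth and clip them to a narrow band around it. Clipping is only a rough stand-in for true truncated sampling, and the band width and sample count below are assumed values.

```python
import torch

def depth_guided_samples(depth, n_samples=16, sigma=0.05, band=3.0):
    """depth: (R,) sensor depths per ray; returns (R, n_samples) sorted sample depths
    concentrated around the observed surface."""
    noise = torch.randn(depth.shape[0], n_samples, device=depth.device) * sigma
    noise = noise.clamp(-band * sigma, band * sigma)      # keep samples near the surface
    z = depth.unsqueeze(-1) + noise
    return z.clamp_min(1e-3).sort(dim=-1).values
```
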
Perspective Driven Prototype Alignment for Aerial-Ground Person Re-identification

Aerial-Ground Person Re-identification (AGPReID) aims to retrieve target individuals across UAV and ground cameras, focusing on the perspective variations due to altitude changes. The unique overhead perspective of UAVs presents challenges in achieving accurate semantic alignment for person re-identification (ReID). We propose a novel method named Perspective-Driven Prototype Alignment (PDPA) to address this issue. First, we design two learnable prompts for each identity to obtain view representations from different perspectives. Second, we propose a View-Guided Progressive Alignment (VGPA) module for cross-perspective refinement of text descriptions, exploring the intermediate feature space between different perspectives by combining the image and text features using a memory bank. To reduce the gap between image feature space and intermediate feature space, we propose an image-to-prototype cross-entropy loss to train the image encoder. Extensive experiments show that our method achieves SOTA performance on the CARGO and AG-ReID_v1 datasets.

Yuli Huang, Hongxu Chen, Zhanxiang Feng, Jianhuang Lai
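
The image-to-prototype cross-entropy loss mentioned above can be sketched as a softmax over cosine similarities between image features and per-identity prototypes; the temperature and normalization are illustrative assumptions.

```python
import torch.nn.functional as F

def image_to_prototype_ce(img_feats, prototypes, labels, temperature=0.07):
    """img_feats: (B, D) image features; prototypes: (K, D), one per identity;
    labels: (B,) identity indices."""
    img = F.normalize(img_feats, dim=-1)
    proto = F.normalize(prototypes, dim=-1)
    logits = img @ proto.t() / temperature   # cosine similarity scaled by temperature
    return F.cross_entropy(logits, labels)
```
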
Robust Single-View 3D Object Reconstruction with Stable Diffusion Generation and Farthest View Selection

View transformation robustness (VTR) is critical for deep learning-based single- or multi-view 3D object reconstruction, serving as a key metric to evaluate model stability under diverse view transformations. Despite its importance, VTR remains underexplored in existing 3D reconstruction research. While data augmentation with varied view transformations is a straightforward approach to enhance VTR, the rapid development of large vision models, particularly Stable Diffusion models, offers considerable potential for generating 3D models or synthesizing new view images from single inputs for related tasks. In this work, we propose leveraging Stable Diffusion models to generate novel views without incurring additional inference costs, thereby significantly improving the performance of single-view 3D object reconstruction. Specifically, shifting our focus from traditional neural radiance field methods, we explore view selection strategies in 3D reconstruction. First, we optimize viewpoint quality via the farthest point sampling algorithm. Then, we use a generative model to expand single views into multiple views, enhancing single-view reconstruction performance without retraining the implicit field. Extensive experiments show our method outperforms state-of-the-art 3D reconstruction techniques and other VTR-focused approaches, validating its superiority and effectiveness.

Zhouhang Luo, Qian Yu, Qi Zhang
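
Farthest point sampling over candidate viewpoints, as used for view selection above, can be written as a short greedy loop; the input here is assumed to be candidate camera positions, e.g. on a view sphere.

```python
import torch

def farthest_point_sampling(points, k):
    """Greedily pick k of the given 3-D points (candidate viewpoints) so that each new
    pick is as far as possible from everything already chosen. points: (N, 3)."""
    n = points.shape[0]
    chosen = [0]                                            # arbitrary seed viewpoint
    dist = torch.full((n,), float("inf"), dtype=points.dtype, device=points.device)
    for _ in range(k - 1):
        dist = torch.minimum(dist, (points - points[chosen[-1]]).norm(dim=-1))
        chosen.append(int(dist.argmax()))
    return torch.tensor(chosen)
```
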
From Sky to Site: A Unified Framework for Static and Dynamic 3D Reconstruction in Construction Sites

Accurate 3D reconstruction of construction sites is essential for progress tracking, safety management, and digital twin applications. However, UAV-based photogrammetry is limited in capturing dynamic scene changes, while surveillance cameras lack depth sensing and accurate calibration. To overcome these limitations, we propose a hierarchical reconstruction framework that integrates UAV imagery with surveillance imagery for scalable, real-time 3D reconstruction. Specifically, we first generate a metrically accurate 3D map from UAV images using GPS/RTK data and ground control points. Next, virtual views are rendered from this map and matched with surveillance images to localize surveillance cameras without manual calibration. Furthermore, by aligning monocular depth maps of surveillance images with the rendered depths, we calibrate the depth predictions to metric scale, enabling near real-time reconstruction of dynamic scene changes. In addition, our method demonstrates robust localization and reconstruction performance under challenging conditions such as weather variations and structural changes, making it well suited for long-term deployment in dynamic construction environments.

Tonglin Chen, Xinlin Ren, Jinghao Huang, Jiangyu Feng, Bin Li, Xiangyang Xue
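
Aligning a monocular depth map to metric scale against the rendered depth, as described above, is commonly done with a least-squares scale-and-shift fit over valid pixels; the sketch below assumes that simple affine model, which may differ from the paper's exact calibration.

```python
import torch

def align_depth_to_metric(pred, rendered, mask):
    """Solve for scale s and shift t minimizing ||s*pred + t - rendered||^2 over
    valid pixels (mask), then return the metrically aligned depth map."""
    p = pred[mask].flatten()
    r = rendered[mask].flatten()
    A = torch.stack([p, torch.ones_like(p)], dim=-1)        # (M, 2) design matrix
    sol = torch.linalg.lstsq(A, r.unsqueeze(-1)).solution.squeeze(-1)
    s, t = sol[0], sol[1]
    return s * pred + t
```
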
Adaptive Positional Encoding and Multi-scale Self-attention Transformer for Aerial Person Re-identification

With the widespread application of UAV surveillance in public security, person re-identification (ReID) has naturally extended to unmanned aerial vehicles. Traditional ground-based ReID typically relies on surveillance cameras with fixed positions and shooting angles; under such conditions, pedestrian images exhibit limited resolution variation and implicitly assume an upright body posture perpendicular to the ground. In aerial scenarios, by contrast, image resolution varies significantly with UAV flight altitude, while pedestrian poses undergo random translation and rotation. To address these challenges, this paper proposes an Adaptive Positional Encoding and Multi-scale Self-attention Transformer (APE-MSAT). The Adaptive Positional Encoding module leverages convolutional invariance to preserve robust neighborhood information under random pose variations, while the Multi-scale Self-attention Transformer employs progressively scaled feature maps to handle resolution variations. Our method achieves competitive results on aerial ReID datasets (PRAI-1581 and UAV-Human) as well as on the ground-based Market-1501 benchmark.

Zhizhi Lu, Jianhuang Lai
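
Convolution-based positional encoding of the kind the APE module relies on can be sketched as a depthwise convolution over the token grid added back to the patch embeddings; the module name and kernel size below are illustrative, not the paper's exact design.

```python
import torch.nn as nn

class ConvPositionalEncoding(nn.Module):
    """Depthwise 3x3 convolution over the token grid injects translation-tolerant
    neighborhood information into the patch embeddings."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens, h, w):                 # tokens: (B, h*w, C)
        b, n, c = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, c, h, w)
        return tokens + self.proj(grid).flatten(2).transpose(1, 2)
```
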
Class-Incremental Learning for Surface Defect Detection

Surface defect detection is crucial for quality control and production optimization in industrial processes. However, traditional surface defect detection models focus only on a closed, static detection scenario. This paper extends surface defect detection to the more practical class-incremental setting. First, to meet the performance requirements of industrial quality inspection, we propose an improved Deformable DETR in which a joint positional encoding method enhances the model's global perception. The improved Deformable DETR is then combined with knowledge distillation to form a class-incremental surface defect detection method. Experimental results in the surface defect detection scenario show that the proposed method achieves competitive performance, demonstrating its effectiveness for surface defect detection.

Bingrui Lei, Jiale Huang, Wenqian Jiang, Xiaohua Huang
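
One standard way to pair a detector with knowledge distillation for class-incremental learning, as described above, is to keep the old model frozen and penalize drift in its class responses; the temperature and loss form below are illustrative assumptions, not the paper's configuration.

```python
import torch.nn.functional as F

def incremental_distillation_loss(new_logits, old_logits, temperature=2.0):
    """KL divergence between the frozen old detector's class distribution and the
    new detector's, keeping responses on previously learned defect classes stable."""
    p_old = F.softmax(old_logits.detach() / temperature, dim=-1)
    log_p_new = F.log_softmax(new_logits / temperature, dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * temperature ** 2
```
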
Video Frame Interpolation via Iterative Optical Flow Refinement with Latent Motion Feature

Video frame interpolation is a fundamental technique for generating slow-motion effects and performing video frame rate upconversion. Accurate motion estimation is critical for flow-based interpolation methods. However, existing approaches often yield inaccurate optical flow in complex scenarios involving illumination variations or occlusions. These inaccuracies propagate to the final interpolated frames, introducing artifacts. To address this limitation, we optimize initial optical flow estimations. Specifically, a multi-scale feature extraction module first extracts motion features to estimate preliminary optical flow. Subsequently, an optical flow optimization module encodes the flow, extracts latent spatiotemporal motion features, and integrates them with contextual spatiotemporal information. A dedicated network then generates the refined optical flow, resolving inherent inaccuracies. Experimental results demonstrate that our optimization enhances flow details, producing interpolated frames with sharper object boundaries and superior quality.

Zhiqiang Jiang, Lin Liu, Tianrui Liu, Junjie Huang
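
Flow-based interpolation of the kind discussed above ultimately warps the input frames with the estimated flow; a minimal backward-warping helper (independent of the paper's refinement modules) might look like this.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Warp frame (B, C, H, W) with optical flow (B, 2, H, W) given in pixels."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                              # displaced coordinates
    # normalize to [-1, 1] as expected by grid_sample
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                           # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)
```
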
Backmatter
Title
Image and Graphics
Edited by
Zhouchen Lin
Liang Wang
Yugang Jiang
Xuesong Wang
Shengcai Liao
Shiguang Shan
Risheng Liu
Jing Dong
Xin Yu
Copyright Year
2026
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-95-3393-0
Print ISBN
978-981-95-3392-3
DOI
https://doi.org/10.1007/978-981-95-3393-0

The PDF files of this book were created in accordance with the PDF/UA-1 standard to improve accessibility. This includes screen-reader support, described non-textual content (images, graphics), bookmarks for easy navigation, keyboard-friendly links and forms, and searchable, selectable text. We recognize the importance of accessibility and welcome inquiries about the accessibility of our products. For questions or accessibility needs, please contact us at accessibilitysupport@springernature.com.
