
2025 | Book

Pattern Recognition and Computer Vision

7th Chinese Conference, PRCV 2024, Urumqi, China, October 18–20, 2024, Proceedings, Part VIII

Editors: Zhouchen Lin, Ming-Ming Cheng, Ran He, Kurban Ubul, Wushouer Silamu, Hongbin Zha, Jie Zhou, Cheng-Lin Liu

Publisher: Springer Nature Singapore

Book series: Lecture Notes in Computer Science


About this book

This 15-volume set LNCS 15031-15045 constitutes the refereed proceedings of the 7th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2024, held in Urumqi, China, during October 18–20, 2024.

The 579 full papers presented were carefully reviewed and selected from 1526 submissions. The papers cover various topics in the broad areas of pattern recognition and computer vision, including machine learning, pattern classification and cluster analysis, neural network and deep learning, low-level vision and image processing, object detection and recognition, 3D vision and reconstruction, action recognition, video analysis and understanding, document analysis and recognition, biometrics, medical image analysis, and various applications.

Table of Contents

Frontmatter

Low-Level Vision and Image Processing I

Frontmatter
Focal Perception Transformer for Light Field Salient Object Detection

Recently, light field salient object detection (LFSOD) has attracted increasing attention due to the significant improvements achieved in challenging scenes using rich light field cues. While many works have made significant progress in this field, a deeper insight into the focal nature of the task remains to be developed. In this work, we propose the Focal Perception Transformer (FPT), which efficiently encodes the context within the focal stack and the all-focal image. Specifically, we introduce focal-related tokens to summarize image-specific characteristics and propose a token communication module (TCM) to convey information and facilitate spatial contextual modeling. The features of each image are enriched and correlated with other images through the exchange of information between the precisely encoded focal-related tokens. We also propose a focal perception enhancement (FPE) strategy to help suppress noisy background information. Extensive experiments on four widely-used benchmark datasets demonstrate that the proposed model outperforms the state-of-the-art methods. The source code will be publicly available at https://github.com/combofish/FPTNet.

Liming Zhao, Miao Zhang, Yongri Piao, Jihao Yin, Huchuan Lu
A Fourier Transform Framework for Domain Adaptation

By utilizing unsupervised domain adaptation (UDA), knowledge can be transferred from a label-rich source domain to a target domain that contains relevant information but lacks labels. Many existing UDA algorithms suffer from directly using raw images as input, resulting in models that overly focus on redundant information and exhibit poor generalization capability. To address this issue, we attempt to improve the performance of unsupervised domain adaptation by employing the Fourier method (FTF). Specifically, FTF is inspired by the observation that the amplitude of the Fourier spectrum primarily captures low-level statistical information. In FTF, we effectively incorporate low-level information from the target domain into the source domain by fusing the amplitudes of both domains in the Fourier domain. Additionally, we observe that extracting features from batches of images can eliminate redundant information while retaining class-specific features relevant to the task. Building upon this observation, we apply the Fourier transform at the data-stream level for the first time. To further align multiple sources of data, we introduce the concept of correlation alignment. To evaluate the effectiveness of our FTF method, we conducted evaluations on four benchmark datasets for domain adaptation: Office-31, Office-Home, ImageCLEF-DA, and Office-Caltech. Our results demonstrate superior performance.
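As a rough illustration of the amplitude-fusion idea summarized above, the sketch below blends the Fourier amplitude spectra of a source and a target image while keeping the source phase. This is a minimal sketch only; the function name and the alpha blending weight are illustrative assumptions, and the paper's batch-level transform and correlation alignment are not reproduced.

```python
# Minimal NumPy sketch of amplitude fusion in the Fourier domain
# (illustrative, not the authors' FTF implementation).
import numpy as np

def fuse_amplitude(source_img, target_img, alpha=0.5):
    """source_img, target_img: float arrays of identical shape (H, W) or (H, W, C)."""
    fft_src = np.fft.fft2(source_img, axes=(0, 1))
    fft_tgt = np.fft.fft2(target_img, axes=(0, 1))

    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_tgt = np.abs(fft_tgt)

    # Low-level statistics live mostly in the amplitude; mix the two amplitudes
    # while keeping the source phase, which carries the semantic structure.
    amp_mixed = (1 - alpha) * amp_src + alpha * amp_tgt
    fused = np.fft.ifft2(amp_mixed * np.exp(1j * pha_src), axes=(0, 1))
    return np.real(fused)
```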

Le Luo, Bingrong Xu, Qingyong Zhang, Cheng Lian, Jie Luo
Bidirectional Alternating Fusion Network for RGB-T Salient Object Detection

RGB-Thermal Salient Object Detection (SOD) aims to identify common salient regions or objects from both the visible and thermal infrared modalities. Existing methods are usually based on hierarchical interactions within the same modality or between different modalities at the same level. However, this approach may lead to a situation where one modality or one level of features dominates the fusion result, failing to fully utilize the complementary information of the two modalities. Additionally, these methods usually overlook the potential for the network to extract specific information from each modality. To address these issues, we propose a Bidirectional Alternating Fusion Network (BAFNet) consisting of three modules for RGB-T salient object detection. In particular, we design a Global Information Enhancement Module (GIEM) for improving the information representation of high-level features. Then we propose a novel bidirectional alternating fusion strategy applied during decoding, and we design a Multi-modal Multi-level Fusion Module (MMFM) for collaborating multi-modal multi-level information. Furthermore, we embed the proposed Modal Erase Module (MEM) into both GIEM and MMFM to extract the inherent specific information in each modality. Our extensive experiments on three public benchmark datasets show that our method achieves outstanding performance compared to state-of-the-art methods.

Zhengzheng Tu, Danying Lin, Bo Jiang, Le Gu, Kunpeng Wang, Sulan Zhai
SimpleFusion: A Simple Fusion Framework for Infrared and Visible Images

Integrating visible and infrared images into one high-quality image, also known as visible and infrared image fusion, is a challenging yet critical task for many downstream vision tasks. Most existing works utilize pretrained deep neural networks or design sophisticated frameworks with strong priors for this task, which may be unsuitable or lack flexibility. This paper presents SimpleFusion, a simple yet effective framework for visible and infrared image fusion. Our framework follows the decompose-and-fuse paradigm, where the visible and infrared images are decomposed into reflectance and illumination components via Retinex theory, followed by the fusion of the corresponding components. The whole framework is designed with two plain convolutional neural networks without downsampling, which can perform image decomposition and fusion efficiently. Moreover, we introduce a decomposition loss and a detail-to-semantic loss to preserve the complementary information between the two modalities for fusion. We conduct extensive experiments on challenging benchmarks, verifying the superiority of our method over previous state-of-the-art methods. Code is available at https://github.com/hxwxss/SimpleFusion-A-Simple-Fusion-Framework-for-Infrared-and-Visible-Images.
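The decompose-and-fuse paradigm described above can be pictured with a small sketch: a plain, downsampling-free convolutional net splits each input into Retinex reflectance and illumination components, and the corresponding components are merged before recomposition. The layer widths and the max/average fusion rules below are illustrative assumptions, not SimpleFusion's actual design or losses.

```python
# Illustrative PyTorch sketch of Retinex-style decompose-and-fuse
# (assumptions: single-channel inputs, max/average fusion rules).
import torch
import torch.nn as nn

class PlainDecomposer(nn.Module):
    """Plain conv net (no downsampling) that splits an image x into reflectance R
    and illumination L such that x is approximately R * L (Retinex theory)."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        out = self.body(x)
        return out[:, :1], out[:, 1:]            # reflectance, illumination

def fuse(visible, infrared, decomposer):
    r_v, l_v = decomposer(visible)
    r_i, l_i = decomposer(infrared)
    r_fused = torch.maximum(r_v, r_i)            # keep the stronger detail/structure
    l_fused = 0.5 * (l_v + l_i)                  # average the illumination components
    return r_fused * l_fused                     # recompose via Retinex
```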

Ming Chen, Yuxuan Cheng, Xinwei He, Xinyue Wang, Yan Aze, Jinhai Xiang
A Cross-Consistency Strategy for Clearer Perception in Low-Light Haze

Low-light and hazy scenes often coexist, presenting challenges for visual enhancement in various tasks such as autonomous driving and video surveillance. Despite numerous methods proposed for image dehazing and low-light enhancement individually, their straightforward integration often fails to produce satisfactory results for this specific challenge. In this paper, we propose a novel approach to enhance visibility in low-light hazy scenarios. To tackle this formidable challenge, we introduce two key techniques: a cross-consistency dehazing-enhancement framework and a physically based simulation for generating a low-light hazy dataset. The framework is specifically designed to improve the visibility of input images by leveraging information from various sub-tasks, while the simulation is developed to create datasets with ground truth using the proposed low-light hazy imaging model. Extensive experimental results demonstrate that our method outperforms state-of-the-art solutions across various metrics, including a 9.19% increase in SSIM and a 5.03% increase in PSNR. Additionally, we conduct a user study on real images to underscore the effectiveness and necessity of our proposed method from the perspective of human visual perception.
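For intuition, a low-light hazy training pair can be simulated by combining the standard atmospheric scattering model with simple global dimming and sensor noise; the sketch below does exactly that. The paper proposes its own low-light hazy imaging model, which this sketch does not claim to reproduce, and all parameter values here are arbitrary.

```python
# Rough simulation sketch: atmospheric scattering + dimming + noise
# (not the paper's imaging model).
import numpy as np

def simulate_low_light_haze(clean, depth, beta=1.0, airlight=0.8, dim=0.3, sigma=0.02):
    """clean: clean image in [0, 1], shape (H, W, 3); depth: scene depth map (H, W)."""
    t = np.exp(-beta * depth)[..., None]                     # transmission from depth
    hazy = clean * t + airlight * (1.0 - t)                  # standard scattering model
    dark = dim * hazy                                        # global dimming for low light
    noisy = dark + np.random.normal(0.0, sigma, dark.shape)  # sensor noise in the dark
    return np.clip(noisy, 0.0, 1.0)
```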

Sijia Wen, Chaoqun Zhuang, Yunfei Liu, Feng Lu
Improve Corruption Robustness of Intracellular Structures Segmentation in Fluorescence Microscopy Images

Intracellular structures such as the endoplasmic reticulum and mitochondria are a significant topic in the life sciences, in which the segmentation of intracellular structures from fluorescence microscopy (FM) images plays an important role in morphological analysis and clinical applications of biological cells. However, due to natural interferences, FM image acquisition is usually unstable, resulting in degraded images. When deep neural network (DNN) models segment intracellular structures on these corrupted images, they often exhibit low robustness; that is, their extracted structures are unsatisfactory. To obtain more stable DNN models for widespread deployment in practice, we need to solve this problem. Therefore, this paper is committed to improving the corruption robustness of DNN models so that they possess a stronger ability to deal with corrupted FM images. First, we propose three data augmentations to diversify the training data distributions. Second, we design experiments and find that a flatter loss landscape is generally associated with higher corruption robustness. Therefore, LayerSAM-DA-Seg is proposed, a method that combines layer-adaptive sharpness-aware minimization (LayerSAM) and data augmentation (DA). This method helps DNN models converge to flatter minima, thus gaining stronger robustness. At the same time, data augmentation pushes models to memorize diverse data distributions, thus gaining a greater ability to process corrupted data. Experiments indicate that, compared with a single optimization, the combination of LayerSAM and data augmentation achieves greater gains. Our LayerSAM-DA-Seg achieves the most stable intracellular structures segmentation on corrupted FM images. The code is available at https://github.com/cbmi-group/Improve-Robustness-of-ICS.
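The sharpness-aware minimization step that LayerSAM builds on can be sketched as a two-pass update: climb to a nearby worst-case point in weight space, take the gradient there, then step from the original weights. The sketch below is plain SAM with a global perturbation radius; LayerSAM's layer-adaptive scaling and the paper's data augmentations are not reproduced.

```python
# Minimal PyTorch sketch of a vanilla SAM training step (illustrative only).
import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    x, y = batch
    loss_fn(model(x), y).backward()                       # gradient at current weights

    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2) + 1e-12

    # 1) climb to the (approximate) worst point within an L2 ball of radius rho
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / grad_norm
            p.add_(e)
            eps.append(e)
    model.zero_grad()

    # 2) gradient at the perturbed point, then restore weights and step
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    model.zero_grad()
```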

Liqun Zhong, Yanfeng Zhou, Ge Yang
RDSR: Reparameterized Lightweight Diffusion Model for Image Super-Resolution

The diffusion model has achieved impressive results on low-level tasks, and recent studies attempt to design efficient diffusion models for image super-resolution. However, they have mainly focused on reducing the number of parameters and FLOPs through various network designs. Although these methods can decrease the number of parameters and floating-point operations, they may not necessarily reduce actual running time. To enable faster DM inference on limited computational resources while retaining quality and flexibility, we propose a Reparameterized Lightweight Diffusion Model SR network (RDSR), which consists of a Latent Prior Encoder (LPE), a Reparameterized Decoder (RepD), and a diffusion model conditioned on degraded images. Specifically, we first pretrain an LPE, which takes paired HR and LR patches as inputs and maps them from pixel space to latent space. RepD has a VGG-like inference-time body composed of nothing but a stack of 3×3 convolutions and ReLU, while the training-time model has a multi-branch structure. Our diffusion model serves as a bridge between LPE and RepD: LPE employs a distillation loss to supervise the reverse diffusion process, and the output of the reverse diffusion process acts as a modulator to guide RepD to reconstruct high-quality results. RDSR can effectively reduce GPU memory consumption and improve inference speed. Extensive experiments on SR benchmarks demonstrate the superiority of our RDSR over state-of-the-art DM methods; e.g., RDSR-2.2M achieves 30.11 dB PSNR on the DIV2K100 dataset, surpassing equal-order DM-based models, while trading off parameters, efficiency, and accuracy well: it runs 55.8× faster than DiffIR on an Intel(R) Xeon(R) Platinum 8255C CPU.
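Structural reparameterization of the kind RepD relies on can be illustrated with a toy example: a training-time pair of parallel 3×3 and 1×1 convolutions is folded into one 3×3 convolution for inference, so the deployed body is just 3×3 conv + ReLU. BatchNorm folding and RDSR's exact branch design are omitted; this is only a sketch of the mechanism.

```python
# Toy sketch of structural reparameterization (illustrative, not RDSR's RepD).
import torch
import torch.nn.functional as F

def merge_branches(conv3x3, conv1x1):
    """Return (weight, bias) of one 3x3 conv equivalent to conv3x3(x) + conv1x1(x)."""
    # Pad the 1x1 kernel to 3x3 by placing it at the centre, then sum kernels/biases.
    w1 = F.pad(conv1x1.weight, [1, 1, 1, 1])
    return conv3x3.weight + w1, conv3x3.bias + conv1x1.bias

# Usage: build the two branches, merge them, and verify numerical equivalence.
c3 = torch.nn.Conv2d(16, 16, 3, padding=1)
c1 = torch.nn.Conv2d(16, 16, 1)
w, b = merge_branches(c3, c1)
x = torch.randn(1, 16, 32, 32)
fused = F.conv2d(x, w, b, padding=1)
assert torch.allclose(fused, c3(x) + c1(x), atol=1e-5)
```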

Ouyang Sun, Jun Long, Wenti Huang, Zhan Yang, ChenHao Li
ACENet: Adaptive Context Enhancement Network for RGB-T Video Object Detection

RGB-thermal (RGB-T) video object detection (VOD) aims to leverage the complementary advantages of visible and thermal infrared sensors to achieve robust performance under various challenging conditions, such as low illumination and extreme illumination changes. However, existing multimodal VOD approaches face two critical challenges: accurate detection of objects at different scales and efficient fusion of temporal information from multimodal data. To address these issues, we propose an Adaptive Context Enhancement Network (ACENet) for RGB-T VOD. Firstly, we design an Adaptive Context Enhancement Module (ACEM) to adaptively enhance multi-scale context information. We introduce ACEM in the FPN section, where it can adaptively extract context information and incorporate it into the high-level feature maps. Secondly, we design a Multimodal Temporal Fusion Module (MTFM) to perform temporal and modal fusion using coordinate attention with atrous convolution at the early stage, significantly reducing the complexity of fusing temporal information from RGB and thermal data. Experimental results on the VT-VOD50 dataset show that our ACENet significantly outperforms other mainstream VOD methods. Our code will be available at: https://github.com/bscs12/ACENet .

Zhengzheng Tu, Le Gu, Danying Lin, Zhicheng Zhao
Fake-GPT: Detecting Fake Image via Large Language Model

With the development of Artificial Intelligence Generated Content (AIGC), fake image detection has become increasingly challenging. Leveraging the advanced capabilities of large language models (LLMs) in sequence prediction, we propose a novel perspective on fake image detection by fine-tuning pure LLMs. We introduce Fake-GPT, an LLM with 7 billion parameters that can differentiate between real and fake images. Unlike conventional image processing models, our approach directly processes RGB pixel values without relying on any position embedding or visual-language feature alignment, thereby reducing model complexity and processing steps. Our research demonstrates the effective application of LLMs in detecting fake images, thereby expanding their application in non-textual domains. Extensive experiments conducted on various deepfake datasets show that Fake-GPT achieves competitive results compared with conventional image processing models, underscoring its potential as a new paradigm in the realm of image authentication.

Yuming Fan, Dongming Yang, Jiguang Zhang, Bang Yang, Yuexian Zou
Towards Elastic Image Super-Resolution Network via Progressive Self-distillation

Recently, there has been a growing demand to implement super-resolution (SR) networks on devices with constrained resources. Nevertheless, most existing SR networks must remain consistent during both training and testing, restricting the model’s adaptability in real scenarios. Consequently, attaining elastic reconstruction without retraining is a crucial challenge. To accomplish this, we propose a novel model compression and acceleration framework through the Channel Splitting and Progressive Self-distillation (CSPS) strategy. Specifically, we construct a compact student network from the target teacher network by employing the channel splitting strategy, which removes a certain proportion of channel dimensions from the teacher network. Afterward, we incorporate auxiliary upsampling layers into the intermediate feature maps and propose the progressive self-distillation. Once trained, our CSPS can achieve elastic reconstruction by adjusting the channel splitting ratio and the number of feature extraction blocks. Extensive experiments demonstrate that the proposed CSPS can effectively compress and accelerate various off-the-shelf SR models.
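The channel splitting step described above can be pictured as initialising a narrower student layer from a leading slice of the teacher's weights, as in the rough sketch below. The split ratio, which channels are kept, and the progressive self-distillation schedule are the paper's design choices and are only assumed here.

```python
# Rough sketch of channel splitting for one convolution (illustrative assumptions).
import torch.nn as nn

def split_conv(teacher_conv: nn.Conv2d, ratio: float = 0.5) -> nn.Conv2d:
    out_c = max(1, int(teacher_conv.out_channels * ratio))
    in_c = max(1, int(teacher_conv.in_channels * ratio))
    student = nn.Conv2d(in_c, out_c, teacher_conv.kernel_size,
                        stride=teacher_conv.stride, padding=teacher_conv.padding,
                        bias=teacher_conv.bias is not None)
    # Copy the leading slice of the teacher's kernels and biases into the student.
    student.weight.data.copy_(teacher_conv.weight.data[:out_c, :in_c])
    if teacher_conv.bias is not None:
        student.bias.data.copy_(teacher_conv.bias.data[:out_c])
    return student
```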

Xin’an Yu, Dongyang Zhang, Cencen Liu, Qiang Dong, Guiduo Duan
PHANet: Progressive Hybrid Attention Network for Enhanced Video Deraining

The task of video deraining is crucial and intricate in the field of computer vision due to the detrimental effects rain can have on video recordings, such as blurriness and occlusion. While numerous video deraining methods have shown remarkable accomplishments, two significant issues persistently endure and require attention: (1) How to efficiently and accurately extract spatio-temporal features from consecutive frames for background restoration, leveraging the spatial and temporal correlations within the video sequence, and (2) How to adaptively select motion patterns in the temporal domain to effectively handle complex rain movements. Regarding the challenges above, we propose a Progressive Hybrid Attention Network (PHANet), a novel approach consisting of a two-stage progressive network and multiple attention-based feature processing modules, providing an enhanced solution for video deraining by employing a task division strategy from coarse to fine and hybridizing various attention mechanisms. To be specific, we design a Spatio-Temporal Attention Module (STAM), a Supervised Fusion Attention Module (SFAM) and a Multi-Feature Refine Module (MFRM) to be applied to the network at different stages, which effectively improve the performance of the video deraining task with fewer computing resources. The experimental results on multiple public datasets demonstrate that our proposed PHANet exhibits significant improvements in both performance and speed compared to the previous state-of-the-art (SOTA) methods.

Ran Li, Tao Deng, Chengfan Yang, Yi Huang, Wenbo Liu
Frequency Adapter and Spatial Prompt Network for All-in-One Blind Image Restoration

Image restoration aims to obtain a high-quality image from a degraded one. For real-world applications, an increasing number of methods are moving towards addressing multiple degradations using a single model. However, most of these methods still require task-specific training and primarily extract information from the spatial domain. To overcome this challenge, we introduce a novel All-in-one network, FASPNet, which effectively incorporates both frequency and spatial information to handle various degradations, without requiring any degradation priors. Specifically, we propose a Frequency Refiner Module (FRM), which adaptively adjusts frequency representations and captures crucial global frequency information to facilitate better image restoration. Furthermore, to provide essential low-level information related to restoration, we introduce a Spatial Prompt Module (SPM), utilizing prompts to encode restoration-relevant spatial detail representations and abstract degradation patterns. Extensive experiments have demonstrated that our model outperforms other baseline models on multiple datasets for three common and challenging tasks: deraining, dehazing, and denoising.

Shuoming Chen, Wenjie Pei, Yao Lu, Guangming Lu
Ambient Illumination Disentangled Based Weakly-Supervised Image Restoration Using Adaptive Pixel Retention Factor

Existing image restoration algorithms are typically designed for specific domains. It is extremely challenging to achieve enhancement for both low-light and underwater images through a single model. Additionally, due to the extremely limited number of annotated underwater images, the model's generalization performance is poor. In this paper, we present an ambient illumination disentangled network based weakly-supervised image restoration (WSIR) approach, aiming to utilize incompletely labeled images to achieve the restoration of various low-quality images. On the one hand, we design an illumination disentanglement network (Idnet) to learn the mapping rules for Retinex theory and establish a data-driven camera response function (DdCRF) for illumination adjustment. On the other hand, we design an Adaptive Pixel Retention Factor Network (APRFNet) for generating the parameter maps in DdCRF, which improves its robustness and flexibility in complex and changeable environments, promoting the authenticity and visual aesthetics of the reconstructed results. Extensive experiments on public datasets and self-collected images demonstrate that our proposed scheme outperforms state-of-the-art methods in both qualitative and quantitative metrics.

Ruiqi Mao, Rongxin Cui
F4SR: A Feed-Forward Regression Approach for Few-Shot Face Super-Resolution

This paper presents a novel approach to face super-resolution that explicitly models the relationship between low-resolution and high-resolution images. Unlike many existing methods, the proposed approach does not require a large number of high-resolution and low-resolution image pairs for training, making it applicable in scenarios with limited training data. By utilizing a feed-forward regression model, the proposed method provides a more interpretable and transparent approach to face super-resolution, enhancing the explainability of the super-resolution process. In particular, by progressively exploiting the contextual information of local patches, the proposed feed-forward regression method can model the relationship between the low-resolution and high-resolution images within a large receptive field. This is somewhat similar to the idea of convolutional neural networks, but the proposed approach is completely interpretable. Experimental results demonstrate that the proposed method achieves good performance in generating high-resolution face images, outperforming several existing methods. Overall, the proposed method contributes to advancing the field of face super-resolution by introducing a more interpretable and transparent approach that can achieve good results with minimal training data.

Jican Fu, Kui Jiang, Xianming Liu
Thangka Mural Super-Resolution Based on Nimble Convolution and Overlapping Window Transformer

Thangka murals are important cultural heritages of Tibet, but most existing Thangka images are of low resolution. Thangka mural super-resolution reconstruction is therefore very important for the protection of Tibetan cultural heritage. Transformer-based methods have shown impressive performance in Thangka image super-resolution. However, current Transformer-based methods still face two major challenges when addressing the super-resolution problem of Thangka images: (1) the self-attention mechanism used in Transformer models does not have high sensitivity to slender and curved edge textures; (2) the sliding window mechanism used in Transformer models cannot fully achieve cross-window information interaction. To resolve these problems, we propose a Thangka mural super-resolution reconstruction method based on Nimble Convolution (NConv) and Overlapping Window Self-Attention (OWSA). The proposed method consists of three parts: (1) a Nimble Convolution (NConv) to enhance the perception of slender and curved geometric structures; (2) an Overlapping Window Self-Attention (OWSA) to facilitate more direct feature interaction among adjacent windows within the Transformer; and (3) for the first time, a Thangka image super-resolution dataset, which contains 8,868 pairs of 512×512 images. We expect this dataset to serve as a valuable reference for future research. Both objective and subjective evaluations validated the competitive performance of our method.

Liqi Ji, Nianyi Wang, Xin Chen, Xinyang Zhang, Zhen Wang, Yunbo Yang
EHAT: Enhanced Hybrid Attention Transformer for Remote Sensing Image Super-Resolution

In recent years, deep learning (DL)-based super-resolution techniques for remote sensing images have made significant progress. However, these models have constraints in effectively managing long-range non-local information and reusing features, while also encountering issues such as gradient vanishing and explosion. To overcome these challenges, we propose the Enhanced Hybrid Attention Transformer (EHAT) framework, which is based on the Hybrid Attention Transformer (HAT) network backbone and combines a region-level non-local neural network block with a skip fusion network (SFN) to form a new skip fusion attention group (SFAG). In addition, we form a Multi-attention Block (MAB) by introducing a spatial frequency block (SFB) based on fast Fourier convolution. We have conducted extensive experiments on the UC Merced, CLRS and RSSCN7 datasets. The results show that our method improves the PSNR by about 0.2 dB on UC Merced ×4.

Jian Wang, Zexin Xie, Yanlin Du, Wei Song
MIAFusion: Infrared and Visible Image Fusion via Multi-scale Spatial and Channel-Aware Interaction Attention

The purpose of Infrared and Visible Image Fusion (IVIF) is to obtain fused images with highlighted objects and rich details. Existing Transformer-based IVIF algorithms mainly depend on self-attention mechanisms to capture long-range dependencies. However, the single-scale self-attention in original transformers can only extract spatial global features, ignoring channel and cross-dimensional self-attention. To solve this problem, we propose a novel Transformer-based IVIF approach based on Multi-scale spatial and channel-aware Interaction Attention (MIAFusion). Specifically, a Dual Convolution Attention Module (DCAM) is applied to obtain channel and spatial features by combining spatial and channel attention. Moreover, a Multi-Scale Transformer Module (MSTM) is utilized to capture both cross-dimensional and long-range dependencies. Different from the single-scale self-attention in original transformers, Multi-scale Interaction self-Attention (MIA) is applied in MSTM to replace multi-head self-attention. By utilizing MIA, attention interaction is achieved among different channels, contributing to effective feature fusion of the two modalities. Qualitative and quantitative experimental results conducted on the publicly available TNO, RoadScene and M3FD datasets prove that our method has better performance in comparison with several state-of-the-art methods.

Teng Lin, Ming Lu, Min Jiang, Jun Kong
Color Enhanced Network for Image Dehazing

Image dehazing is an important and challenging task in image processing. Existing dehazing methods often encounter color distortion in the dehazed results. To address this issue, in this paper, we propose a novel approach named Color Enhanced Dehazing network (CED). It consists of two main branches: a dehazing branch and a color reconstruction branch. Initially, we employ the Fast Fourier Transform to separate the low-frequency sub-image, which contains substantial color information, from the hazy image. Alongside inputting the original hazy image into the dehazing branch, we concurrently feed the low-frequency sub-image into the color reconstruction branch. This allows us to extract and reconstruct the corresponding color information to augment the dehazing process. To thoroughly fuse the information from the two branches, we design a Selective Spatial-Channel adjustment fusion module (SSC) for feature fusion across different branches. Extensive experiments on benchmark datasets demonstrate the effectiveness and superiority of the proposed method in image dehazing.
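Separating a low-frequency sub-image that carries most of the colour information can be done with a simple FFT low-pass mask, as sketched below. The cutoff radius is an illustrative assumption, and CED's actual separation and two-branch network are not reproduced.

```python
# Minimal NumPy sketch of low-frequency separation via an FFT low-pass mask.
import numpy as np

def low_frequency_component(img, radius=16):
    """img: float array (H, W, C) in [0, 1]; returns its low-pass version."""
    h, w = img.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    mask = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2) <= radius ** 2  # circular low-pass

    spec = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    spec_low = spec * mask[..., None]
    low = np.fft.ifft2(np.fft.ifftshift(spec_low, axes=(0, 1)), axes=(0, 1))
    return np.real(low)
```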

Yongsha Nie
Advancing Real-World Burst Denoising: A New Benchmark and Dual-Branch Burst Denoising Network

Image denoising has made remarkable strides over the years, primarily focusing on reconstructing a clean image from a single degraded input. However, in practical applications, burst image denoising demonstrates promising potential for restoring higher-quality images from burst sequences, particularly relevant for burst photography. In this paper, we pioneer the establishment of a new real-world burst denoising benchmark, named RealBDN, which consists of noisy image bursts and the corresponding noise-free/noise-low images. To tackle the challenges of real-world burst denoising, we propose a Dual-Branch Burst Denoising (DBBD) network to investigate the real-world noise among image bursts in the Noise-Learning Branch for facilitating image denoising in the decoding branch. Our DBBD specifically employs a homography alignment and a feature extractor to separately extract noise features and image features. These noise features are then processed to derive fused noise features and generate an estimated noise map within the Noise-Learning Branch. The fused noise features serve as a guide for the Decoding Branch during the restoration process. Moreover, we have conducted extensive experiments on the RealBDN dataset. The results demonstrate the superior performance of our DBBD, as evidenced by both quantitative and qualitative measures. In the interest of promoting further research and development in this field, we will publicly release our dataset, trained model, and source codes.

Huilei Wu, Qing Zhao, Zhiyuan Song, Pengxu Wei
DIFNet: Dual-Domain Information Fusion Network for Image Denoising

Image denoising, a critical process in computer vision, aims to restore high-quality images from their noisy counterparts. Significant progress in this field has been made possible by the emergence of various effective deep learning models. However, these methods are typically confined to processing within a single domain and exhibit weak performance in preserving detailed information, hindering their practical application. To address this issue, we propose an efficient Dual-Domain Information Fusion Network (DIFNet) for image denoising. Specifically, we design an aggregate frequency-domain and spatial-domain network to capture and fuse detailed information. The DIFNet employs a Dual-Domain Feature Fusion Module (DFFM) to integrate the extracted dual-domain information, facilitating the recalibration of weights between the spatial and frequency domains, thereby emphasizing and restoring detailed information. In the DFFM, frequency-domain information is extracted through a Frequency Domain Attention Module (FDAM), while spatial-domain information is acquired via the convolutional blocks of the image denoising baseline model NAFNet. Experimental results demonstrate that the dual-domain denoising method can recover more detail while maintaining denoising performance. Furthermore, the proposed method outperforms state-of-the-art approaches on widely used benchmarks, highlighting its superior performance.

Zedong Wu, Wenxu Shi, Liming Xu, Zicheng Ding, Tong Liu, Zheng Zhang, Bochuan Zheng
Simultaneous Snow Mask Prediction and Single Image Desnowing with a Bidirectional Attention Transformer Network

Snow’s diverse shapes and sizes pose a significant challenge for single-image desnowing. Current methods typically rely on simple networks to predict the snow masks and subsequently remove snow from images accordingly. However, inaccurate predictions often result in incomplete snow removal. This paper introduces BAT-Net, a novel solution equipped with an image transformer encoder and dual transformer decoders. This architecture enables precise snow mask prediction and comprehensive single image desnowing simultaneously. The image encoder enhances and consolidates features from various layers using Scale Conversion Modules (SCM) and a Feature Aggregation Module (FAM). To ensure accurate snow mask prediction, the snow decoder mirrors the complete transformer structure of the background decoder. Within the background decoder, multiple Bidirectional Attention Modules (BAM) with integrated forward and reverse attention branches effectively leverage snow features and reverse snow features from the snow decoder, facilitating complete snow removal. Moreover, real-world desnowing datasets often contain fallen snow, complicating model training and validation. To address this challenge, we introduce the FallingSnow dataset, exclusively featuring scenes of falling snow. Experimental results across five diverse synthetic and real-world snow removal datasets demonstrate BAT-Net’s significant advancements in addressing the challenges of snow mask prediction and single-image desnowing.

Yongheng Zhang, Danfeng Yan
DBIF: Dual-Branch Feature Extraction Network for Infrared and Visible Image Fusion

In image fusion, combining infrared and visible images from different sensors is crucial to create a complete representation that merges complementary information. However, current deep learning approaches, mainly using Convolutional Neural Networks (CNNs) or Transformer architectures, do not fully capitalize on the distinct features of infrared and visible images. To overcome this limitation, we introduce a novel Dual-Branch feature extraction network for infrared and visible image fusion (DBIF). DBIF optimally leverages the advantages of CNN and Transformer for feature extraction from different types of images. Specifically, the Transformer’s proficiency in extracting global features renders it more suitable for extracting target information from infrared images, while the CNN’s superior sensitivity to capturing local information makes it more adept at extracting background texture information from visible images. Consequently, our DBIF architecture incorporates two distinct branches, content and detail, for feature extraction from infrared and visible images, respectively. Additionally, we introduce a Detailed Feature Enhancement Module (DFEM) to consolidate and amplify the prominent features extracted by the detailed branch. Through extensive experimentation across multiple datasets, we validate the effectiveness of our proposed approach, showcasing its superiority over existing fusion algorithms. Furthermore, our method shows substantial performance improvements, especially in object detection tasks. This underscores its practical relevance in various real-world applications that require accurate and efficient fusion of diverse image data types.

Haozhe Zhang, Rongpu Cui, Zhuohang Zheng, Shaobing Gao
Multi-dimensional Information Awareness Residual Network for Lightweight Image Super-Resolution

In recent years, lightweight image super-resolution technology has achieved good performance. However, many models struggle to effectively capture and process global information, leading to problems such as loss of detail and unnatural texture in reconstructed images. To solve these problems, we propose a Multi-Dimensional Information Awareness Residual Network (MIAN), which adopts a lightweight design to ensure efficient image reconstruction performance. Firstly, our MIAN effectively aggregates multi-scale context information through multi-layer channel distillation blocks (MCDB), which helps to extract important features layer by layer and reconstruct high-frequency details more accurately. Secondly, we design hierarchical spatial amplification attention (HSAA) to further enhance attention to key areas of the image and significantly improve the ability to capture and reconstruct details by layering the importance of different areas. Thirdly, we propose rapid channel perception attention (RCPA), which makes the network focus more on the information useful for the current task by optimizing the information interaction between feature channels. Finally, we introduce lightweight deepwise global self-attention (LDGA), which can identify and utilize similar features over a wide range and effectively preserve detail and texture information during reconstruction. Extensive experiments show that our MIAN significantly improves the quality of image super-resolution, reduces the parameters and computational cost, and achieves state-of-the-art super-resolution reconstruction performance.

Ziyan Wei, Zhiqing Guo, Liejun Wang
Feature Pruning and Recovery Learning with Knowledge Distillation for Occluded Person Re-Identification

Occluded person re-identification aims at retrieving holistic and occluded images of a specific identity based on occluded person images. Most existing methods incorporate external models to focus on visible body parts, which leads to high computational costs and fails to handle complex occlusions. To achieve high accuracy while maintaining low computational complexity, we propose a novel Feature Pruning and Recovery Learning with Knowledge Distillation (FPRL-KD) network. FPRL-KD is a teacher-student distillation architecture that effectively transfers the refined discriminative knowledge from the holistic branch to the occluded branch. Specifically, we devise a Feature Pruning Learning Module in the occluded branch to dynamically explore potential low-quality token features, which alleviates the interference from irrelevant information and noise in the images. Besides, we devise a Feature Recovery Learning Module that replaces these tokens with learnable embeddings to excavate robust and discriminative features. By integrating the enhanced knowledge distillation algorithm, the occluded branch is encouraged to learn from the holistic branch. Experimental results on two occluded datasets and two holistic datasets demonstrate the effectiveness and superiority of the proposed approach.
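The prune-and-recover idea described above can be sketched as scoring tokens, marking the lowest-scoring ones as low quality, and replacing them with a shared learnable embedding. The scoring head and the number of pruned tokens below are illustrative assumptions rather than the paper's exact modules; the knowledge distillation between branches is not shown.

```python
# Illustrative PyTorch sketch of token pruning and recovery (not FPRL-KD itself).
import torch
import torch.nn as nn

class PruneAndRecover(nn.Module):
    def __init__(self, dim, prune_num=16):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)                 # predicts a per-token quality score
        self.recover_embed = nn.Parameter(torch.zeros(1, 1, dim))
        self.prune_num = prune_num

    def forward(self, tokens):                          # tokens: (B, N, D)
        scores = self.scorer(tokens).squeeze(-1)        # (B, N)
        _, low_idx = scores.topk(self.prune_num, dim=1, largest=False)
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask.scatter_(1, low_idx, True)                 # mark low-quality tokens
        # Replace the marked tokens with the shared learnable recovery embedding.
        return torch.where(mask.unsqueeze(-1),
                           self.recover_embed.expand_as(tokens), tokens)
```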

Mengyu Hou, Wenjun Gan
A Redundancy-Suppression Based Event Sampling Method for Structured Representation

The output of an event camera is a stream of asynchronous, non-structured events, which are not convenient for visualization and application. It is therefore important to develop structured event representation (SER) methods to convert events into structured, image-like representations. Current SER methods primarily concentrate on the encoding of event attributes or features, while overlooking the event sampling process. Because one moving object will trigger a series of repeated events along its trajectory, overlooking event sampling will result in motion blur in the structured representation under complex motions or scenes. This paper aims to promote the study of SER from a new perspective and proposes a novel adaptive event sampling method. Specifically, a non-latest event suppression (NLES) approach is first proposed to identify the historical repeated events (i.e., redundant events) and the latest events triggered by the same stimulus. Owing to the interference of noise and timestamp perturbation, the latest events may not depict the latest motion state completely. Thus, we next design a redundancy measurement metric (RMM), which measures the ratio between redundant events and latest events, to control the sampling process of redundant events. Finally, by iteratively applying NLES and RMM on the global and local space, a novel redundancy-suppression based event sampling method (RSES) is proposed. RSES is a plug-and-play module that can be integrated with existing SER methods. Experimental results show that RSES can realize adaptive event sampling and effectively improve the visualization and downstream task performance of existing SER methods.
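The core of non-latest event suppression can be sketched as keeping, for each pixel, only the most recent event in the sampling window and counting the rest as redundant, which also yields a redundancy ratio. The iterative global/local control and the full RMM of RSES are not reproduced; this is a minimal sketch under those simplifying assumptions.

```python
# Minimal NumPy sketch of latest-event keeping per pixel (illustrative only).
import numpy as np

def suppress_non_latest(events, height, width):
    """events: array of (t, x, y, p) rows sorted by time; returns kept events
    and the redundancy ratio (#redundant / #latest)."""
    latest_index = -np.ones((height, width), dtype=np.int64)
    for i, (_, x, y, _) in enumerate(events):
        latest_index[int(y), int(x)] = i          # later events overwrite earlier ones
    keep = np.zeros(len(events), dtype=bool)
    keep[latest_index[latest_index >= 0]] = True  # keep only the latest event per pixel
    redundancy = (~keep).sum() / max(keep.sum(), 1)
    return events[keep], redundancy
```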

Jupo Ma, Shunhong Li, Wen Yang
BSDiff: Low-Light Image Enhancement Via Blueprint Separable Convolution and Wavelet-Diffusion Model

Enhancing low-light images to achieve proper exposure and clean visual effects poses a significant challenge in computational photography, and leveraging the generative capacity of diffusion models to yield satisfactory outcomes is a viable solution. Nonetheless, the performance of diffusion models in image restoration tasks tends to be unpredictable, often resulting in blurry details. In response to these issues, we propose a robust and effective low-light image enhancement method via blueprint separable convolution (BSConv) and a wavelet-diffusion model, called BSDiff. Specifically, we take advantage of the wavelet transform to preserve the detail information in the sub-bands and utilize the generative capacity of conditional diffusion models. To substantially reduce randomness in the inference process, we introduce a novel auxiliary loss function, the Charbonnier penalty, enabling the model to yield visually pleasing results. Furthermore, a high-frequency sub-band feature enhancement module based on BSConv is designed to stabilize the denoising capability of the model and effectively correct details and color discrepancies in images. Extensive experiments conducted on publicly available real-world benchmarks demonstrate that our method outperforms existing state-of-the-art methods in both objective performance and subjective visual quality.
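The Charbonnier penalty mentioned above is a smooth, outlier-tolerant alternative to L1/L2 reconstruction losses; a minimal form is shown below. How BSDiff weights it against the diffusion objective is the paper's choice and is not shown.

```python
# Minimal Charbonnier penalty (smooth L1-like reconstruction loss).
import torch

def charbonnier_loss(pred, target, eps=1e-3):
    return torch.mean(torch.sqrt((pred - target) ** 2 + eps ** 2))
```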

Jiajun Shi, Qingbing Sang
SAM and Diffusion Based Adversarial Sample Generation for Image Quality Assessment

In recent years, significant progress has been made in the field of no-reference Image Quality Assessment (NR-IQA) based on deep learning. However, methods relying on deep learning are prone to producing erroneous results under adversarial sample attacks. To address this issue, we investigated the generation of adversarial samples for quality assessment, aiming to test, evaluate, and enhance deep learning-based Image Quality Assessment (IQA) algorithms. Leveraging the characteristics of IQA, we employed the Natural Image Quality Evaluator (NIQE) algorithm to categorize images into different levels of distortion. Additionally, we utilized the Segment Anything Model (SAM) to identify target regions in images vulnerable to adversarial attacks. Subsequently, we improved the Diffusion Projected Gradient Descent (Diff-PGD) method using these insights. We developed a novel adversarial sample generation tool capable of producing adversarial samples with high attack success rates. Extensive attacks on multiple state-of-the-art quality assessment models using publicly available datasets demonstrated the superior performance of our proposed approach. Furthermore, we envision its utility in assisting IQA researchers in evaluating and enhancing the robustness of IQA algorithms.

Shan Wu, Qingbing Sang
CF-LAM: Coarse-to-Fine Locally Affine Matching for Viewpoint Transformations

Finding reliable correspondences between two images of the same scene is an important component in computer vision. Due to viewpoint transformations between the two views and differences between the images, the initial putative matches often contain a large number of outliers, which significantly impacts the effectiveness of downstream tasks. In this paper, we propose a handcrafted outlier filtering method named Coarse-to-Fine Locally Affine Matching (CF-LAM). First, coarse filtering is designed to calculate an adaptive threshold based on the distribution characteristics within a reliable iteration, in order to remove outliers with large deviations. Then, fine filtering is performed to calculate the outlier scores of the retained correspondences, based on residuals and rankings in each iteration. Finally, the consistency of the neighborhoods is calculated to preserve the final correspondences. We have conducted experiments on viewpoint transformation datasets in different scenes, compared with state-of-the-art methods, and achieved promising results.

Yongfu Lu, Bohan Li, Pengfei Zhang, Yong Li
FormerUnify: Transformer-Based Unified Fusion for Efficient Image Matting

Recently, deep learning-based methods in the field of image matting have incorporated additional modules and complex network structures to capture more comprehensive image information, thereby achieving higher accuracy. However, these innovations inevitably result in reduced inference speed and higher computational resource consumption. In this paper, we propose a Transformer-based unified fusion network for image matting, denoted as FormerUnify. Compared to existing methods, it is able to achieve a more optimal balance between accuracy and efficiency. FormerUnify is built upon the classic encoder-decoder framework, with its centerpiece being the Unified Fusion Decoder. This decoder is composed of three essential layers: a unify layer, a fusion layer, and an upsampling prediction head, all of which work in concert to unify and fuse the rich multi-scale features extracted by the encoder effectively. Furthermore, we couple the Unified Fusion Decoder with an advanced Transformer-based encoder and optimize their integration to enhance their compatibility and performance. Experimental evaluations on two synthetic datasets (Composition-1K and Distinctions-646) and a real-world dataset (AIM-500) affirm that FormerUnify achieves rapid inference speed without compromising its superior accuracy.

Jiaquan Wang
Camouflage Object Segmentation with Multi-scale Feature Aggregation and Boundary Generation

Camouflage object segmentation aims to segment objects that are similar to their surroundings. However, due to the inherent similarity between the foreground target and the background environment, it is difficult to fully extract discriminative features and accurately locate the boundary of the camouflaged object. To this end, a camouflaged object segmentation framework termed Multi-scale Feature Aggregation and Boundary Generation Networks (MFABGNet) is proposed. Specifically, we propose a multi-scale feature extraction encoder, which uses a Transformer to extract global background features, and an efficient Feature Reconstruction Convolution module to extract and enhance local foreground features. In addition, we propose a hierarchical feature aggregation decoder to facilitate the aggregation of multi-scale features. The proposed decoder consists of two modules: a Cascaded Feature Aggregation Module and a Boundary Generation Module. The former concentrates on addressing the scale disparities among various features, aggregates global and local features, and efficiently facilitates the interaction between foreground and background information. The latter leverages spatial information from low-level features to progressively refine the boundary layer by layer, thereby producing a boundary representation that accurately identifies the camouflaged object. The experimental results demonstrate that our model achieves competitive performance on the three benchmark datasets. Compared to other SOTA methods, the average maximum improvements in S-measure, E-measure, weighted F-measure and Mean Absolute Error are up to 17.6%, 22.1%, 47.5% and 10.8%, respectively. The codes are available at: https://github.com/jeremy0922/1.git .

Ye He, Wen Su, Jinfeng Gao, Guoqiang Jia
GCMLP: A Lightweight Network for Gamut Compression

Gamut compression emerges as a key technology in digital printing, ensuring minimal visual loss and color distortion from the design phase on monitors to the final print output. Monitors usually operate in a wide gamut (sRGB), and this color-rich representation is transformed and clipped to fit the printers’ smaller gamut (CMYK) when preparing images for printing, making it challenging to preserve the visual effects of the image. Existing algorithms either result in noticeable distortion after gamut mapping or are extremely time-consuming, and they all incur significant memory overhead. In this paper, we first introduce an embeddable lightweight gamut compression model based on a partitioning mechanism. Our specially designed masking encoder module divides the image into four partitions based on gamut and luminance features, tailored to meet the unique mapping characteristics of various regions. GCMLP requires only 31KB of storage to be saved as an annotation field in the original image, eliminating the need for additional memory space to store the processed image. For printing tasks, the model can be directly extracted from the original image for gamut compression. Comparative experiments show that GCMLP outperforms the industry-standard SGCK algorithm, reducing iCID by 22.58% and increasing SSIM by 12.30%. It achieves near-optimal performance in just 1/26 of the time. As part of this effort, we introduce a new gamut compression dataset of 2000 sRGB/CMYK images, which will aid in advancing research in this field.

Hao Xu, Xiaokai Du, Jiawei Zhu, Qin Wu, Zhilei Chai
Low-Light Light-Field Image Enhancement With Geometry Consistency

Low-light Light Field (LF) enhancement aims to recover high-quality LF images from their corresponding low-quality counterparts. Although some progress has been made, the effective utilization of LF geometry consistency for efficient low-light LF enhancement remains a challenge. Most existing methods do not adequately leverage LF geometry information during the light-up process of low-light LF, leading to a significant performance drop. To relieve this issue, we propose a low-light LF enhancement method that fully considers geometry consistency. By introducing a spatial-epipolar geometry information interaction model, our method is able to enhance low-light LF quality by effectively aggregating spatial information and epipolar geometry information from LF images. Moreover, to further generate higher-quality enhancements for low-light images, we design a three-stage network to enhance fine detail information. Experimental results reveal that our method demonstrates superior low-light light field enhancement capabilities compared to previous approaches.

Deyang Liu, Zhengqu Li, Xin Zheng, Jian Ma, Yuming Fang
Attention and Boundary Induced Feature Refinement Network for Camouflaged Object Detection

The intrinsic similarity between camouflaged objects and the background environment makes the camouflaged object detection (COD) task more challenging than traditional object detection. Since the boundary of a camouflaged object is difficult to determine, existing COD methods often cannot accurately identify the boundary details and complete structure of the camouflaged object. To solve these challenges, we propose a novel attention and boundary guided feature refinement network (ABNet) for improving the performance of COD. Specifically, ABNet mainly includes three main modules: a multi-resolution feature enhancement module (MFEM), an attention-induced edge-aware module (AIEM), and a boundary-guided feature interaction module (BFIM). The MFEM is introduced to enhance single-layer features and maintain high-quality detailed information. Additionally, the AIEM is designed to model edge features effectively from the enhanced features. Finally, the BFIM is incorporated to focus on the structural details of camouflaged objects, aiming to explore multi-level features between global and local contextual information simultaneously to facilitate more complete detection. Extensive experiments have proved the effectiveness of the proposed model, and our model demonstrates competitive performance compared to existing state-of-the-art models on four benchmark datasets.

Junmin Zhong, Anzhi Wang
SFformer: Adaptive Sparse and Frequency-Guided Transformer Network for Single Image Derain

Recently, transformer models have become prominent for the single image deraining (SID) task. However, these models often fail to utilize frequency knowledge and appropriate self-attention mechanisms effectively, leading to inadequate extraction of rain features and persistent artifacts. To alleviate this problem, we propose an Adaptive Sparse and Frequency-Guided Transformer Network (SFformer) for single image deraining. Specifically, we propose an Adaptive Sparse Attention (ASA) module to selectively pay attention to the most useful channels for better feature aggregation. In addition, considering that rain streaks mainly correspond to the high-frequency components in the image, we introduce a Frequency-Guided Feedforward (FGF) module to focus on rain streaks. Integrating these proposed modules into a UNet backbone, extensive experimental results on commonly used benchmarks show that the proposed method outperforms current state-of-the-art methods. The source code of our work is available at https://github.com/HuluBaba/ECEDerain.

Xinrui Wang, Hongyun Zhang, Kecan Cai, Duoqian Miao, Qi Zhang, Miao Li
Towards Specular Highlight Removal Through Diffusion Model

Undesirable specular highlights degrade both the visual quality of images and the performance of subsequent image processing tasks. While recent deep learning-based methods have achieved notable advancements in highlight removal, challenges such as highlight residual and color inconsistency still exist. To address these challenges, our study proposes an end-to-end highlight removal network with a conditional diffusion model, which has achieved promising results in image restoration. To decrease highlight residual, we design a highlight detection module providing binary highlight location masks based on a TransUNet architecture, improving highlight detection accuracy from 0.97 to 0.98. To rectify color distortions, we integrate a color extraction module that supplies illumination-invariant color map priors, ensuring color fidelity in the dehighlighted output. Furthermore, we introduce a feature loss into our model during the training process rather than depending on pixel loss only, which provides our model with feature-level information and better preserves the content and structure of the original image. To the best of our knowledge, this is the first diffusion-based method tailored for highlight removal. Extensive experiments on the benchmark SHIQ dataset demonstrate that our method obtains competitive results compared with current state-of-the-art methods and significantly improves the SSIM value from 0.939 to 0.966.
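A feature-level loss of the kind mentioned above is commonly realised by comparing activations of a frozen pretrained network between the output and the reference, as sketched below. Whether the paper uses VGG features, and at which layers, is an assumption of this sketch rather than a statement about its implementation.

```python
# Minimal perceptual/feature loss sketch using frozen VGG16 activations
# (illustrative choice of backbone and layer).
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class FeatureLoss(torch.nn.Module):
    def __init__(self, layer_index=16):                 # up to relu3_3 by default
        super().__init__()
        vgg = vgg16(weights=VGG16_Weights.DEFAULT).features[:layer_index].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)                     # keep the extractor frozen
        self.extractor = vgg

    def forward(self, output, reference):
        return F.l1_loss(self.extractor(output), self.extractor(reference))
```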

Lu Pan, Hongwei Zhao, Runze Wu
FIR: A Plug-in Feature-to-Image Reconstruction Method for Feature Coding for Machines

With the groundbreaking development of Artificial Intelligence (AI) technology, the volume of video and image consumed by machines has surpassed that consumed by humans. Consequently, video coding for machines technology has grown rapidly, with feature coding as a prominent technique demonstrating exceptional compression and task performance. This technology has developed rapidly and has entered the stages of chip integration and industrialization. Feature coding for machines can only reconstruct feature tensors, not images. However, in typical machine vision application scenarios such as smart cities, industrial quality inspection, intelligent transportation, and automated broadcasting, there still persists a demand for video and image review. These needs are essential for retrospective analysis of abnormal events, confirmation of incidents, and evidentiary purposes. This work attempts to reconstruct images from feature tensors extracted for machine vision tasks to meet the needs of human visual observation. We propose a lightweight and plug-in feature-to-image reconstruction method for feature coding for machines, with low complexity neural network blocks. The proposed method achieves an average Peak Signal-to-Noise Ratio (PSNR) of up to 28.92 dB, meeting the requirements of human visual perception.

Yuan Zhang, Junda Xue, Huifen Wang, Yunlong Li, Lu Yu
Focal Aggregation Transformer for Light Field Image Super-Resolution

Transformer has achieved significant progress in light field image super-resolution (LFSR) due to its long-range dependency learning ability for inter-intra view feature aggregation. However, locality information of each sub-aperture view is ignored in intra-view and inter-view aggregation with Transformer, hampering the high-quality light field image reconstruction. To this end, we propose a global to local aggregation approach termed Focal Aggregation for LFSR. In particular, Focal Aggregation includes two strategies: inter-view global to local aggregation (InterG2L) and intra-view global to local aggregation (IntraG2L). InterG2L is proposed to obtain complementary information from different views. IntraG2L is developed to extract efficient representations of a single sub-aperture view. InterG2L and IntraG2L are organized in a cascade way so that the global information of the input can be gathered for each sub-aperture image in a coarse to fine aggregation approach. Meanwhile, we also develop a global to local hierarchical feature aggregation approach named HierG2L, which enhances the last hierarchical feature used for light field reconstruction according to the input. Based on the above three global to local aggregation strategies, we construct a focal aggregation transformer (FAT) for LFSR. Experiments are performed on commonly-used LFSR benchmarks. Results demonstrate that FAT achieves superior results compared with other leading methods on synthesized and real data.

Shunzhou Wang, Yao Lu, Wang Xia
CoMoFusion: Fast and High-Quality Fusion of Infrared and Visible Image with Consistency Model

Generative models are widely utilized to model the distribution of fused images in the field of infrared and visible image fusion. However, current generative-model-based fusion methods often suffer from unstable training and slow inference speed. To tackle this problem, a novel fusion method based on the consistency model is proposed, termed CoMoFusion, which can generate high-quality images and achieve fast image inference speed. Specifically, the consistency model is used to construct multi-modal joint features in the latent space with the forward and reverse process. Then, the infrared and visible features extracted by the trained consistency model are fed into a fusion module to generate the final fused image. In order to enhance the texture and salient information of fused images, a novel loss based on pixel value selection is also designed. Extensive experiments on public datasets illustrate that our method obtains SOTA fusion performance compared with existing fusion methods. The code is available at https://github.com/ZhimingMeng/CoMoFusion.

Zhiming Meng, Hui Li, Zeyang Zhang, Zhongwei Shen, Yunlong Yu, Xiaoning Song, Xiaojun Wu
Patch Attacks on Vision Transformer via Skip Attention Gradients

Vision Transformers (ViTs) have demonstrated exceptional performance across various computer vision tasks. ViTs utilize the attention mechanism for feature extraction, thereby enhancing global information capture compared to convolutional neural networks. Currently, adversarial attack methods designed for ViTs focus on creating adversarial samples using the degree of attention between tokens, while neglecting the impact of ViTs' nonlinear factors. In this paper, we propose the Skip Attention Gradient (SAG) method to avoid the nonlinear effects of attention by leveraging the linear nature of ViTs to generate adversarial examples. We demonstrate that the adversarial patch selected by SAG can promote a more linear propagation of adversarial effects, thereby enhancing the attack's effectiveness. Our method achieves a robust accuracy on DeiT-B that is 7% lower than that of the Patch-Fool method.

Haoyu Deng, Yanmei Fang, Fangjun Huang
Multi-scale Progressive Reconstruction Network for High Dynamic Range Imaging

High dynamic range (HDR) imaging aims to reconstruct ghost-free and detail-rich HDR images from multiple low dynamic range (LDR) images. Challenges such as exposure saturation and significant motion in the LDR image sequence can result in quality issues like ghosting, blurring, and distortion in the final synthesized image. To address these challenges, we present a new approach called Multi-Scale Progressive Reconstruction Network (MPRNet). The network consists of an encoder-decoder, Multi-Scale Progressive Reconstruction Module (MSPRM), and Dual-Stream Reconstruction Module (DERM). MSPRM utilizes a feature pyramid to tackle large-scale motions gradually. It incorporates an attention mechanism and scale selection module to progressively refine motion information within and across scales. DERM adopts a symmetric dual-stream structure to concurrently perform exposure recovery and content reconstruction. It guides the fine-grained restoration of overexposed regions through a joint loss function. The experimental results indicate that the MPRNet fusion results outperform the dominant models in qualitative and quantitative assessments, especially in accurately representing exposure-saturated regions, preserving non-aligned edge details, and maintaining color fidelity.

Ying Qi, Qiushi Li, Jian Li, Zhaoyuan Huang, Teng Wan, Qiang Zhang
Backmatter
Metadata
Title
Pattern Recognition and Computer Vision
Editors
Zhouchen Lin
Ming-Ming Cheng
Ran He
Kurban Ubul
Wushouer Silamu
Hongbin Zha
Jie Zhou
Cheng-Lin Liu
Copyright Year
2025
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-9786-85-5
Print ISBN
978-981-9786-84-8
DOI
https://doi.org/10.1007/978-981-97-8685-5