
Open Access 01.04.2024

Hierarchical Patch Aggregation Transformer for Motion Deblurring

Authors: Yujie Wu, Lei Liang, Siyao Ling, Zhisheng Gao

Published in: Neural Processing Letters | Issue 2/2024


Abstract

The encoder-decoder framework built on Transformer components has become a design paradigm for image deblurring architectures. In this paper, we critically revisit this approach and find that many current architectures focus excessively on limited local regions during the feature extraction stage. These designs compromise the feature richness and diversity of the encoder-decoder framework, creating bottlenecks for performance improvement. To address these deficiencies, a novel Hierarchical Patch Aggregation Transformer architecture (HPAT) is proposed. In the initial feature extraction stage, HPAT combines Axis-Selective Transformer Blocks with linear complexity, supplemented by an adaptive hierarchical attention fusion mechanism. These mechanisms enable the model to effectively capture the spatial relationships between features and integrate features from different hierarchical levels. We then redesign the feedforward network of the Transformer block in the encoder-decoder structure and propose the Fusion Feedforward Network. This effective aggregation enhances the ability to capture and retain local detailed features. We evaluate HPAT through extensive experiments and compare its performance with baseline methods on public datasets. Experimental results show that the proposed HPAT model achieves state-of-the-art performance in image deblurring tasks.
Notes
These authors contributed equally to this work.


1 Introduction

In recent years, driven by the continuous advancement of imaging equipment and computing technology, the demand for high-quality images has become increasingly urgent. However, real-world imaging scenarios involve various uncertainties and disturbances, such as camera shake [1], relative motion, and lighting changes [2], which inevitably lead to image blur. These blurring phenomena significantly reduce the ability of humans and devices to perceive information in images. As a result, the accuracy and precision of advanced vision tasks such as image segmentation [3], autonomous driving [4], and satellite monitoring [5] are adversely affected.
To solve this classic ill-posed problem, many model-based methods [6–14] formulate image deblurring as a constrained optimization problem [15, 16]. These methods require carefully designed strong priors and regularization terms to constrain the solution space. However, their complex iterative solution processes limit practical application, and they are often difficult to adapt to diverse types of blur and scenes, showing weak robustness in complex situations.
With the rapid development of large-scale datasets and convolutional neural networks (CNNs), learning-based methods [17–23] have achieved remarkable results in the field of image restoration. These methods exploit abundant image data to implicitly learn the mapping between blurred and sharp images. By training models to minimize the discrepancy between the restored output and the sharp reference, they effectively achieve deblurring and detail restoration. However, the practical applicability of these methods is limited by the feature extraction style and the characteristics of the convolution operation in CNNs [24, 25]. They are sensitive to input image size, prone to introducing distortion and noise, and have difficulty processing global information effectively. Therefore, researchers have explored multi-scale processing of input images to enable models to capture information at various scales. In particular, combining encoder-decoder structures with residual learning or generative adversarial networks [26, 27] has been attempted to enhance performance. However, the encoder-decoder design [20] mainly focuses on recovering image features layer by layer; this mechanism, which relies only on basic local feature extraction, cannot fully capture the global semantic information of the source image. This limitation hinders the model's ability to accurately understand the underlying semantics of blurred images, and as the structure deepens, it often fails to preserve the fine-grained local features of the input image.
Recently, a powerful feature extractor based on the self-attention mechanism, the Transformer, has attracted widespread attention in the field of computer vision [28–30]. Transformers excel at modeling long-range dependencies and correlations between different parts of an image, making them particularly effective for dense tasks such as image deblurring. Nonetheless, the quadratic computational complexity that arises when handling large feature maps or increasing the number of attention heads still needs to be addressed [31]. Furthermore, existing Transformer-based methods cannot dynamically identify feature importance. Features at different levels may have different semantics and representation capabilities, and treating them as equivalent entities leads to information confusion and loss. Simply concatenating or adding features from different layers often fails to fully explore the potential interdependencies across layers, thus limiting the model's ability to solve complex image deblurring challenges. In addition, the feedforward network of the Transformer usually focuses on nonlinear transformations [32, 33], but within a limited receptive field it adopts the same set of learned parameters for all inputs, regardless of the underlying fine-grained features in overlapping image patches. This limitation wastes local features and may lead to artifacts and distortions.
In this paper, a novel Hierarchical Patch Aggregation Transformer Network (HPAT) is proposed, which aims to further explore the potential of Transformer networks for image deblurring tasks. Specifically, in the feature extraction stage, we design an additional hierarchical attention fusion module to adaptively integrate features at different levels and enhance the network's ability to model global semantic information within the image. The hierarchical attention fusion module consists of cascaded axis-selective Transformer blocks and a cross-layer feature interaction mechanism. Since objects in an image may change position and orientation while maintaining their semantic information, we propose to use axis-selective Transformer blocks with shared weights to compute image similarity. This method effectively captures the spatial relationships and structural information between features with linear complexity. The hierarchical attention fusion module facilitates the cross-layer flow of attention information and adaptively fuses features at different levels, which overcomes the uniform feature treatment and limitations of the original Transformer model. Our work enables higher-level semantic learning at lower computational cost. In addition, we replace the feedforward module in the Transformer block with the Fusion Feedforward Network (F3N), which aggregates more token information to improve performance and restore smooth boundaries and overall semantics.
The main contributions of our work are summarized as follows:
  • In the feature extraction stage, the Hierarchical Attention Fusion Module (HAFM) is proposed to address the lack of semantic information in the early stages of the encoder-decoder structure. This module utilizes an axis-selective Transformer block with linear complexity to model the spatial relationships between pixels, adaptively enhancing important features in global space. Subsequently, the cross-layer feature interaction mechanism realizes similarity representation and enhancement of features across different levels.
  • F3N is used to replace the standard feedforward network structure in the Transformer model. F3N is able to match and align sub-token segments corresponding to the same pixel position, and can aggregate information from different patches for the same pixel position. The problem of losing fine-grained local features during the decoding process is solved without adding additional learnable parameters.
  • HPAT introduces a novel Transformer model to capture local and global spatial information effectively. Through extensive experiments and ablation studies, HPAT has demonstrated competitive results on image deblurring, as shown in Fig. 1.

2 Related Work

2.1 CNN-Based Image Deblurring

Digital images are composed of high-dimensional pixels. Although the location and orientation of objects in an image may change, the semantic content usually remains invariant. Convolutional neural networks (CNNs) have demonstrated remarkable performance [34] across various image processing tasks by leveraging local patterns and weight sharing. A CNN not only adapts to translation invariance within images [24], but also learns to handle variations in images through transfer learning and data augmentation.
In the domain of image deblurring, researchers have introduced many CNN-based approaches. Sun, Schuler et al. [35] employed a CNN for image deblurring and obtained encouraging results. Nah et al. [1] proposed DeepDeblur, which directly learns the deblurring process in an end-to-end manner from image data. Tao et al. [36] introduced SRN, which establishes residual skip connections at various scales to capture multi-scale features from the original blurred images. It focuses on differences between blurred and sharp regions rather than differences across the entire image.
Furthermore, Kupyn et al. [37] introduced adversarial learning to address pixel-level and content-level errors. Mao et al. [38] introduced DeepRFT, which redesigns residual blocks by combining channel-wise fast Fourier transforms [39] to model high-frequency and low-frequency information simultaneously. Zamir et al. [40] proposed MPRNet, a parallelizable multi-stage architecture that achieves multi-resolution information fusion in the image reconstruction stage. Chen et al. [41] introduced HINet, which recalibrates the mean and variance of features by incorporating instance normalization with learnable affine parameters. MAXIM, proposed by Tu et al. [42], has a symmetric encoder-decoder structure and utilizes axis gating [43] and spatial attention gating weights over fixed windows and grids, which promotes global and local spatial interactions. Guo et al. [44] introduced MFFDNet, which uses a densely connected structure for feature fusion and enhances the deblurring effect by reusing extracted features.

2.2 Transformer-Based Image Deblurring

The Transformer architecture, with its global receptive field, has been widely used in the vision field. The Vision Transformer, proposed by Dosovitskiy et al. [45] and applied to image data processing, demonstrated strong adaptability to different image sizes. However, the Vision Transformer divides the image into fixed-size patches and applies linear transformations, which may lead to information loss, especially for pixel-level tasks.
To solve this problem, IPT [32] was proposed as an end-to-end model that improves performance by introducing self-attention mechanisms into traditional convolutional feature extractors. Liu et al. [33] introduced the Swin Transformer, which incorporates local window attention and shift operations to prevent information loss and facilitate feature propagation across different resolutions. Nonetheless, an excessive emphasis on global information modeling can lead to an imbalance with local information. To address this problem, Wang et al. [46] introduced Uformer, a structured encoder-decoder architecture constructed with locally-enhanced window Transformer blocks. This design effectively captures both local and global dependencies for image restoration tasks. Similarly, the Restormer structure proposed by Zamir et al. [47] employs gating mechanisms, local content mixing, cross-feature covariance computation, and local context blending to enhance information flow and global connectivity modeling. Additionally, CS-Kit [48] groups similar blocks, employs spatial local attention among these blocks, and utilizes Scale-aware Patch Embedding (SPE) to achieve mixed-scale patch aggregation, catering to the requirements of locality, non-locality, and cross-scale aggregation.

2.3 Feedforward Network

In the Transformer architecture [49], feedforward networks are key components of both the encoder and decoder blocks. A feedforward network consists of classic fully connected layers that perform nonlinear transformations on the output of the self-attention layer. This structure adapts to data complexity and generates new representations to propagate to subsequent layers. Because the feedforward network has a fixed size independent of the input length, the Transformer can be applied to sequences of different lengths.
However, for image deblurring tasks, feedforward networks must learn to recognize and restore fine-grained details lost during the blurring process. Traditional feedforward networks struggle to fully exploit the contextual information in the input features to generate sharp images [50]. This limitation can lead to poor results and reduce the efficiency of detail restoration. Furthermore, the receptive field of traditional feedforward networks is limited, which hinders their ability to effectively capture long-range dependencies in input images. To address these shortcomings, feedforward network variants have emerged in Transformer architectures [46, 47] in recent years. In Sects. 3.3 and 4.3.2 we provide an in-depth comparison between our proposed F3N and these variants.

3 Hierarchical Patch Aggregation Transformer for Motion Deblurring

Our proposed Hierarchical Patch Aggregation Transformer Network (HPAT) is an efficient multi-level network based on Transformer blocks, which aims to fully exploit the local and non-local features of images, thereby improving image deblurring capabilities. The structure of HPAT is shown in Fig.  2.

3.1 Overall Pipeline

Given a degraded image \(I \in \mathbb {R}^{3 \times H \times W}\), we first use a linear mapping block to learn the inter-pixel correlations within local regions and obtain shallow features \(F_0 \in \mathbb {R}^{H \times W \times C}\). Subsequently, the Hierarchical Attention Fusion Module (HAFM) transforms the shallow features \(F_0\) into enhanced features \(F_E\). In this module, \(F_0\) captures spatial relationships sequentially along the height and width dimensions of the image through M Axis-Selective Transformer Blocks (ASTB). This effectively filters out more informative details, producing intermediate features \(F_1,F_2,...,F_M \in \mathbb {R}^{H \times W \times C}\). The cross-layer feature interaction mechanism then aggregates activations from different layers to form the enhanced map \(F_E \in \mathbb {R}^{H \times W \times C}\). Next, the optimized \(F_E\) is passed through a symmetric 4-stage encoder-decoder structure. Each encoder-decoder stage is composed of a pair of Local Enhancement Transformer Blocks (LETB), each consisting of a Window-based Multi-head Self-Attention (W-MSA) residual module and a feedforward network residual module, yielding encoder features \(E_i \in \mathbb {R}^{\frac{H}{2 ^i} \times \frac{W}{2^ i} \times 2^i C}\) and decoder features \(D_i \in \mathbb {R}^{\frac{H}{2^i} \times \frac{W}{2^i} \times 2^ i C}\), where \(i=0,1,2,3\). To recover the lost high-frequency information, skip connections are established between the encoder and decoder features at the corresponding level. Finally, a clean image is generated by adding the residual \(Re\in \mathbb {R}^{3 \times H \times W}\) to the input image.
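For intuition, the data flow described above can be summarized in a runnable schematic. The sketch below is ours, not the authors' released code: plain convolutional blocks stand in for the ASTB, CFIM, and LETB modules, and the channel widths are illustrative; only the overall topology (shallow features, M-stage feature extraction with fusion, 4-level U-shaped encoder-decoder with skip connections, residual output) follows the description.

```python
import torch
import torch.nn as nn

def block(c):
    # Placeholder for a Transformer block (ASTB or LETB); a conv keeps the sketch runnable.
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.GELU())

class HPATSketch(nn.Module):
    """Schematic HPAT forward pass: shallow features -> hierarchical feature extraction
    and fusion -> 4-level encoder-decoder with skips -> residual added to the input."""
    def __init__(self, dim=32, num_astb=3, levels=4):
        super().__init__()
        self.shallow = nn.Conv2d(3, dim, 3, padding=1)              # linear mapping block
        self.astbs = nn.ModuleList(block(dim) for _ in range(num_astb))
        self.fuse = nn.Conv2d(dim * num_astb, dim, 1)               # stands in for CFIM
        self.enc = nn.ModuleList(block(dim * 2 ** i) for i in range(levels))
        self.down = nn.ModuleList(nn.Conv2d(dim * 2 ** i, dim * 2 ** (i + 1), 2, stride=2)
                                  for i in range(levels - 1))
        self.dec = nn.ModuleList(block(dim * 2 ** i) for i in reversed(range(levels - 1)))
        self.up = nn.ModuleList(nn.ConvTranspose2d(dim * 2 ** (i + 1), dim * 2 ** i, 2, stride=2)
                                for i in reversed(range(levels - 1)))
        self.to_rgb = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, blurred):
        f = self.shallow(blurred)
        feats = []
        for astb in self.astbs:                                     # F_1 ... F_M
            f = astb(f)
            feats.append(f)
        x = self.fuse(torch.cat(feats, dim=1))                      # enhanced feature F_E
        skips = []
        for i, e in enumerate(self.enc):                            # encoder path E_0 ... E_3
            x = e(x)
            if i < len(self.down):
                skips.append(x)
                x = self.down[i](x)
        for u, d, s in zip(self.up, self.dec, reversed(skips)):     # decoder path with skips
            x = d(u(x) + s)
        return blurred + self.to_rgb(x)                             # add residual Re

print(HPATSketch()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 3, 256, 256])
```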
We use pixel-level Charbonnier loss [19] to ensure that HPAT reduces artifacts while maintaining sharpness.
$$\begin{aligned} \begin{aligned} \mathbb {L}(G_i, {I'}_i) = \frac{1}{N} \sqrt{{\left\| G_i-{I}'_i \right\| }^2 + {\xi }^2}, \end{aligned} \end{aligned}$$
(1)
here, \(G_i\) represents the pixel values of the ground truth, \({I'}_i\) denotes the pixel values output by the model, and \(\xi =10^{-3}\) is utilized for smoothing the loss function.
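For reference, the per-pixel form of the Charbonnier loss in Eq. (1) that is commonly used in practice can be written in a few lines of PyTorch; this is a generic sketch, not the authors' exact training code.

```python
import torch

def charbonnier_loss(restored: torch.Tensor, sharp: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier loss: a smooth, differentiable approximation of the L1 distance,
    averaged over all N pixels (Eq. 1), with eps = 1e-3 as the smoothing constant."""
    diff = restored - sharp
    return torch.sqrt(diff * diff + eps * eps).mean()

# Example with random stand-ins for the model output I' and the ground truth G
print(charbonnier_loss(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)).item())
```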

3.2 Hierarchical Attention Fusion Module

In the encoder-decoder model built by the Transformer block [46, 47], the encoder stage usually compresses the input image into a lower-dimensional feature representation space, and then gradually reconstructs the image in the decoder stage. In addition, each layer of the encoder-decoder structure processes features independently, so that it cannot fully utilize global context information, thus limiting its feature representation capabilities. Therefore, we propose to integrate a hierarchical attention fusion module in the feature extraction stage to enhance the capability of the encoder-decoder structure. This strategic addition can more effectively exploit global contextual information, thereby enhancing feature representation capabilities and understanding of the overall semantic aspects of the image. The hierarchical attention fusion module includes the Axis-Selective Transformer Block (ASTB) and the Cross-layer Feature Interaction Mechanism (CFIM) module, as shown in Fig. 2.

3.2.1 Axis-Selective Transformer Block

To avoid excessive computational costs in the feature extraction stage, the ASTB divides the multi-head self-attention computation into operations along the height and width dimensions. Unlike existing methods [51, 52], our axis attention method is sequential and follows an ordered concatenation approach rather than executing in parallel. In addition, a filtering mechanism is proposed in the feed-forward network to adaptively select and emphasize key feature information. This improvement focuses attention more effectively on valuable features, thereby enhancing the model’s feature expression and reconstruction capabilities. As shown in Fig. 2, the information flow of ASTB can be explained as follows:
$$\begin{aligned} \begin{aligned} \hat{f^L}&= CA-MSA(LN(f^{L-1})) + f^{L-1}\\ f^L&= FEN(LN(\hat{f^{L}})) + \hat{f^L}, \end{aligned} \end{aligned}$$
(2)
where LN denotes the layer normalization operation, and \(\hat{f^{L}}\) and \(f^L\) denote the outputs of the cross-axis multi-head self-attention (CA-MSA) and the Feature Enhancement Network (FEN), respectively.
Figure 3a illustrates the entire process of cross-axis multi-head self-attention (CA-MSA). To begin with, a combination of \(1\times 1\) and \(3\times 3\) depthwise convolutions is applied to the layer-normalized features \(X_0 \in \mathbb {R}^{H \times W \times C}\) for nonlinear mapping. This transforms the features into representations suitable for the attention mechanism, specifically queries (Q), keys (K), and values (V). The \(1\times 1\) convolution is utilized to map input features to a higher-dimensional space, while the \(3\times 3\) depthwise convolution is employed to capture more extensive contextual information within the input features. To perform the dot product operation along a specific axis and capture attention information along that axis, we reshape the dimensions of Q, K, and V to obtain \(Q'\in \mathbb {R}^{H \times W \times C}\),\(K'\in \mathbb {R}^{C \times H \times W}\) and \(V'\in \mathbb {R}^{C \times H \times W}\). This reshaping is done to simplify the computation of the dot product. The self-attention mechanism is utilized to compute correlations between different positions, capturing dependencies among features. These dependencies are commonly linked to spatial positions and channel expressions. Therefore, we need j independent heads to autonomously learn correlations among different channels. These heads are acquired by splitting \(Q', K'\), and \(V'\) along the channel dimension, represented as: \({{q'}^1,...,{q'}^j} \in \mathbb {R}^{H \times W \times (C/j)}\), \({{k'}^1,...,{k'}^j} \in \mathbb {R}^{(C/j) \times H \times W}\), \({{v'}^1,...,{v'}^j} \in \mathbb {R}^{(C/j) \times H \times W}\). The similarity between each pair of heads \(q^j\) and \(k^j\) is calculated through dot product operations, followed by normalization using the softmax function, resulting in the attention weight matrix \({A^j} \in \mathbb {R}^{H \times W} \) along the height axis. This matrix reveals the distribution of importance of input features along the height axis. Subsequently, through dot product operations between \(A^j\) and \(V'\) for each head, a weighted averaging operation on \(V'\) is achieved, emphasizing more important feature information along the height axis. Ultimately, concatenating the results of the weighted averaging along the channel dimension and applying reshape and \(1\times 1\) convolution operations result in the output feature \(X_h\) along the height axis. The formulation of this process is as follows:
$$\begin{aligned} X_h = \text{Conv}_{1\times 1}\left( \text{Reshape}\left( \text{Concat}_{i=1}^j\left( \text{softmax}\left( {{q'}^i {k'}^i}/\beta \right) {v'}^i\right) \right) \right) , \end{aligned}$$
(3)
where i denotes the i-th head of \(q'\), \(k'\), and \(v'\), \(\beta \) represents the scaling factor, and Concat denotes the concatenation operation. This ensures that the feature representation at each position incorporates information from all height positions, providing global and comprehensive information for subsequent processing. Next, a reshape operation is performed to execute attention along the width axis.
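The tensor layout of CA-MSA above is specific to this paper, so we do not reproduce it literally. The sketch below instead shows the generic form of the idea, standard multi-head self-attention restricted to the height axis (axial attention), where every column attends only over its H positions; the width-axis step is symmetric. Dimensions and head counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HeightAxisAttention(nn.Module):
    """Multi-head self-attention along the height axis only: each column of the
    feature map is treated as an independent sequence of length H, so the cost
    scales with H*W*H rather than (H*W)^2 for full spatial attention."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        seq = x.permute(0, 3, 2, 1).reshape(b * w, h, c)    # (batch*width, height, channels)
        out, _ = self.attn(seq, seq, seq)                   # attend over height positions
        return out.reshape(b, w, h, c).permute(0, 3, 2, 1)  # back to (B, C, H, W)

x = torch.randn(2, 32, 64, 48)
print(HeightAxisAttention(32)(x).shape)  # torch.Size([2, 32, 64, 48])
```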
In the processing of large-scale image data and real-time applications, it is crucial to reduce redundant information, enhance feature compactness, and improve representation efficiency. In this case, it becomes imperative to filter out features with relatively low information content during feature extraction. As shown in Fig. 3b, compared with the Gated-Dconv feed-forward network (GDFN) [47], our Feature Enhancement Network (FEN) retains two parallel paths, each of which contains a gating mechanism to evaluate feature importance more accurately. This nuanced approach further augments the effectiveness of filtering. Given the normalized tensor \(X\in \mathbb {R}^{\hat{H}\times \hat{W}\times \hat{C}}\), we begin by evenly splitting the channel count into two halves, creating two pathways. Each pathway utilizes a combination of \(1\times 1\) convolutions and \(3\times 3\) depthwise convolutions to capture local information. Subsequently, the GELU activation function and element-wise multiplication operations are applied to gate the features from the two branches, thereby filtering and emphasizing crucial information. This enables the network to learn essential details from the input data. Ultimately, the information from the two branches is fused via element-wise summation. The formulation of FEN is as follows:
$$\begin{aligned} \begin{aligned} Gating = \varnothing \left( {{W_d}^0{W_p}^0X} \right)&\bigodot \left( {{W_d}^1{W_p}^1X} \right) + \varnothing \left( {{W_d}^1{W_p}^1X} \right) \bigodot \left( {{W_d}^0{W_p}^0X} \right) \\ Y&= {W_p}^2Gating(X) + X, \end{aligned} \end{aligned}$$
(4)
where \({W_p}^i(i=0,1,2)\) denotes a \(1\times 1\) convolution, \({W_d}^j(j=0,1)\) signifies a \(3\times 3\) depthwise convolution, \(\varnothing \) signifies the GELU activation function, and \(\bigodot \) represents element-wise multiplication operation.
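A direct reading of Eq. (4) gives the following sketch of FEN: two parallel 1×1 + 3×3 depthwise convolution branches gate each other through GELU and element-wise products, followed by a 1×1 projection and a residual connection. The hidden-channel expansion factor is our assumption; the paper does not state it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FEN(nn.Module):
    """Feature Enhancement Network following Eq. (4): mutual gating of two
    1x1-conv + 3x3-depthwise-conv branches, then a 1x1 projection and residual."""
    def __init__(self, dim: int, expansion: int = 2):   # expansion factor is assumed
        super().__init__()
        hidden = dim * expansion
        self.branch0 = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),                               # W_p^0
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden))  # W_d^0 (depthwise)
        self.branch1 = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),                               # W_p^1
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden))  # W_d^1 (depthwise)
        self.project = nn.Conv2d(hidden, dim, 1)                     # W_p^2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b0, b1 = self.branch0(x), self.branch1(x)
        gating = F.gelu(b0) * b1 + F.gelu(b1) * b0    # the two paths gate each other
        return self.project(gating) + x               # Y = W_p^2 Gating(X) + X

print(FEN(32)(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```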

3.2.2 Cross-Layer Feature Interaction Mechanism

In the feature extraction stage, features from different layers usually have varying representation capabilities and importance. However, if each layer processes features independently, it will constrain the diversity and depth of feature expressions, potentially compromising the ability of the encoder-decoder structure to capture global context.
As illustrated in Fig. 4, the features from M individual Axis-Selective Transformer Blocks (ASTB) are concatenated along the channel dimension to form the reshaped feature \(F_M\in \mathbb {R}^{H\times W\times MC}\). Subsequently, feature transformation yields the Q, K, and V tensors for calculating attention weights and feature fusion, expressed as follows:
$$\begin{aligned} \begin{aligned} Q = {W_Q}{F_M} \\ K = {W_K}{F_M} \\ V = {W_V}{F_M} \end{aligned} \end{aligned}$$
(5)
where \({W_Q}\), \({W_K}\), and \({W_V}\) represent the weight matrices.
To derive a layer-wise correlation matrix \(L_A \in \mathbb {R}^{M \times M}\), which describes inter-layer relationships and regulates the allocation of attention weights among layers, we reshape the dimensions of Q, K, and V into \(M \times HWC\). Subsequently, the layer-wise correlation matrix \(L_A\) is multiplied with the reshaped values V, facilitating the weighted fusion of features across different layers. This process aims to retain crucial feature information while eliminating irrelevant features. Finally, the output of the self-attention mechanism is fused with the original input features to comprehensively preserve the original information. The core steps of the Cross-Layer Feature Interaction Mechanism (CFIM) are as follows:
$$\begin{aligned} L_A&= \text{Softmax}\left( \frac{\hat{Q}\hat{K}}{\beta }\right) \hat{V}\\ F_E&= {W_p}^1 L_A + F_M, \end{aligned}$$
(6)
where \(\beta \) represents the scaling factor, \(\hat{Q}, \hat{K}, \hat{V}\) are the reshaped Q, K, V matrices, \(L_A\) is the layer correlation matrix, \(F_M\) denotes the reshaped features concatenated along the channel dimension, \({W_p}^1\) represents a \(1\times 1\) convolution, and \(F_E\) is the output enhanced feature.
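Under a direct reading of Eqs. (5)-(6), CFIM treats each of the M layer outputs as a single token and computes an M×M layer-level attention map; the sketch below reflects that reading. The final 1×1 convolution that brings the fused M·C channels back to C (so that \(F_E\) has C channels, as stated in Sect. 3.1) is our assumption, since the paper leaves this bookkeeping implicit.

```python
import torch
import torch.nn as nn

class CFIM(nn.Module):
    """Cross-Layer Feature Interaction: attention over M layer-level tokens (Eqs. 5-6)."""
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        cat_dim = dim * num_layers
        self.num_layers = num_layers
        self.to_q = nn.Conv2d(cat_dim, cat_dim, 1)      # W_Q
        self.to_k = nn.Conv2d(cat_dim, cat_dim, 1)      # W_K
        self.to_v = nn.Conv2d(cat_dim, cat_dim, 1)      # W_V
        self.wp = nn.Conv2d(cat_dim, cat_dim, 1)        # W_p^1 in Eq. (6)
        self.out = nn.Conv2d(cat_dim, dim, 1)           # assumed reduction back to C channels
        self.beta = nn.Parameter(torch.tensor(1.0))     # scaling factor

    def forward(self, feats):
        f_m = torch.cat(feats, dim=1)                   # (B, M*C, H, W)
        b, mc, h, w = f_m.shape
        m = self.num_layers

        def tok(t):  # reshape so that each layer becomes one token of length H*W*C
            return t.reshape(b, m, (mc // m) * h * w)

        q, k, v = tok(self.to_q(f_m)), tok(self.to_k(f_m)), tok(self.to_v(f_m))
        attn = torch.softmax(q @ k.transpose(1, 2) / self.beta, dim=-1)  # (B, M, M)
        la = (attn @ v).reshape(b, mc, h, w)            # layer-weighted fusion
        fused = self.wp(la) + f_m                       # Eq. (6): W_p^1 L_A + F_M
        return self.out(fused)                          # enhanced feature F_E (C channels)

feats = [torch.randn(1, 32, 64, 64) for _ in range(3)]
print(CFIM(32, 3)(feats).shape)  # torch.Size([1, 32, 64, 64])
```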

3.3 Fusion Feedforward Network

We have investigated three distinct architectures of feedforward networks, one of which emphasizes a global perspective and another which prioritizes local enhancement. However, we observe limitations in the performance of these two architectures in image deblurring tasks. To surmount these limitations, we introduce a novel fusion feedforward network (F3N) designed to integrate and reconstruct features using folding and unfolding operations.
First, we introduce the feedforward network of the traditional Vision Transformer (ViT)[29, 45]. As shown in Fig. 5a, it usually includes two linear layers and a GELU activation function. This structure somewhat limits the model’s ability to capture local information. The linear layers mainly conduct fully connected projections, which may result in the neglect of certain local relationships, particularly those related to spatial connections between pixels.
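For concreteness, the standard ViT feedforward block described above is simply two position-wise linear layers with a GELU in between; a minimal sketch follows (the 4x expansion ratio is the conventional choice, assumed here).

```python
import torch
import torch.nn as nn

class ViTFeedForward(nn.Module):
    """Standard Transformer FFN: Linear -> GELU -> Linear, applied to each token independently."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * expansion),
            nn.GELU(),
            nn.Linear(dim * expansion, dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, N, C)
        return self.net(tokens)

print(ViTFeedForward(64)(torch.randn(2, 196, 64)).shape)  # torch.Size([2, 196, 64])
```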
Next, inspired by pioneering work [46], we adopt the locally enhanced feedforward network shown in Fig. 5b. It begins with a linear projection using a \(1\times 1\) convolution to increase the feature dimension of each token. The projected tokens are then reshaped into a 2D feature map, and a \(3\times 3\) depthwise separable convolutional layer is used to capture local information. Finally, the tokens are remapped from the feature map, and a \(1\times 1\) convolution reduces the dimensionality so that the output integrates with the Transformer structure and maintains the consistency of the overall model. By introducing the TokensToImg operation and depthwise convolution, this feedforward network better captures information between adjacent pixels in 2D space, thereby achieving improved spatial feature extraction and propagation. However, despite the effectiveness of local convolutional operations for neighboring pixels within a limited receptive field, they constrain the network's ability to learn fine-grained features across larger regions of the image. This limitation may result in suboptimal performance when addressing deblurring tasks that require a larger receptive field.
To address these concerns, we propose the Fusion Feedforward Network (F3N). Differing from the \(3\times 3\) depthwise separable convolutional layer used in Fig. 5b, F3N employs fold and unfold operations to partition the image into multiple blocks and enhances the receptive field by fusing the features of pixels in overlapping regions. Since these two operations are fixed transformations without trainable parameters, they introduce no additional parameters. Specifically, we aggregate corresponding values from adjacent patches to enhance the network's ability to learn fine-grained features. This architectural design not only empowers the network to exploit broader contextual information but also enables the capture of extensive interdependencies across different parts of the image, culminating in more precise and robust deblurring outcomes. As illustrated in Fig. 5c, we apply a \(1\times 1\) convolution to each token for linear projection, increasing its feature dimension. Subsequently, the Fold operation divides the tokens into 9 blocks of size \(7\times 7\). The Unfold operation then unfolds the divided blocks in a specific order, ensuring overlapping regions between these blocks. We create a normalizer tensor to balance feature values. Specifically, the normalizer is designed as a tensor with all elements equal to 1, ensuring that normalized weights are used in the fusion process. This guarantees reasonable weight allocation for pixel features in overlapping regions, preventing certain regions from dominating the fusion process and maintaining the rationality and stability of the fusion results. Finally, the fused feature tensor is reshaped to the original shape and undergoes a linear transformation through a \(1\times 1\) convolution to obtain the output.
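The aggregation-and-normalization step at the heart of F3N corresponds to the standard overlap-add pattern offered by torch.nn.functional.unfold/fold: overlapping patches are extracted, summed back onto the pixel grid, and a folded tensor of ones supplies the per-pixel overlap count used as the normalizer. The patch size and stride below are illustrative assumptions and only cover this aggregation step; in F3N itself, per-patch token mixing and the surrounding \(1\times 1\) projections happen around it.

```python
import torch
import torch.nn.functional as F

def overlap_aggregate(x: torch.Tensor, patch: int = 7, stride: int = 4) -> torch.Tensor:
    """Fuse the features of pixels covered by several overlapping patches.
    x: (B, C, H, W). Overlapping patch x patch blocks are extracted, summed back
    onto the (H, W) grid, and normalized by how many patches cover each pixel."""
    b, c, h, w = x.shape
    # Unfold: (B, C*patch*patch, L), one column per overlapping block.
    patches = F.unfold(x, kernel_size=patch, stride=stride)
    # (Per-patch token mixing would be applied to `patches` here.)
    # Fold sums the overlapping contributions back onto the image grid.
    summed = F.fold(patches, output_size=(h, w), kernel_size=patch, stride=stride)
    # Normalizer: folding a tensor of ones counts the overlaps per pixel.
    counts = F.fold(F.unfold(torch.ones_like(x), kernel_size=patch, stride=stride),
                    output_size=(h, w), kernel_size=patch, stride=stride)
    return summed / counts

x = torch.randn(1, 16, 31, 31)     # 31 = 7 + 6*4, so the blocks tile the map exactly
print(overlap_aggregate(x).shape)  # torch.Size([1, 16, 31, 31])
```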

4 Experiments and Results

In this section, we employ the widely recognized Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) [53] metrics for quantitative comparison. We compare our proposed HPAT method with various state-of-the-art methods to evaluate its progress. Furthermore, we conduct comprehensive ablation experiments to thoroughly analyze how the design introduced in the previous section affects the effectiveness of image deblurring.
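For reference, both metrics can be computed with scikit-image as below; this is a generic sketch, and the exact evaluation protocol (e.g. RGB vs. luminance channel, border cropping) may differ from the scripts used for the reported numbers.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(restored: np.ndarray, sharp: np.ndarray):
    """restored, sharp: uint8 RGB images of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(sharp, restored, data_range=255)
    ssim = structural_similarity(sharp, restored, channel_axis=-1, data_range=255)
    return psnr, ssim

# Example with random images; a real evaluation averages over the whole test set.
rng = np.random.default_rng(0)
restored = rng.integers(0, 256, (256, 256, 3), dtype=np.uint8)
sharp = rng.integers(0, 256, (256, 256, 3), dtype=np.uint8)
print(evaluate_pair(restored, sharp))
```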

4.1 Datasets and Experimental Settings

(1) Datasets
The GoPro dataset [1], which includes 2103 pairs of high-quality real-world image samples at a resolution of 1280×720, was used to train the HPAT model. Additionally, we select 1111 pairs of completely distinct test images from this dataset. These test images cover a wide range of scenarios and situations, including irregular object motion, close-range defocusing, backlighting, and other variations. To validate the model's generalization capability, we test the model on 2025 pairs of images from the HIDE dataset [54], which exhibit blurriness due to inaccurate focus caused by varying depths in the captured scenes. We also evaluate the model on 980 pairs of low-light or high-light scene images from the RealBlur_J dataset and 980 pairs of nighttime low-light scene images from the RealBlur_R dataset [55].
(2) Experimental settings
We train the HPAT model end-to-end with PyTorch [56] version 1.9.0 on an A100 SXM4 GPU in a desktop computer. No pretrained networks are utilized, and the AdamW optimizer [57] with default parameters is employed to minimize the reconstruction error. Specifically, images are partitioned into patches of size 256×256. Data augmentation with random horizontal flipping is applied to training samples, and the batch size is 16. The initial learning rate is set to 2e-4 and updated with a cosine annealing strategy [58], decaying to 1e-6 over 3000 epochs. During the feature extraction stage, M is set to 3, with ASTB depths of 1, 2, 4 and 1, 2, 4 attention heads. In the 4-level encoder-decoder stage, the encoder depths are 1, 2, 8, 8, and the attention heads are 1, 2, 4, 8.
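The optimizer and schedule described above correspond to a standard PyTorch setup; the sketch below is a hedged illustration (the model is a stand-in and the scheduler arguments are inferred from the text, not taken from released code).

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 3, 3, padding=1)     # stand-in for the HPAT network
optimizer = AdamW(model.parameters(), lr=2e-4)  # default betas and weight decay
scheduler = CosineAnnealingLR(optimizer, T_max=3000, eta_min=1e-6)  # anneal to 1e-6 over 3000 epochs

for epoch in range(3000):
    # ... iterate over 256x256 crops (batch size 16, random horizontal flips),
    #     compute the Charbonnier loss, loss.backward(), optimizer.step() ...
    scheduler.step()
```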
(3) Parameters and detection time
The model parameters and inference times of the proposed model and the comparison models were measured on the GoPro dataset, as shown in Table 1. Our model has a moderate number of parameters and an inference time of 328 ms, demonstrating relatively efficient performance when processing a single input. The moderate parameter count and the relatively efficient inference time make our model feasible in scenarios requiring high-quality deblurring.
Table 1
A comparison of the parameter count and inference time for different architectures on the GoPro dataset

Method              Params(M)   Time(ms)   Year
DeblurGan-v2[26]    61          42         2018
DMPHN[56]           22          212        2019
DBGAN[59]           12          909        2020
MIMO-UNet[60]       16          22         2020
DeepRFT[38]         10          242        2021
MPRNet[40]          20          104        2021
HINet[41]           89          31         2022
DGUNet[61]          18          -          2022
NAFNet[62]          68          92         2022
IPT[32]             114         -          2020
Uformer-B[46]       51          310        2021
Restormer[47]       26          798        2022
MFFDNet[44]         38          123        2023
Our                 58          328        2023
Table 2
Quantitative comparisons of our models with the SOTA deblurring methods on the GoPro dataset. The best and 2nd best results are highlighted and underlined

Method              PSNR\(\uparrow \)   SSIM\(\uparrow \)   Year
DeblurGan-v2[26]    29.08               0.873               2018
DMPHN[56]           30.45               0.902               2019
DBGAN[59]           31.10               0.942               2020
MIMO-UNet[60]       32.44               0.933               2020
DeepRFT[38]         32.82               0.938               2021
MPRNet[40]          32.66               0.936               2021
HINet[41]           32.77               0.936               2022
DGUNet[61]          32.71               0.937               2022
NAFNet[62]          32.87               0.948               2022
IPT[32]             32.58               0.935               2020
Uformer-B[46]       32.97               0.942               2021
Restormer[47]       32.92               0.940               2022
MFFDNet[44]         32.87               0.959               2023
Our                 33.43               0.961               2023

4.2 Results and Analysis

4.2.1 Deblurring Results on the GoPro Dataset

Quantitative Evaluation: On the GoPro dataset, we compared our proposed HPAT method against existing CNN-based and Transformer-based methods, as depicted in Table 2. HPAT outperforms all competing methods in both PSNR and SSIM metrics. Compared to the runner-up, HPAT achieved an increase of 0.46 dB in PSNR on the GoPro dataset, and an improvement of 10% in SSIM.
Qualitative Evaluation: Visual comparisons between our HPAT model and the contrastive methods on the GoPro dataset are illustrated in Figs. 6 and 7. Evidently, our model has restored images with greater clarity. Prior methods exhibited shortcomings in restoring the overall contour of vehicles and intricate details of tire rims. They also struggled to accurately recover fine details in rapidly moving human limbs. Overall, our results exhibit closer edge and detail texture resemblance to real images, especially when dealing with fast-moving subjects.

4.2.2 Deblurring Results on the HIDE Dataset

Table 3
Performance comparison on HIDE dataset. Our method is trained on the GoPro dataset and applied to the HIDE dataset. The best and second-best scores are highlighted and underlined

Method              PSNR\(\uparrow \)   SSIM\(\uparrow \)   Params(M)   Time(ms)   Year
DeepDeblur[1]       25.73               0.874               12          86         2017
DeblurGan-v2[26]    27.51               0.848               61          40         2018
DMPHN[56]           27.79               0.864               22          217        2019
DBGAN[59]           28.97               0.913               12          899        2020
MIMO-UNet[60]       29.99               0.906               16          21         2020
DeepRFT[38]         30.99               0.919               10          236        2021
MPRNet[40]          30.96               0.917               20          98         2021
HINet[41]           30.33               0.909               89          24         2022
DGUNet[61]          30.96               0.918               18          -          2022
Uformer-B[46]       30.89               0.919               51          307        2021
Restormer[47]       31.22               0.921               26          778        2022
MFFDNet[44]         30.16               0.932               38          119        2023
Our                 30.79               0.939               58          314        2023
Quantitative Evaluation: Following previous work [42], we evaluated the model trained on the GoPro dataset directly on the HIDE dataset. The results in Table 3 show that HPAT obtains competitive PSNR and the highest SSIM. Although its PSNR is slightly lower than that of the best-performing prior model, its SSIM surpasses all competing methods. Notably, while achieving an improvement of 0.46 dB in PSNR on the GoPro dataset, HPAT also attains an increase of 7.4% in SSIM. Since the model was never trained on this dataset, these results highlight its strong generalization capability.
In order to solve the problem of suboptimal PSNR performance on the HIDE dataset, we conducted an in-depth analysis of the characteristics of the dataset and the design direction of the HPAT model. The HIDE dataset centers around images featuring individuals with concealed or partially obscured features, encompassing complex backgrounds, occlusions, and varying degrees of blur. These factors create challenges for PSNR assessment. In this context, the HPAT model emphasizes global information processing. Its performance in pixel-level accuracy may be slightly less pronounced compared to models that focus on reconstructing local pixel-level details. It is worth noting the difference between the evaluation metrics: PSNR emphasizes pixel-level errors, while SSIM focuses more on perceptual features such as image structure, contrast, and texture. Since the HPAT model prioritizes maintaining the consistency of the overall structure, when the image contains complex scenes and diverse content, its pixel-level reconstruction performance may be relatively poor compared with methods that pay more attention to local details.
Qualitative Evaluation: Visual comparison results between our HPAT method and evaluation methods on the HIDE dataset are displayed in Figs. 8 and  9. For facial deblurring, our approach not only restores rich detail information but also avoids chromatic aberration and artifact occurrence compared to other methods.

4.2.3 Deblurring Results on the RealBlur Dataset

Table 4
The average PSNR and SSIM on the RealBlur dataset. Applying our GoPro trained model directly on the RealBlur set

Methods             RealBlur_J (PSNR\(\uparrow \) / SSIM\(\uparrow \))   RealBlur_R (PSNR\(\uparrow \) / SSIM\(\uparrow \))   Average (PSNR\(\uparrow \) / SSIM\(\uparrow \))   Year
DeblurGan-v2[26]    26.68 / 0.815                                        33.41 / 0.928                                        30.05 / 0.872                                     2018
DMPHN[56]           26.75 / 0.825                                        33.21 / 0.936                                        29.98 / 0.881                                     2019
DeepRFT[38]         26.66 / 0.823                                        34.03 / 0.943                                        30.35 / 0.883                                     2021
MPRNet[40]          26.51 / 0.820                                        33.91 / 0.942                                        30.21 / 0.881                                     2021
HINet[41]           26.36 / 0.800                                        33.80 / 0.938                                        30.08 / 0.869                                     2021
MSSNet[63]          26.59 / 0.826                                        33.93 / 0.945                                        30.26 / 0.886                                     2022
DGUNet[61]          26.60 / 0.824                                        33.96 / 0.943                                        30.28 / 0.884                                     2022
Uformer-B[46]       26.65 / 0.828                                        33.85 / 0.943                                        30.25 / 0.886                                     2021
Restormer[47]       26.63 / 0.823                                        33.98 / 0.946                                        30.31 / 0.885                                     2022
MFFDNet[44]         28.42 / 0.864                                        35.71 / 0.949                                        32.07 / 0.910                                     2023
Ours                28.76 / 0.876                                        36.02 / 0.954                                        32.39 / 0.915                                     2023
Quantitative Evaluation: Consistent with previous practice [46, 47], we further verified our proposed model’s generalization capability on the RealBlur dataset. It’s important to note that we directly employed the model trained on the GoPro dataset for evaluation on the RealBlur dataset, without additional training. Table 4 summarizes quantitative analysis results on the RealBlur_J and RealBlur_R datasets. The outcomes underscore the robust generalization ability of our model.
Qualitative Evaluation: Figs. 10,  11,  12 and  13 showcase the deblurring efficacy of our approach on the RealBlur dataset and compare it with contrastive methods. Satisfactory results were achieved in both high-light and low-light scene blurring scenarios. Notably, our method successfully restored text and patterns with greater precision and authenticity.
In the depicted results from Figs. 14 and  15, we showcase the deblurring effectiveness of our model on autonomously captured images in real-world scenarios. In comparison to alternative models, our outcomes exhibit superiority in terms of both clarity and visual appeal. The reconstructed scenes not only accurately restore the shapes of objects such as clocks, railings, and windows but also capture intricate details like scales, pointers, and bars to a considerable extent.

4.3 Ablation Experiments

In this section, we conduct ablation experiments on each module of the HPAT structure to analyze their effectiveness and contributions. These ablation experiments were performed solely on the GoPro dataset. For efficient evaluation, training was conducted on image patches of size 128×128.

4.3.1 Hierarchical Attention Fusion Module

Overall, the complete integration of the Hierarchical Attention Fusion Module led to a PSNR gain of 0.16 dB, as presented in Table 5h. Subsequently, we incrementally introduced the functional blocks of the module to assess their individual effectiveness. However, Table 5f indicates that blindly incorporating the ASTB during the feature extraction phase to capture long-range dependencies between pixels led to a decrease in metrics. This phenomenon might stem from the lack of non-linear modeling, thereby failing to adequately integrate and filter features, resulting in irrelevant features interfering with the final enhanced feature. Table 5g corroborates this speculation but still presents inconsistencies in generated features at higher levels, causing difficulty in effectively utilizing different hierarchical feature information, especially lacking efficient contextual information in deeper layers. The CFIM further mitigated these shortcomings, facilitating global feature modeling and integration, effectively overcoming limitations in early-stage feature processing within the encoder-decoder structure, as indicated in Table 5h.
To validate the effectiveness of prioritizing global information over local features, we replaced the global strategy of the HAFM module with local strategies using either a single convolution or multiple convolutions. As indicated in Table 6, when prioritizing global information, the HAFM module exhibited at least a 0.14 dB improvement in PSNR compared to the local strategies.
Considering global consistency, blur usually affects the entire image, so prioritizing global information aids in restoring the overall consistency and coherence of the image. For contextual understanding, global information assists in comprehending the scenes and objects within an image. It guides the deblurring algorithm to recover details more effectively in crucial areas while maintaining overall visual harmony. Concerning advanced feature extraction, global information facilitates extracting high-level features that encompass the overall structure and content of the image, offering improved guidance for the deblurring process. When dealing with extensive blurring, such as that caused by camera motion or prolonged exposure, which often affects the entire image, deblurring methods that rely on local features may perform poorly; attention to global information proves more effective in such cases.

4.3.2 Fusion Feedforward Network

To assess the effectiveness and rationale of the Fusion Feedforward Network (F3N), we first examined the impact of fusion operations within F3N. This involved a comparison of various block partitioning methods and sizes. From the results in Table 7, we observe that smaller combinations of ’Number of Patches’ and ’Patch Size’ may result in a reduced receptive field, inadequate for capturing global image information. This limitation could impede the effective transmission and capture of global features, leading to poor PSNR performance, particularly in deblurring tasks with extensive contextual information. Conversely, choosing larger combinations of ’Number of Patches’ and ’Patch Size’ could result in an overly large receptive field, introducing an excess of contextual information and complicating information fusion.
Table 5
Component analysis of HPAT

Variant              Component                       PSNR\(\uparrow \)(dB)
Baseline(a)          Model with Resblock(d)          32.72
Encoder-Decoder(b)   LETB(W-MSA+F3N)(e)              32.83
HAFM(c)              ASTB(CA-MSA+FFN)(f)             32.78
                     ASTB(CA-MSA+F3N)(g)             32.82
                     ASTB(CA-MSA+FEN)+CFIM(h)        32.99
Dividing the image feature tensor into 9 blocks of size \(7\times 7\) resulted in optimal performance according to the PSNR metric. This choice achieves a balance between receptive field size and information fusion, offering an effective and efficient mechanism for feature learning. Similarly, as seen in the results presented in Table 8, this approach demonstrated superior performance in the PSNR metric, underscoring the effectiveness of this partitioning scheme in information transmission and preservation.
Table 6
Dismantling analysis of the efficacy of local features and global information components

Variant   Component                   PSNR\(\uparrow \)(dB)
HAFM      Conv                        32.87
          Conv+Conv+Conv              32.91
          ASTB(CA-MSA+FEN)+CFIM       32.99
Table 7
Study on the ablation of feedforward networks by different block partitioning methods

Number of Patches   Patch size   PSNR\(\uparrow \)(dB)
4                   14           32.93
9                   7            32.99
16                  5            32.97
25                  4            32.94
Table 8
Ablation of feedforward networks with different block sizes

Number of Patches   Patch size   PSNR\(\uparrow \)(dB)
9                   5            32.95
                    7            32.99
                    10           32.98
                    14           32.96
Table 9
Ablation of feedforward networks in encoder-decoder

Variant               Component            PSNR\(\uparrow \)(dB)
Feedforward Network   LETB(W-MSA+FFN)      32.87
                      LETB(W-MSA+LEFF)     32.95
                      LETB(W-MSA+F3N)      32.99
To further validate the effect of the fusion feedforward network within the encoder-decoder structure, we maintained other network structures and hyperparameters constant, while solely replacing the feedforward network for the ablation study with the following three options: (1) FFN, (2) LEFF, and (3) F3N. Table 9 showcases their respective outcomes, demonstrating performance improvements that validate the positive impact of this design. Our Fusion Feedforward Network, when compared to the commonly found feedforward networks in Transformer structures, effectively reduces jagged boundary artifacts through multiple aggregations and integrations of overlapping pixel regions. Simultaneously, it incorporates more surrounding pixel information, thereby enhancing the richness of feature representation.

5 Conclusion and Prospects

In this paper, we introduce a novel multi-level network structure (HPAT) based on the Transformer, which aims to fully exploit local and non-local image features. Specifically, we propose the Hierarchical Attention Fusion Module (HAFM) to leverage global context information during feature extraction. This module enhances feature expression and improves the learning ability of the encoder-decoder architecture. Next, we introduce the Fusion Feedforward Network (F3N), which uses folding and unfolding operations for feature fusion and reconstruction. This technique prevents aliasing artifacts in image deblurring, captures a wider range of dependencies, and produces more accurate and robust deblurring results. Our approach demonstrates performance advantages, interpretability, and generalization capabilities, supported by validation on a variety of datasets. Finally, we conduct thorough ablation experiments to confirm the effectiveness and contribution of each module in improving image deblurring results. Our work mainly focuses on processing global information and still faces performance challenges when processing small objects or scenes with complex details. In future research, our focus will be on optimizing the trade-off between global and local information.

Acknowledgements

This work has been partially supported by the Sichuan Science and Technology Program (Grant Nos. 2022YFG0095 and 2023YFG0300).

Declarations

Conflict of interest

Not applicable.

Ethical approval

Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Literatur
1.
Zurück zum Zitat Nah S, Kim TH, Lee KM (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3883–3891 Nah S, Kim TH, Lee KM (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3883–3891
2.
Zurück zum Zitat Lai WS, Shih Y, Chu LC et al (2022) Face deblurring using dual camera fusion on mobile phones. ACM Transact Graph (TOG) 41(4):1–16CrossRef Lai WS, Shih Y, Chu LC et al (2022) Face deblurring using dual camera fusion on mobile phones. ACM Transact Graph (TOG) 41(4):1–16CrossRef
3.
Zurück zum Zitat Li Y, Li X (2023) Automatic segmentation using deep convolutional neural networks for tumor CT images. Int J Pattern Recogn Artif Intell 37(03):2352003CrossRef Li Y, Li X (2023) Automatic segmentation using deep convolutional neural networks for tumor CT images. Int J Pattern Recogn Artif Intell 37(03):2352003CrossRef
4.
Zurück zum Zitat McManamon P, Piracha U, Jameson S et al (2023) Special section guest editorial: autonomous vehicles. Opt Eng 62(3):031201–031201CrossRef McManamon P, Piracha U, Jameson S et al (2023) Special section guest editorial: autonomous vehicles. Opt Eng 62(3):031201–031201CrossRef
5.
Zurück zum Zitat Yang M, Jiao L, Liu F et al (2019) Transferred deep learning-based change detection in remote sensing images. IEEE Transact Geosci Remote Sens 57(9):6960–6973CrossRef Yang M, Jiao L, Liu F et al (2019) Transferred deep learning-based change detection in remote sensing images. IEEE Transact Geosci Remote Sens 57(9):6960–6973CrossRef
6.
Zurück zum Zitat Rudin LI, Osher S, Fatemi E (1992) Nonlinear total variation based noise removal algorithms. Physica D Nonlinear Phenom 60(1–4):259–268MathSciNetCrossRef Rudin LI, Osher S, Fatemi E (1992) Nonlinear total variation based noise removal algorithms. Physica D Nonlinear Phenom 60(1–4):259–268MathSciNetCrossRef
7.
Zurück zum Zitat Dabov K, Foi A, Katkovnik V et al (2007) Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transact Image Process 16(8):2080–2095MathSciNetCrossRef Dabov K, Foi A, Katkovnik V et al (2007) Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transact Image Process 16(8):2080–2095MathSciNetCrossRef
8.
Zurück zum Zitat Hyun Kim T, Ahn B, Mu Lee K (2013) Dynamic scene deblurring. In: Proceedings of the IEEE international conference on computer vision, pp 3160–3167 Hyun Kim T, Ahn B, Mu Lee K (2013) Dynamic scene deblurring. In: Proceedings of the IEEE international conference on computer vision, pp 3160–3167
9.
Zurück zum Zitat Xu L, Zheng S, Jia J (2013) Unnatural l0 sparse representation for natural image deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1107–1114 Xu L, Zheng S, Jia J (2013) Unnatural l0 sparse representation for natural image deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1107–1114
10.
Zurück zum Zitat Pan J, Hu Z, Su Z et al (2016) Soft-segmentation guided object motion deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 459–468 Pan J, Hu Z, Su Z et al (2016) Soft-segmentation guided object motion deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 459–468
11.
Zurück zum Zitat He K, Sun J, Tang X (2010) Single image haze removal using dark channel prior. IEEE Transact Pattern Anal Mach Intell 33(12):2341–2353 He K, Sun J, Tang X (2010) Single image haze removal using dark channel prior. IEEE Transact Pattern Anal Mach Intell 33(12):2341–2353
12.
Zurück zum Zitat Gu S, Zhang L, Zuo W et al (2014) Weighted nuclear norm minimization with application to image denoising. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2862–2869 Gu S, Zhang L, Zuo W et al (2014) Weighted nuclear norm minimization with application to image denoising. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2862–2869
13.
Zurück zum Zitat Dong W, Zhang L, Shi G et al (2011) Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transact Image Process 20(7):1838–1857MathSciNetCrossRef Dong W, Zhang L, Shi G et al (2011) Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transact Image Process 20(7):1838–1857MathSciNetCrossRef
14.
Zurück zum Zitat Xie J, Hou G, Wang G et al (2021) A variational framework for underwater image dehazing and deblurring. IEEE Transact Circuits Syst Video Technol 32(6):3514–3526CrossRef Xie J, Hou G, Wang G et al (2021) A variational framework for underwater image dehazing and deblurring. IEEE Transact Circuits Syst Video Technol 32(6):3514–3526CrossRef
15.
He R, Zheng WS, Tan T et al (2013) Half-quadratic-based iterative minimization for robust sparse representation. IEEE Trans Pattern Anal Mach Intell 36(2):261–275
16.
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2(1):183–202
17.
18.
Li J, Tan W, Yan B (2021) Perceptual variousness motion deblurring with light global context refinement. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4116–4125
19.
Zamir SW, Arora A, Khan S et al (2020) Learning enriched features for real image restoration and enhancement. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp 492–511
20.
Gao Z, Li E, Wang Z et al (2021) Object reconstruction based on attentive recurrent network from single and multiple images. Neural Process Lett 53:653–670
21.
Park D, Kang DU, Kim J et al (2020) Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training. In: European conference on computer vision, pp 327–343
22.
Suin M, Purohit K, Rajagopalan AN (2020) Spatially-attentive patch-hierarchical network for adaptive motion deblurring. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3606–3615
23.
Lim S, Kim J, Kim W (2020) Deep spectral-spatial network for single image deblurring. IEEE Signal Process Lett 27:835–839
24.
Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: Computer vision–ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part I 13, pp 818–833
25.
Li X, Wu J, Lin Z et al (2018) Recurrent squeeze-and-excitation context aggregation net for single image deraining. In: Proceedings of the European conference on computer vision (ECCV), pp 254–269
26.
Kupyn O, Martyniuk T, Wu J et al (2019) DeblurGAN-v2: deblurring (orders-of-magnitude) faster and better. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8878–8887
27.
Wang M, Hou S, Li H et al (2019) Generative image deblurring based on multi-scaled residual adversary network driven by composed prior-posterior loss. J Vis Commun Image Represent 65:102648
28.
Jiang G, Chen H, Wang C et al (2022) Transformer network intelligent flight situation awareness assessment based on pilot visual gaze and operation behavior data. Int J Pattern Recogn Artif Intell 36(05):2259015
29.
Liang J, Cao J, Sun G et al (2021) SwinIR: image restoration using Swin Transformer. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1833–1844
30.
Chu X, Tian Z, Wang Y et al (2021) Twins: revisiting the design of spatial attention in vision transformers. Adv Neural Inf Process Syst 34:9355–9366
31.
Yuan L, Hou Q, Jiang Z et al (2022) VOLO: vision outlooker for visual recognition. IEEE Trans Pattern Anal Mach Intell 45(5):6575–6586
32.
Chen H, Wang Y, Guo T et al (2021) Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12299–12310
33.
Liu Z, Lin Y, Cao Y et al (2021) Swin Transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
34.
Morikawa C, Kobayashi M, Satoh M et al (2021) Image and video processing on mobile devices: a survey. Vis Comput 37(12):2931–2949
35.
Schuler CJ, Hirsch M, Harmeling S et al (2015) Learning to deblur. IEEE Trans Pattern Anal Mach Intell 38(7):1439–1451
36.
Tao X, Gao H, Shen X et al (2018) Scale-recurrent network for deep image deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8174–8182
37.
Kupyn O, Budzan V, Mykhailych M et al (2018) DeblurGAN: blind motion deblurring using conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8183–8192
38.
Mao X, Liu Y, Shen W et al (2021) Deep residual Fourier transformation for single image deblurring. arXiv preprint arXiv:2111.11745
39.
Chi L, Jiang B, Mu Y (2020) Fast Fourier convolution. Adv Neural Inf Process Syst 33:4479–4488
40.
Zamir SW, Arora A, Khan S et al (2021) Multi-stage progressive image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14821–14831
41.
Chen L, Lu X, Zhang J et al (2021) HINet: half instance normalization network for image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 182–192
42.
Tu Z, Talebi H, Zhang H et al (2022) MAXIM: multi-axis MLP for image processing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5769–5780
43.
Dauphin YN, Fan A, Auli M et al (2017) Language modeling with gated convolutional networks. In: International conference on machine learning, pp 933–941
44.
Guo C, Wang Q, Dai HN et al (2023) Multi-stage feature-fusion dense network for motion deblurring. J Vis Commun Image Represent 90:103717
45.
Dosovitskiy A, Beyer L, Kolesnikov A et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
46.
Wang Z, Cun X, Bao J et al (2022) Uformer: a general U-shaped transformer for image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 17683–17693
47.
Zamir SW, Arora A, Khan S et al (2022) Restormer: efficient transformer for high-resolution image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5728–5739
48.
Lee H, Choi H, Sohn K et al (2023) Cross-scale KNN image transformer for image restoration. IEEE Access 11:13013–13027
49.
Wu H, Xiao B, Codella N et al (2021) CvT: introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 22–31
50.
Yuan K, Guo S, Liu Z et al (2021) Incorporating convolution designs into visual transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 579–588
51.
Ho J, Kalchbrenner N, Weissenborn D et al (2019) Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180
52.
Wang H, Zhu Y, Green B et al (2020) Axial-DeepLab: stand-alone axial-attention for panoptic segmentation. In: European conference on computer vision, pp 108–126
53.
Wang Z, Bovik AC, Sheikh HR et al (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
54.
Shen Z, Wang W, Lu X et al (2019) Human-aware motion deblurring. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5572–5581
55.
Rim J, Lee H, Won J et al (2020) Real-world blur dataset for learning and benchmarking deblurring algorithms. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp 184–201
56.
Zhang H, Dai Y, Li H et al (2019) Deep stacked hierarchical multi-patch network for image deblurring. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5978–5986
58.
59.
Zhang K, Luo W, Zhong Y et al (2020) Deblurring by realistic blurring. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2737–2746
60.
Cho SJ, Ji SW, Hong JP et al (2021) Rethinking coarse-to-fine approach in single image deblurring. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4641–4650
61.
Mou C, Wang Q, Zhang J (2022) Deep generalized unfolding networks for image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 17399–17410
62.
Chen L, Chu X, Zhang X et al (2022) Simple baselines for image restoration. In: European conference on computer vision, pp 17–33
63.
Kim K, Lee S, Cho S (2022) MSSNet: multi-scale-stage network for single image deblurring. In: European conference on computer vision, pp 524–539
Metadata
Title: Hierarchical Patch Aggregation Transformer for Motion Deblurring
Authors: Yujie Wu, Lei Liang, Siyao Ling, Zhisheng Gao
Publication date: 01.04.2024
Publisher: Springer US
Published in: Neural Processing Letters / Issue 2/2024
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI: https://doi.org/10.1007/s11063-024-11594-0