
Open Access 01.04.2024

Hierarchical Patch Aggregation Transformer for Motion Deblurring

Authors: Yujie Wu, Lei Liang, Siyao Ling, Zhisheng Gao

Published in: Neural Processing Letters | Issue 2/2024


Abstract

The encoder-decoder framework built on Transformer components has become a design paradigm for image deblurring architectures. In this paper, we critically revisit this approach and find that many current architectures focus excessively on limited local regions during the feature extraction stage. These designs compromise the feature richness and diversity of the encoder-decoder framework, creating bottlenecks for performance improvement. To address these deficiencies, a novel Hierarchical Patch Aggregation Transformer architecture (HPAT) is proposed. In the initial feature extraction stage, HPAT combines Axis-Selective Transformer Blocks with linear complexity, supplemented by an adaptive hierarchical attention fusion mechanism. These mechanisms enable the model to effectively capture the spatial relationships between features and integrate features from different hierarchical levels. We then redesign the feedforward network of the Transformer block in the encoder-decoder structure and propose the Fusion Feedforward Network. This effective aggregation enhances the ability to capture and retain local detailed features. We evaluate HPAT through extensive experiments and compare its performance with baseline methods on public datasets. Experimental results show that the proposed HPAT model achieves state-of-the-art performance in image deblurring tasks.
Notes
These authors contributed equally to this work.


1 Introduction

In recent years, driven by the continuous advancement of imaging equipment and computing technology, the demand for high-quality images has become increasingly urgent. However, real-world imaging scenarios involve various uncertainties and disturbances, such as camera shake [1], relative motion, and lighting changes [2], which inevitably lead to image blur. These blurring phenomena significantly reduce the ability of humans and devices to perceive information in images. As a result, the accuracy and precision of advanced vision tasks such as image segmentation [3], autonomous driving [4], and satellite monitoring [5] are adversely affected.
To solve this classic ill-posed problem, many model-based methods [6–14] formulate image deblurring as a constrained optimization problem [15, 16]. These methods require carefully designed strong priors and regularization terms to constrain the solution space. However, their complex iterative solution processes limit practical application, and they are often difficult to adapt to diverse types of blur and scenes, showing weak robustness in complex situations.
With the rapid development of large-scale datasets and convolutional neural networks (CNNs), learning-based methods [17–23] have achieved remarkable results in the field of image restoration. These methods exploit abundant image data to implicitly learn the mapping between blurred and sharp images. By training models to minimize the discrepancy between the restored output and the sharp reference, they effectively achieve deblurring and detail restoration. However, the practical applicability of these methods is limited by the feature extraction style and the characteristics of the convolution operation in CNNs [24, 25]. They are sensitive to input image size, prone to introducing distortion and noise, and have difficulty processing global information effectively. Therefore, researchers have explored multi-scale processing of input images to enable models to capture information at various scales. In particular, combining encoder-decoder structures with residual learning or generative adversarial networks [26, 27] has been attempted to enhance performance. However, the encoder-decoder design [20] mainly focuses on recovering image features layer by layer; this mechanism, which relies only on basic local feature extraction, cannot fully capture the global semantic information of the source image. This limitation hinders the model's ability to accurately understand the underlying semantics of blurred images, and as the structure deepens, it often fails to preserve the fine-grained local features of the input image.
Recently, a powerful feature extractor based on the self-attention mechanism, the Transformer, has attracted widespread attention in the field of computer vision [28–30]. Transformers excel at modeling long-range dependencies and correlations between different parts of an image, making them particularly effective for dense tasks such as image deblurring. Nonetheless, the quadratic computational complexity that arises when handling large feature maps or increasing the number of attention heads still needs to be addressed [31]. Furthermore, existing Transformer-based methods cannot dynamically identify feature importance. Features at different levels may have different semantics and representation capabilities, and treating them as equivalent entities leads to information confusion and loss. Simply concatenating or adding features from different layers often fails to fully explore the potential interdependencies across layers, thus limiting the model's ability to solve complex image deblurring challenges. In addition, the feedforward network of the Transformer usually focuses on nonlinear transformations [32, 33], but within a limited receptive field it adopts the same set of learned parameters for all inputs, regardless of the underlying fine-grained features in overlapping image patches. This limitation wastes local features and may lead to artifacts and distortions.
In this paper, a novel Hierarchical Patch Aggregation Transformer Network (HPAT) is proposed, which aims to further explore the potential of Transformer networks for image deblurring tasks. Specifically, in the feature extraction stage, we design an additional hierarchical attention fusion module to adaptively integrate features at different levels and enhance the network's ability to model global semantic information within the image. The hierarchical attention fusion module consists of cascaded axis-selective Transformer blocks and a cross-layer feature interaction mechanism. Since objects in an image may change position and orientation while maintaining their semantic information, we propose to use axis-selective Transformer blocks with shared weights to compute image similarity. This method effectively captures the spatial relationships and structural information between features with linear complexity. The hierarchical attention fusion module facilitates the cross-layer flow of attention information and adaptively fuses features at different levels, which overcomes the uniform feature treatment and limitations of the original Transformer model. Our work enables higher-level semantic learning at lower computational cost. In addition, we replace the feedforward module in the Transformer block with the Fusion Feedforward Network (F3N), which aggregates more token information to improve performance and restore smooth boundaries and overall semantics.
The main contributions of our work are summarized as follows:
  • In the feature extraction stage, the Hierarchical Attention Fusion Module (HAFM) is proposed to address the lack of semantic information in the early stages of the encoder-decoder structure. This module utilizes an axis-selective Transformer block with linear complexity to model the spatial relationships between pixels, adaptively enhancing important features in global space. Subsequently, the cross-layer feature interaction mechanism realizes similarity representation and enhancement of features across different levels.
  • F3N is used to replace the standard feedforward network structure in the Transformer model. F3N is able to match and align sub-token segments corresponding to the same pixel position, and can aggregate information from different patches for the same pixel position. The problem of losing fine-grained local features during the decoding process is solved without adding additional learnable parameters.
  • HPAT introduces a novel Transformer model to capture local and global spatial information effectively. Through extensive experiments and ablation studies, HPAT has demonstrated competitive results on image deblurring, as shown in Fig. 1.

2 Related Work

2.1 CNN-Based Image Deblurring

Digital images are composed of high-dimensional pixels. Although the location and orientation of objects in an image may change, the semantic content usually remains invariant. Convolutional neural networks (CNNs) have demonstrated remarkable performance [34] across various image processing tasks by leveraging local patterns and weight sharing. A CNN not only adapts to translation invariance within images [24], but also learns to handle variations in images through transfer learning and data augmentation.
In the domain of image deblurring, researchers have introduced many CNN-based approaches. Sun, Schuler et al. [35] employed a CNN for image deblurring and obtained encouraging results. Nah et al. [1] proposed DeepDeblur, which directly learns the deblurring process in an end-to-end manner from image data. Tao et al. [36] introduced SRN, which establishes residual skip connections at various scales to capture multi-scale features from the original blurred images. It focuses on differences between blurred and sharp regions rather than differences across the entire image.
Furthermore, Kupyn et al. [37] introduced adversarial learning to address pixel-level and content-level errors. Mao et al. [38] introduced DeepRFT, which redesigns residual blocks by combining channel-wise fast Fourier transforms [39] to model high-frequency and low-frequency information simultaneously. Zamir et al. [40] proposed MPRNet, a parallelizable multi-stage architecture that achieves multi-resolution information fusion in the image reconstruction stage. Chen et al. [41] introduced HINet, which recalibrates the mean and variance of features by incorporating instance normalization with learnable affine parameters. MAXIM, proposed by Tu et al. [42], has a symmetric encoder-decoder structure and utilizes axis gating [43] and spatial attention gating weights over fixed windows and grids, which promotes global and local spatial interactions. Guo et al. [44] introduced MFFDNet, which uses a densely connected structure for feature fusion and enhances the deblurring effect by reusing extracted features.

2.2 Transformer-Based Image Deblurring

The Transformer architecture, with its global receptive field, has been widely used in the vision field. The Vision Transformer, proposed by Dosovitskiy et al. [45] and applied to image data processing, demonstrated strong adaptability to different image sizes. However, the Vision Transformer divides the image into fixed-size patches and applies linear transformations, which may lead to information loss, especially for pixel-level tasks.
To solve this problem, IPT [32] was proposed as an end-to-end model that improves performance by introducing self-attention mechanisms into traditional convolutional feature extractors. Liu et al. [33] introduced the Swin Transformer, which incorporates local window attention and shift operations to prevent information loss and facilitate feature propagation across different resolutions. Nonetheless, an excessive emphasis on global information modeling can lead to an imbalance with local information. To address this problem, Wang et al. [46] introduced Uformer, a structured encoder-decoder architecture constructed with locally-enhanced window Transformer blocks. This design effectively captures both local and global dependencies for image restoration tasks. Similarly, the Restormer structure proposed by Zamir et al. [47] employs gating mechanisms, local content mixing, cross-feature covariance computation, and local context blending to enhance information flow and global connectivity modeling. Additionally, CS-Kit [48] groups similar blocks, employs spatial local attention among these blocks, and utilizes Scale-aware Patch Embedding (SPE) to achieve mixed-scale patch aggregation, catering to the requirements of locality, non-locality, and cross-scale aggregation.

2.3 Feedforward Network

In the Transformer architecture [49], feedforward networks are key components of both the encoder and decoder blocks. A feedforward network consists of classic fully connected layers that perform nonlinear transformations on the output of the self-attention layer. This structure adapts to data complexity and generates new representations to propagate to subsequent layers. Because the feedforward network has a fixed size independent of the input length, the Transformer can be applied to sequences of different lengths.
However, for image deblurring tasks, feedforward networks must learn to recognize and restore fine-grained details lost during the blurring process. Traditional feedforward networks struggle to fully exploit the contextual information in the input features to generate sharp images [50]. This limitation can lead to poor results and reduce the efficiency of detail restoration. Furthermore, the receptive field of traditional feedforward networks is limited, which hinders their ability to effectively capture long-range dependencies in input images. To address these shortcomings, feedforward network variants have emerged in Transformer architectures [46, 47] in recent years. In Sects. 3.3 and 4.3.2 we provide an in-depth comparison between our proposed F3N and these variants.

3 Hierarchical Patch Aggregation Transformer for Motion Deblurring

Our proposed Hierarchical Patch Aggregation Transformer Network (HPAT) is an efficient multi-level network based on Transformer blocks, which aims to fully exploit the local and non-local features of images, thereby improving image deblurring capabilities. The structure of HPAT is shown in Fig.  2.

3.1 Overall Pipeline

Given a degraded image \(I \in \mathbb {R}^{3 \times H \times W}\), we first use a linear mapping block to learn the inter-pixel correlations within local regions and obtain shallow features \(F_0 \in \mathbb {R}^{H \times W \times C}\). Subsequently, the Hierarchical Attention Fusion Module (HAFM) transforms the shallow features \(F_0\) into enhanced features \(F_E\). In this module, \(F_0\) captures spatial relationships sequentially along the height and width dimensions of the image through M Axis-Selective Transformer Blocks (ASTB). This effectively filters out more informative details, producing intermediate features \(F_1,F_2,...,F_M \in \mathbb {R}^{H \times W \times C}\). The cross-layer feature interaction mechanism then aggregates activations from different layers to form the enhanced map \(F_E \in \mathbb {R}^{H \times W \times C}\). Next, the optimized \(F_E\) is passed through a symmetric 4-stage encoder-decoder structure. Each encoder-decoder stage is composed of a pair of Local Enhancement Transformer Blocks (LETB), each consisting of a Window-based Multi-head Self-Attention (W-MSA) residual module and a feedforward network residual module, yielding encoder features \(E_i \in \mathbb {R}^{\frac{H}{2 ^i} \times \frac{W}{2^ i} \times 2^i C}\) and decoder features \(D_i \in \mathbb {R}^{\frac{H}{2^i} \times \frac{W}{2^i} \times 2^ i C}\), where \(i=0,1,2,3\). To recover the lost high-frequency information, skip connections are established between the encoder and decoder features at the corresponding level. Finally, a clean image is generated by adding the residual \(Re\in \mathbb {R}^{3 \times H \times W}\) to the input image.
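For intuition, the data flow described above can be summarized in a runnable schematic. The sketch below is ours, not the authors' released code: plain convolutional blocks stand in for the ASTB, CFIM, and LETB modules, and the channel widths are illustrative; only the overall topology (shallow features, M-stage feature extraction with fusion, 4-level U-shaped encoder-decoder with skip connections, residual output) follows the description.

```python
import torch
import torch.nn as nn

def block(c):
    # Placeholder for a Transformer block (ASTB or LETB); a conv keeps the sketch runnable.
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.GELU())

class HPATSketch(nn.Module):
    """Schematic HPAT forward pass: shallow features -> hierarchical feature extraction
    and fusion -> 4-level encoder-decoder with skips -> residual added to the input."""
    def __init__(self, dim=32, num_astb=3, levels=4):
        super().__init__()
        self.shallow = nn.Conv2d(3, dim, 3, padding=1)              # linear mapping block
        self.astbs = nn.ModuleList(block(dim) for _ in range(num_astb))
        self.fuse = nn.Conv2d(dim * num_astb, dim, 1)               # stands in for CFIM
        self.enc = nn.ModuleList(block(dim * 2 ** i) for i in range(levels))
        self.down = nn.ModuleList(nn.Conv2d(dim * 2 ** i, dim * 2 ** (i + 1), 2, stride=2)
                                  for i in range(levels - 1))
        self.dec = nn.ModuleList(block(dim * 2 ** i) for i in reversed(range(levels - 1)))
        self.up = nn.ModuleList(nn.ConvTranspose2d(dim * 2 ** (i + 1), dim * 2 ** i, 2, stride=2)
                                for i in reversed(range(levels - 1)))
        self.to_rgb = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, blurred):
        f = self.shallow(blurred)
        feats = []
        for astb in self.astbs:                                     # F_1 ... F_M
            f = astb(f)
            feats.append(f)
        x = self.fuse(torch.cat(feats, dim=1))                      # enhanced feature F_E
        skips = []
        for i, e in enumerate(self.enc):                            # encoder path E_0 ... E_3
            x = e(x)
            if i < len(self.down):
                skips.append(x)
                x = self.down[i](x)
        for u, d, s in zip(self.up, self.dec, reversed(skips)):     # decoder path with skips
            x = d(u(x) + s)
        return blurred + self.to_rgb(x)                             # add residual Re

print(HPATSketch()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 3, 256, 256])
```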
We use pixel-level Charbonnier loss [19] to ensure that HPAT reduces artifacts while maintaining sharpness.
$$\begin{aligned} \begin{aligned} \mathbb {L}(G_i, {I'}_i) = \frac{1}{N} \sqrt{{\left\| G_i-{I}'_i \right\| }^2 + {\xi }^2}, \end{aligned} \end{aligned}$$
(1)
here, \(G_i\) represents the pixel values of the ground truth, \({I'}_i\) denotes the pixel values output by the model, and \(\xi =10^{-3}\) is utilized for smoothing the loss function.
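For reference, the per-pixel form of the Charbonnier loss in Eq. (1) that is commonly used in practice can be written in a few lines of PyTorch; this is a generic sketch, not the authors' exact training code.

```python
import torch

def charbonnier_loss(restored: torch.Tensor, sharp: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier loss: a smooth, differentiable approximation of the L1 distance,
    averaged over all N pixels (Eq. 1), with eps = 1e-3 as the smoothing constant."""
    diff = restored - sharp
    return torch.sqrt(diff * diff + eps * eps).mean()

# Example with random stand-ins for the model output I' and the ground truth G
print(charbonnier_loss(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)).item())
```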

3.2 Hierarchical Attention Fusion Module

In the encoder-decoder model built by the Transformer block [46, 47], the encoder stage usually compresses the input image into a lower-dimensional feature representation space, and then gradually reconstructs the image in the decoder stage. In addition, each layer of the encoder-decoder structure processes features independently, so that it cannot fully utilize global context information, thus limiting its feature representation capabilities. Therefore, we propose to integrate a hierarchical attention fusion module in the feature extraction stage to enhance the capability of the encoder-decoder structure. This strategic addition can more effectively exploit global contextual information, thereby enhancing feature representation capabilities and understanding of the overall semantic aspects of the image. The hierarchical attention fusion module includes the Axis-Selective Transformer Block (ASTB) and the Cross-layer Feature Interaction Mechanism (CFIM) module, as shown in Fig. 2.

3.2.1 Axis-Selective Transformer Block

To avoid excessive computational costs in the feature extraction stage, the ASTB divides the multi-head self-attention computation into operations along the height and width dimensions. Unlike existing methods [51, 52], our axis attention method is sequential and follows an ordered concatenation approach rather than executing in parallel. In addition, a filtering mechanism is proposed in the feed-forward network to adaptively select and emphasize key feature information. This improvement focuses attention more effectively on valuable features, thereby enhancing the model’s feature expression and reconstruction capabilities. As shown in Fig. 2, the information flow of ASTB can be explained as follows:
$$\begin{aligned} \begin{aligned} \hat{f^L}&= CA-MSA(LN(f^{L-1})) + f^{L-1}\\ f^L&= FEN(LN(\hat{f^{L}})) + \hat{f^L}, \end{aligned} \end{aligned}$$
(2)
where LN denotes the layer normalization operation, and \(\hat{f^{L}}\) and \(f^L\) denote the outputs of the cross-axis multi-head self-attention (CA-MSA) and the Feature Enhancement Network (FEN), respectively.
Figure 3a illustrates the entire process of cross-axis multi-head self-attention (CA-MSA). To begin with, a combination of \(1\times 1\) and \(3\times 3\) depthwise convolutions is applied to the layer-normalized features \(X_0 \in \mathbb {R}^{H \times W \times C}\) for nonlinear mapping. This transforms the features into representations suitable for the attention mechanism, specifically queries (Q), keys (K), and values (V). The \(1\times 1\) convolution is utilized to map input features to a higher-dimensional space, while the \(3\times 3\) depthwise convolution is employed to capture more extensive contextual information within the input features. To perform the dot product operation along a specific axis and capture attention information along that axis, we reshape the dimensions of Q, K, and V to obtain \(Q'\in \mathbb {R}^{H \times W \times C}\),\(K'\in \mathbb {R}^{C \times H \times W}\) and \(V'\in \mathbb {R}^{C \times H \times W}\). This reshaping is done to simplify the computation of the dot product. The self-attention mechanism is utilized to compute correlations between different positions, capturing dependencies among features. These dependencies are commonly linked to spatial positions and channel expressions. Therefore, we need j independent heads to autonomously learn correlations among different channels. These heads are acquired by splitting \(Q', K'\), and \(V'\) along the channel dimension, represented as: \({{q'}^1,...,{q'}^j} \in \mathbb {R}^{H \times W \times (C/j)}\), \({{k'}^1,...,{k'}^j} \in \mathbb {R}^{(C/j) \times H \times W}\), \({{v'}^1,...,{v'}^j} \in \mathbb {R}^{(C/j) \times H \times W}\). The similarity between each pair of heads \(q^j\) and \(k^j\) is calculated through dot product operations, followed by normalization using the softmax function, resulting in the attention weight matrix \({A^j} \in \mathbb {R}^{H \times W} \) along the height axis. This matrix reveals the distribution of importance of input features along the height axis. Subsequently, through dot product operations between \(A^j\) and \(V'\) for each head, a weighted averaging operation on \(V'\) is achieved, emphasizing more important feature information along the height axis. Ultimately, concatenating the results of the weighted averaging along the channel dimension and applying reshape and \(1\times 1\) convolution operations result in the output feature \(X_h\) along the height axis. The formulation of this process is as follows:
$$\begin{aligned} X_h = \text{Conv}_{1\times 1}\left( \text{Reshape}\left( \text{Concat}_{i=1}^j\left( \text{softmax}\left( {{q'}^i {k'}^i}/\beta \right) {v'}^i\right) \right) \right) , \end{aligned}$$
(3)
where i denotes the i-th head of \(q'\), \(k'\), and \(v'\), \(\beta \) represents the scaling factor, and Concat denotes the concatenation operation. This ensures that the feature representation at each position incorporates information from all height positions, providing global and comprehensive information for subsequent processing. Next, a reshape operation is performed to execute attention along the width axis.
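The tensor layout of CA-MSA above is specific to this paper, so we do not reproduce it literally. The sketch below instead shows the generic form of the idea, standard multi-head self-attention restricted to the height axis (axial attention), where every column attends only over its H positions; the width-axis step is symmetric. Dimensions and head counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HeightAxisAttention(nn.Module):
    """Multi-head self-attention along the height axis only: each column of the
    feature map is treated as an independent sequence of length H, so the cost
    scales with H*W*H rather than (H*W)^2 for full spatial attention."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        seq = x.permute(0, 3, 2, 1).reshape(b * w, h, c)    # (batch*width, height, channels)
        out, _ = self.attn(seq, seq, seq)                   # attend over height positions
        return out.reshape(b, w, h, c).permute(0, 3, 2, 1)  # back to (B, C, H, W)

x = torch.randn(2, 32, 64, 48)
print(HeightAxisAttention(32)(x).shape)  # torch.Size([2, 32, 64, 48])
```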
In the processing of large-scale image data and real-time applications, it is crucial to reduce redundant information, enhance feature compactness, and improve representation efficiency. In this case, it becomes imperative to filter out features with relatively low information content during feature extraction. As shown in Fig. 3b, compared with the Gated-Dconv feed-forward network (GDFN) [47], our Feature Enhancement Network (FEN) retains two parallel paths, each of which contains a gating mechanism to evaluate feature importance more accurately. This nuanced approach further augments the effectiveness of filtering. Given the normalized tensor \(X\in \mathbb {R}^{\hat{H}\times \hat{W}\times \hat{C}}\), we begin by evenly splitting the channel count into two halves, creating two pathways. Each pathway utilizes a combination of \(1\times 1\) convolutions and \(3\times 3\) depthwise convolutions to capture local information. Subsequently, the GELU activation function and element-wise multiplication operations are applied to gate the features from the two branches, thereby filtering and emphasizing crucial information. This enables the network to learn essential details from the input data. Ultimately, the information from the two branches is fused via element-wise summation. The formulation of FEN is as follows:
$$\begin{aligned} \begin{aligned} Gating = \varnothing \left( {{W_d}^0{W_p}^0X} \right)&\bigodot \left( {{W_d}^1{W_p}^1X} \right) + \varnothing \left( {{W_d}^1{W_p}^1X} \right) \bigodot \left( {{W_d}^0{W_p}^0X} \right) \\ Y&= {W_p}^2Gating(X) + X, \end{aligned} \end{aligned}$$
(4)
where \({W_p}^i(i=0,1,2)\) denotes a \(1\times 1\) convolution, \({W_d}^j(j=0,1)\) signifies a \(3\times 3\) depthwise convolution, \(\varnothing \) signifies the GELU activation function, and \(\bigodot \) represents element-wise multiplication operation.
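A direct reading of Eq. (4) gives the following sketch of FEN: two parallel 1×1 + 3×3 depthwise convolution branches gate each other through GELU and element-wise products, followed by a 1×1 projection and a residual connection. The hidden-channel expansion factor is our assumption; the paper does not state it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FEN(nn.Module):
    """Feature Enhancement Network following Eq. (4): mutual gating of two
    1x1-conv + 3x3-depthwise-conv branches, then a 1x1 projection and residual."""
    def __init__(self, dim: int, expansion: int = 2):   # expansion factor is assumed
        super().__init__()
        hidden = dim * expansion
        self.branch0 = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),                               # W_p^0
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden))  # W_d^0 (depthwise)
        self.branch1 = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),                               # W_p^1
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden))  # W_d^1 (depthwise)
        self.project = nn.Conv2d(hidden, dim, 1)                     # W_p^2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b0, b1 = self.branch0(x), self.branch1(x)
        gating = F.gelu(b0) * b1 + F.gelu(b1) * b0    # the two paths gate each other
        return self.project(gating) + x               # Y = W_p^2 Gating(X) + X

print(FEN(32)(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```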

3.2.2 Cross-Layer Feature Interaction Mechanism

In the feature extraction stage, features from different layers usually have varying representation capabilities and importance. However, if each layer processes features independently, it will constrain the diversity and depth of feature expressions, potentially compromising the ability of the encoder-decoder structure to capture global context.
As illustrated in Fig. 4, the features from M individual Axis-Selective Transformer Blocks (ASTB) are concatenated along the channel dimension to form the reshaped feature \(F_M\in \mathbb {R}^{H\times W\times MC}\). Subsequently, feature transformation yields the Q, K, and V tensors for calculating attention weights and feature fusion, expressed as follows:
$$\begin{aligned} \begin{aligned} Q = {W_Q}{F_M} \\ K = {W_K}{F_M} \\ V = {W_V}{F_M} \end{aligned} \end{aligned}$$
(5)
where \({W_Q}\), \({W_K}\), and \({W_V}\) represent the weight matrices.
To derive a layer-wise correlation matrix \(L_A \in \mathbb {R}^{M \times M}\), which describes inter-layer relationships and regulates the allocation of attention weights among layers, we reshape the dimensions of Q, K, and V into \(M \times HWC\). Subsequently, the layer-wise correlation matrix \(L_A\) is multiplied with the reshaped values V, facilitating the weighted fusion of features across different layers. This process aims to retain crucial feature information while eliminating irrelevant features. Finally, the output of the self-attention mechanism is fused with the original input features to comprehensively preserve the original information. The core steps of the Cross-Layer Feature Interaction Mechanism (CFIM) are as follows:
$$\begin{aligned} L_A&= \text{Softmax}\left( \frac{\hat{Q}\hat{K}}{\beta }\right) \hat{V}\\ F_E&= {W_p}^1 L_A + F_M, \end{aligned}$$
(6)
where \(\beta \) represents the scaling factor, \(\hat{Q}, \hat{K}, \hat{V}\) are the reshaped Q, K, V matrices, \(L_A\) is the layer correlation matrix, \(F_M\) denotes the reshaped features concatenated along the channel dimension, \({W_p}^1\) represents a \(1\times 1\) convolution, and \(F_E\) is the output enhanced feature.
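Under a direct reading of Eqs. (5)-(6), CFIM treats each of the M layer outputs as a single token and computes an M×M layer-level attention map; the sketch below reflects that reading. The final 1×1 convolution that brings the fused M·C channels back to C (so that \(F_E\) has C channels, as stated in Sect. 3.1) is our assumption, since the paper leaves this bookkeeping implicit.

```python
import torch
import torch.nn as nn

class CFIM(nn.Module):
    """Cross-Layer Feature Interaction: attention over M layer-level tokens (Eqs. 5-6)."""
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        cat_dim = dim * num_layers
        self.num_layers = num_layers
        self.to_q = nn.Conv2d(cat_dim, cat_dim, 1)      # W_Q
        self.to_k = nn.Conv2d(cat_dim, cat_dim, 1)      # W_K
        self.to_v = nn.Conv2d(cat_dim, cat_dim, 1)      # W_V
        self.wp = nn.Conv2d(cat_dim, cat_dim, 1)        # W_p^1 in Eq. (6)
        self.out = nn.Conv2d(cat_dim, dim, 1)           # assumed reduction back to C channels
        self.beta = nn.Parameter(torch.tensor(1.0))     # scaling factor

    def forward(self, feats):
        f_m = torch.cat(feats, dim=1)                   # (B, M*C, H, W)
        b, mc, h, w = f_m.shape
        m = self.num_layers

        def tok(t):  # reshape so that each layer becomes one token of length H*W*C
            return t.reshape(b, m, (mc // m) * h * w)

        q, k, v = tok(self.to_q(f_m)), tok(self.to_k(f_m)), tok(self.to_v(f_m))
        attn = torch.softmax(q @ k.transpose(1, 2) / self.beta, dim=-1)  # (B, M, M)
        la = (attn @ v).reshape(b, mc, h, w)            # layer-weighted fusion
        fused = self.wp(la) + f_m                       # Eq. (6): W_p^1 L_A + F_M
        return self.out(fused)                          # enhanced feature F_E (C channels)

feats = [torch.randn(1, 32, 64, 64) for _ in range(3)]
print(CFIM(32, 3)(feats).shape)  # torch.Size([1, 32, 64, 64])
```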

3.3 Fusion Feedforward Network

We have investigated three distinct architectures of feedforward networks, one of which emphasizes a global perspective and another which prioritizes local enhancement. However, we observe limitations in the performance of these two architectures in image deblurring tasks. To surmount these limitations, we introduce a novel fusion feedforward network (F3N) designed to integrate and reconstruct features using folding and unfolding operations.
First, we introduce the feedforward network of the traditional Vision Transformer (ViT)[29, 45]. As shown in Fig. 5a, it usually includes two linear layers and a GELU activation function. This structure somewhat limits the model’s ability to capture local information. The linear layers mainly conduct fully connected projections, which may result in the neglect of certain local relationships, particularly those related to spatial connections between pixels.
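For concreteness, the standard ViT feedforward block described above is simply two position-wise linear layers with a GELU in between; a minimal sketch follows (the 4x expansion ratio is the conventional choice, assumed here).

```python
import torch
import torch.nn as nn

class ViTFeedForward(nn.Module):
    """Standard Transformer FFN: Linear -> GELU -> Linear, applied to each token independently."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * expansion),
            nn.GELU(),
            nn.Linear(dim * expansion, dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, N, C)
        return self.net(tokens)

print(ViTFeedForward(64)(torch.randn(2, 196, 64)).shape)  # torch.Size([2, 196, 64])
```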
Next, inspired by pioneering work [46], we adopt the locally enhanced feedforward network shown in Fig. 5b. It begins with a linear projection using a \(1\times 1\) convolution to increase the feature dimension of each token. The projected tokens are then reshaped into a 2D feature map, and a \(3\times 3\) depthwise separable convolutional layer is used to capture local information. Finally, the tokens are remapped from the feature map, and a \(1\times 1\) convolution reduces the dimensionality so that the output integrates with the Transformer structure and maintains the consistency of the overall model. By introducing the TokensToImg operation and depthwise convolution, this feedforward network better captures information between adjacent pixels in 2D space, thereby achieving improved spatial feature extraction and propagation. However, despite the effectiveness of local convolutional operations for neighboring pixels within a limited receptive field, they constrain the network's ability to learn fine-grained features across larger regions of the image. This limitation may result in suboptimal performance when addressing deblurring tasks that require a larger receptive field.
To address these concerns, we propose the Fusion Feedforward Network (F3N). Differing from the \(3\times 3\) depthwise separable convolutional layer used in Fig. 5b, F3N employs fold and unfold operations to partition the image into multiple blocks and enhances the receptive field by fusing the features of pixels in overlapping regions. Since these two operations are fixed transformations without trainable parameters, they introduce no additional parameters. Specifically, we aggregate corresponding values from adjacent patches to enhance the network's ability to learn fine-grained features. This architectural design not only empowers the network to exploit broader contextual information but also enables the capture of extensive interdependencies across different parts of the image, culminating in more precise and robust deblurring outcomes. As illustrated in Fig. 5c, we apply a \(1\times 1\) convolution to each token for linear projection, increasing its feature dimension. Subsequently, the Fold operation divides the tokens into 9 blocks of size \(7\times 7\). The Unfold operation then unfolds the divided blocks in a specific order, ensuring overlapping regions between these blocks. We create a normalizer tensor to balance feature values. Specifically, the normalizer is designed as a tensor with all elements equal to 1, ensuring that normalized weights are used in the fusion process. This guarantees reasonable weight allocation for pixel features in overlapping regions, preventing certain regions from dominating the fusion process and maintaining the rationality and stability of the fusion results. Finally, the fused feature tensor is reshaped to the original shape and undergoes a linear transformation through a \(1\times 1\) convolution to obtain the output.
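The aggregation-and-normalization step at the heart of F3N corresponds to the standard overlap-add pattern offered by torch.nn.functional.unfold/fold: overlapping patches are extracted, summed back onto the pixel grid, and a folded tensor of ones supplies the per-pixel overlap count used as the normalizer. The patch size and stride below are illustrative assumptions and only cover this aggregation step; in F3N itself, per-patch token mixing and the surrounding \(1\times 1\) projections happen around it.

```python
import torch
import torch.nn.functional as F

def overlap_aggregate(x: torch.Tensor, patch: int = 7, stride: int = 4) -> torch.Tensor:
    """Fuse the features of pixels covered by several overlapping patches.
    x: (B, C, H, W). Overlapping patch x patch blocks are extracted, summed back
    onto the (H, W) grid, and normalized by how many patches cover each pixel."""
    b, c, h, w = x.shape
    # Unfold: (B, C*patch*patch, L), one column per overlapping block.
    patches = F.unfold(x, kernel_size=patch, stride=stride)
    # (Per-patch token mixing would be applied to `patches` here.)
    # Fold sums the overlapping contributions back onto the image grid.
    summed = F.fold(patches, output_size=(h, w), kernel_size=patch, stride=stride)
    # Normalizer: folding a tensor of ones counts the overlaps per pixel.
    counts = F.fold(F.unfold(torch.ones_like(x), kernel_size=patch, stride=stride),
                    output_size=(h, w), kernel_size=patch, stride=stride)
    return summed / counts

x = torch.randn(1, 16, 31, 31)     # 31 = 7 + 6*4, so the blocks tile the map exactly
print(overlap_aggregate(x).shape)  # torch.Size([1, 16, 31, 31])
```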

4 Experiments and Results

In this section, we employ the widely recognized Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) [53] metrics for quantitative comparison. We compare our proposed HPAT method with various state-of-the-art methods to evaluate its progress. Furthermore, we conduct comprehensive ablation experiments to thoroughly analyze how the design introduced in the previous section affects the effectiveness of image deblurring.
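For reference, both metrics can be computed with scikit-image as below; this is a generic sketch, and the exact evaluation protocol (e.g. RGB vs. luminance channel, border cropping) may differ from the scripts used for the reported numbers.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(restored: np.ndarray, sharp: np.ndarray):
    """restored, sharp: uint8 RGB images of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(sharp, restored, data_range=255)
    ssim = structural_similarity(sharp, restored, channel_axis=-1, data_range=255)
    return psnr, ssim

# Example with random images; a real evaluation averages over the whole test set.
rng = np.random.default_rng(0)
restored = rng.integers(0, 256, (256, 256, 3), dtype=np.uint8)
sharp = rng.integers(0, 256, (256, 256, 3), dtype=np.uint8)
print(evaluate_pair(restored, sharp))
```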

4.1 Datasets and Experimental Settings

(1) Datasets
The GoPro dataset [1], which includes 2103 pairs of high-quality real-world image samples at a resolution of 1280×720, was used to train the HPAT model. Additionally, we select 1111 pairs of completely distinct test images from this dataset. These test images cover a wide range of scenarios and situations, including irregular object motion, close-range defocusing, backlighting, and other variations. To validate the model's generalization capability, we test the model on 2025 pairs of images from the HIDE dataset [54], which exhibit blurriness due to inaccurate focus caused by varying depths in the captured scenes. We also evaluate the model on 980 pairs of low-light or high-light scene images from the RealBlur_J dataset and 980 pairs of nighttime low-light scene images from the RealBlur_R dataset [55].
(2) Experimental settings
We train the HPAT model end-to-end with PyTorch [56] version 1.9.0 on an A100 SXM4 GPU in a desktop computer. No pretrained networks are utilized, and the AdamW optimizer [57] with default parameters is employed to minimize the reconstruction error. Specifically, images are partitioned into patches of size 256×256. Data augmentation with random horizontal flipping is applied to training samples, and the batch size is 16. The initial learning rate is set to 2e-4 and updated with a cosine annealing strategy [58], decaying to 1e-6 over 3000 epochs. During the feature extraction stage, M is set to 3, with ASTB depths of 1, 2, 4 and 1, 2, 4 attention heads. In the 4-level encoder-decoder stage, the encoder depths are 1, 2, 8, 8, and the attention heads are 1, 2, 4, 8.
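The optimizer and schedule described above correspond to a standard PyTorch setup; the sketch below is a hedged illustration (the model is a stand-in and the scheduler arguments are inferred from the text, not taken from released code).

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 3, 3, padding=1)     # stand-in for the HPAT network
optimizer = AdamW(model.parameters(), lr=2e-4)  # default betas and weight decay
scheduler = CosineAnnealingLR(optimizer, T_max=3000, eta_min=1e-6)  # anneal to 1e-6 over 3000 epochs

for epoch in range(3000):
    # ... iterate over 256x256 crops (batch size 16, random horizontal flips),
    #     compute the Charbonnier loss, loss.backward(), optimizer.step() ...
    scheduler.step()
```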
(3) Parameters and detection time
The model parameters and inference times of the proposed model and the comparison models were measured on the GoPro dataset, as shown in Table 1. Our model has a moderate number of parameters and an inference time of 328 ms, demonstrating relatively efficient performance when processing a single input. The moderate parameter count and the relatively efficient inference time make our model feasible in scenarios requiring high-quality deblurring.
Table 1
A comparison of the parameter count and inference time for different architectures on the GoPro dataset

Method              Params(M)   Time(ms)   Year
DeblurGan-v2[26]    61          42         2018
DMPHN[56]           22          212        2019
DBGAN[59]           12          909        2020
MIMO-UNet[60]       16          22         2020
DeepRFT[38]         10          242        2021
MPRNet[40]          20          104        2021
HINet[41]           89          31         2022
DGUNet[61]          18          -          2022
NAFNet[62]          68          92         2022
IPT[32]             114         -          2020
Uformer-B[46]       51          310        2021
Restormer[47]       26          798        2022
MFFDNet[44]         38          123        2023
Our                 58          328        2023
Table 2
Quantitative comparisons of our models with the SOTA deblurring methods on the GoPro dataset. The best and 2nd best results are highlighted and underlined

Method              PSNR\(\uparrow \)   SSIM\(\uparrow \)   Year
DeblurGan-v2[26]    29.08               0.873               2018
DMPHN[56]           30.45               0.902               2019
DBGAN[59]           31.10               0.942               2020
MIMO-UNet[60]       32.44               0.933               2020
DeepRFT[38]         32.82               0.938               2021
MPRNet[40]          32.66               0.936               2021
HINet[41]           32.77               0.936               2022
DGUNet[61]          32.71               0.937               2022
NAFNet[62]          32.87               0.948               2022
IPT[32]             32.58               0.935               2020
Uformer-B[46]       32.97               0.942               2021
Restormer[47]       32.92               0.940               2022
MFFDNet[44]         32.87               0.959               2023
Our                 33.43               0.961               2023

4.2 Results and Analysis

4.2.1 Deblurring Results on the GoPro Dataset

Quantitative Evaluation: On the GoPro dataset, we compared our proposed HPAT method against existing CNN-based and Transformer-based methods, as depicted in Table 2. HPAT outperforms all competing methods in both PSNR and SSIM metrics. Compared to the runner-up, HPAT achieved an increase of 0.46 dB in PSNR on the GoPro dataset, and an improvement of 10% in SSIM.
Qualitative Evaluation: Visual comparisons between our HPAT model and the contrastive methods on the GoPro dataset are illustrated in Figs. 6 and 7. Evidently, our model has restored images with greater clarity. Prior methods exhibited shortcomings in restoring the overall contour of vehicles and intricate details of tire rims. They also struggled to accurately recover fine details in rapidly moving human limbs. Overall, our results exhibit closer edge and detail texture resemblance to real images, especially when dealing with fast-moving subjects.

4.2.2 Deblurring Results on the HIDE Dataset

Table 3
Performance comparison on HIDE dataset. Our method is trained on the GoPro dataset and applied to the HIDE dataset. The best and second-best scores are highlighted and underlined

Method              PSNR\(\uparrow \)   SSIM\(\uparrow \)   Params(M)   Time(ms)   Year
DeepDeblur[1]       25.73               0.874               12          86         2017
DeblurGan-v2[26]    27.51               0.848               61          40         2018
DMPHN[56]           27.79               0.864               22          217        2019
DBGAN[59]           28.97               0.913               12          899        2020
MIMO-UNet[60]       29.99               0.906               16          21         2020
DeepRFT[38]         30.99               0.919               10          236        2021
MPRNet[40]          30.96               0.917               20          98         2021
HINet[41]           30.33               0.909               89          24         2022
DGUNet[61]          30.96               0.918               18          -          2022
Uformer-B[46]       30.89               0.919               51          307        2021
Restormer[47]       31.22               0.921               26          778        2022
MFFDNet[44]         30.16               0.932               38          119        2023
Our                 30.79               0.939               58          314        2023
Quantitative Evaluation: Following previous work [42], we evaluated the model trained on the GoPro dataset directly on the HIDE dataset. The results in Table 3 show that HPAT obtains competitive PSNR and the highest SSIM. Although its PSNR is slightly lower than that of the best-performing prior model, its SSIM surpasses all competing methods. Notably, while achieving an improvement of 0.46 dB in PSNR on the GoPro dataset, HPAT also attains an increase of 7.4% in SSIM. Since the model was never trained on this dataset, these results highlight its strong generalization capability.
In order to solve the problem of suboptimal PSNR performance on the HIDE dataset, we conducted an in-depth analysis of the characteristics of the dataset and the design direction of the HPAT model. The HIDE dataset centers around images featuring individuals with concealed or partially obscured features, encompassing complex backgrounds, occlusions, and varying degrees of blur. These factors create challenges for PSNR assessment. In this context, the HPAT model emphasizes global information processing. Its performance in pixel-level accuracy may be slightly less pronounced compared to models that focus on reconstructing local pixel-level details. It is worth noting the difference between the evaluation metrics: PSNR emphasizes pixel-level errors, while SSIM focuses more on perceptual features such as image structure, contrast, and texture. Since the HPAT model prioritizes maintaining the consistency of the overall structure, when the image contains complex scenes and diverse content, its pixel-level reconstruction performance may be relatively poor compared with methods that pay more attention to local details.
Qualitative Evaluation: Visual comparison results between our HPAT method and evaluation methods on the HIDE dataset are displayed in Figs. 8 and  9. For facial deblurring, our approach not only restores rich detail information but also avoids chromatic aberration and artifact occurrence compared to other methods.

4.2.3 Deblurring Results on the RealBlur Dataset

Table 4
The average PSNR and SSIM on the RealBlur dataset. Applying our GoPro trained model directly on the RealBlur set

Methods             RealBlur_J (PSNR\(\uparrow \) / SSIM\(\uparrow \))   RealBlur_R (PSNR\(\uparrow \) / SSIM\(\uparrow \))   Average (PSNR\(\uparrow \) / SSIM\(\uparrow \))   Year
DeblurGan-v2[26]    26.68 / 0.815                                        33.41 / 0.928                                        30.05 / 0.872                                     2018
DMPHN[56]           26.75 / 0.825                                        33.21 / 0.936                                        29.98 / 0.881                                     2019
DeepRFT[38]         26.66 / 0.823                                        34.03 / 0.943                                        30.35 / 0.883                                     2021
MPRNet[40]          26.51 / 0.820                                        33.91 / 0.942                                        30.21 / 0.881                                     2021
HINet[41]           26.36 / 0.800                                        33.80 / 0.938                                        30.08 / 0.869                                     2021
MSSNet[63]          26.59 / 0.826                                        33.93 / 0.945                                        30.26 / 0.886                                     2022
DGUNet[61]          26.60 / 0.824                                        33.96 / 0.943                                        30.28 / 0.884                                     2022
Uformer-B[46]       26.65 / 0.828                                        33.85 / 0.943                                        30.25 / 0.886                                     2021
Restormer[47]       26.63 / 0.823                                        33.98 / 0.946                                        30.31 / 0.885                                     2022
MFFDNet[44]         28.42 / 0.864                                        35.71 / 0.949                                        32.07 / 0.910                                     2023
Ours                28.76 / 0.876                                        36.02 / 0.954                                        32.39 / 0.915                                     2023
Quantitative Evaluation: Consistent with previous practice [46, 47], we further verified our proposed model’s generalization capability on the RealBlur dataset. It’s important to note that we directly employed the model trained on the GoPro dataset for evaluation on the RealBlur dataset, without additional training. Table 4 summarizes quantitative analysis results on the RealBlur_J and RealBlur_R datasets. The outcomes underscore the robust generalization ability of our model.
Qualitative Evaluation: Figs. 10,  11,  12 and  13 showcase the deblurring efficacy of our approach on the RealBlur dataset and compare it with contrastive methods. Satisfactory results were achieved in both high-light and low-light scene blurring scenarios. Notably, our method successfully restored text and patterns with greater precision and authenticity.
In the depicted results from Figs. 14 and  15, we showcase the deblurring effectiveness of our model on autonomously captured images in real-world scenarios. In comparison to alternative models, our outcomes exhibit superiority in terms of both clarity and visual appeal. The reconstructed scenes not only accurately restore the shapes of objects such as clocks, railings, and windows but also capture intricate details like scales, pointers, and bars to a considerable extent.

4.3 Ablation Experiments

In this section, we conduct ablation experiments on each module of the HPAT structure to analyze their effectiveness and contributions. These ablation experiments were performed solely on the GoPro dataset. For efficient evaluation, training was conducted on image patches of size 128×128.

4.3.1 Hierarchical Attention Fusion Module

Overall, the complete integration of the Hierarchical Attention Fusion Module led to a PSNR gain of 0.16 dB, as presented in Table 5h. Subsequently, we incrementally introduced the functional blocks of the module to assess their individual effectiveness. However, Table 5f indicates that blindly incorporating the ASTB during the feature extraction phase to capture long-range dependencies between pixels led to a decrease in metrics. This phenomenon might stem from the lack of non-linear modeling, thereby failing to adequately integrate and filter features, resulting in irrelevant features interfering with the final enhanced feature. Table 5g corroborates this speculation but still presents inconsistencies in generated features at higher levels, causing difficulty in effectively utilizing different hierarchical feature information, especially lacking efficient contextual information in deeper layers. The CFIM further mitigated these shortcomings, facilitating global feature modeling and integration, effectively overcoming limitations in early-stage feature processing within the encoder-decoder structure, as indicated in Table 5h.
To validate the effectiveness of prioritizing global information over local features, we replaced the global strategy of the HAFM module with local strategies using either a single convolution or multiple convolutions. As indicated in Table 6, when prioritizing global information, the HAFM module exhibited at least a 0.14 dB improvement in PSNR compared to the local strategies.
Considering global consistency, blur usually affects the entire image, so prioritizing global information aids in restoring the overall consistency and coherence of the image. For contextual understanding, global information assists in comprehending the scenes and objects within an image. It guides the deblurring algorithm to recover details more effectively in crucial areas while maintaining overall visual harmony. Concerning advanced feature extraction, global information facilitates extracting high-level features that encompass the overall structure and content of the image, offering improved guidance for the deblurring process. When dealing with extensive blurring, such as that caused by camera motion or prolonged exposure, which often affects the entire image, deblurring methods that rely on local features may perform poorly; attention to global information proves more effective in such cases.

4.3.2 Fusion Feedforward Network

To assess the effectiveness and rationale of the Fusion Feedforward Network (F3N), we first examined the impact of fusion operations within F3N. This involved a comparison of various block partitioning methods and sizes. From the results in Table 7, we observe that smaller combinations of ’Number of Patches’ and ’Patch Size’ may result in a reduced receptive field, inadequate for capturing global image information. This limitation could impede the effective transmission and capture of global features, leading to poor PSNR performance, particularly in deblurring tasks with extensive contextual information. Conversely, choosing larger combinations of ’Number of Patches’ and ’Patch Size’ could result in an overly large receptive field, introducing an excess of contextual information and complicating information fusion.
Table 5
Component analysis of HPAT

Variant              Component                       PSNR\(\uparrow \)(dB)
Baseline(a)          Model with Resblock(d)          32.72
Encoder-Decoder(b)   LETB(W-MSA+F3N)(e)              32.83
HAFM(c)              ASTB(CA-MSA+FFN)(f)             32.78
                     ASTB(CA-MSA+F3N)(g)             32.82
                     ASTB(CA-MSA+FEN)+CFIM(h)        32.99
Dividing the image feature tensor into 9 blocks of size \(7\times 7\) resulted in optimal performance according to the PSNR metric. This choice achieves a balance between receptive field size and information fusion, offering an effective and efficient mechanism for feature learning. Similarly, as seen in the results presented in Table 8, this approach demonstrated superior performance in the PSNR metric, underscoring the effectiveness of this partitioning scheme in information transmission and preservation.
Table 6
Dismantling analysis of the efficacy of local features and global information components

Variant   Component                   PSNR\(\uparrow \)(dB)
HAFM      Conv                        32.87
          Conv+Conv+Conv              32.91
          ASTB(CA-MSA+FEN)+CFIM       32.99
Table 7
Study on the ablation of feedforward networks by different block partitioning methods

Number of Patches   Patch size   PSNR\(\uparrow \)(dB)
4                   14           32.93
9                   7            32.99
16                  5            32.97
25                  4            32.94
Table 8
Ablation of feedforward networks with different block sizes

Number of Patches   Patch size   PSNR\(\uparrow \)(dB)
9                   5            32.95
                    7            32.99
                    10           32.98
                    14           32.96
Table 9
Ablation of feedforward networks in encoder-decoder

Variant               Component            PSNR\(\uparrow \)(dB)
Feedforward Network   LETB(W-MSA+FFN)      32.87
                      LETB(W-MSA+LEFF)     32.95
                      LETB(W-MSA+F3N)      32.99
To further validate the effect of the fusion feedforward network within the encoder-decoder structure, we maintained other network structures and hyperparameters constant, while solely replacing the feedforward network for the ablation study with the following three options: (1) FFN, (2) LEFF, and (3) F3N. Table 9 showcases their respective outcomes, demonstrating performance improvements that validate the positive impact of this design. Our Fusion Feedforward Network, when compared to the commonly found feedforward networks in Transformer structures, effectively reduces jagged boundary artifacts through multiple aggregations and integrations of overlapping pixel regions. Simultaneously, it incorporates more surrounding pixel information, thereby enhancing the richness of feature representation.

5 Conclusion and Prospects

In this paper, we introduce a novel multi-level network structure (HPAT) based on the Transformer, which aims to fully exploit local and non-local image features. Specifically, we propose the Hierarchical Attention Fusion Module (HAFM) to leverage global context information during feature extraction. This module enhances feature expression and improves the learning ability of the encoder-decoder architecture. Next, we introduce the Fusion Feedforward Network (F3N), which uses folding and unfolding operations for feature fusion and reconstruction. This technique prevents aliasing artifacts in image deblurring, captures a wider range of dependencies, and produces more accurate and robust deblurring results. Our approach demonstrates performance advantages, interpretability, and generalization capabilities, supported by validation on a variety of datasets. Finally, we conduct thorough ablation experiments to confirm the effectiveness and contribution of each module in improving image deblurring results. Our work mainly focuses on processing global information and still faces performance challenges when processing small objects or scenes with complex details. In future research, our focus will be on optimizing the trade-off between global and local information.

Acknowledgements

This work has been partially supported by the Sichuan Science and Technology Program (Grant Nos. 2022YFG0095 and 2023YFG0300).

Declarations

Conflict of interest

Not applicable.

Ethical approval

Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Literatur
1.
Zurück zum Zitat Nah S, Kim TH, Lee KM (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3883–3891 Nah S, Kim TH, Lee KM (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3883–3891
2.
Zurück zum Zitat Lai WS, Shih Y, Chu LC et al (2022) Face deblurring using dual camera fusion on mobile phones. ACM Transact Graph (TOG) 41(4):1–16CrossRef Lai WS, Shih Y, Chu LC et al (2022) Face deblurring using dual camera fusion on mobile phones. ACM Transact Graph (TOG) 41(4):1–16CrossRef
3.
Zurück zum Zitat Li Y, Li X (2023) Automatic segmentation using deep convolutional neural networks for tumor CT images. Int J Pattern Recogn Artif Intell 37(03):2352003CrossRef Li Y, Li X (2023) Automatic segmentation using deep convolutional neural networks for tumor CT images. Int J Pattern Recogn Artif Intell 37(03):2352003CrossRef
4.
Zurück zum Zitat McManamon P, Piracha U, Jameson S et al (2023) Special section guest editorial: autonomous vehicles. Opt Eng 62(3):031201–031201CrossRef McManamon P, Piracha U, Jameson S et al (2023) Special section guest editorial: autonomous vehicles. Opt Eng 62(3):031201–031201CrossRef
5.
Zurück zum Zitat Yang M, Jiao L, Liu F et al (2019) Transferred deep learning-based change detection in remote sensing images. IEEE Transact Geosci Remote Sens 57(9):6960–6973CrossRef Yang M, Jiao L, Liu F et al (2019) Transferred deep learning-based change detection in remote sensing images. IEEE Transact Geosci Remote Sens 57(9):6960–6973CrossRef
6.
Zurück zum Zitat Rudin LI, Osher S, Fatemi E (1992) Nonlinear total variation based noise removal algorithms. Physica D Nonlinear Phenom 60(1–4):259–268MathSciNetCrossRef Rudin LI, Osher S, Fatemi E (1992) Nonlinear total variation based noise removal algorithms. Physica D Nonlinear Phenom 60(1–4):259–268MathSciNetCrossRef
7.
Zurück zum Zitat Dabov K, Foi A, Katkovnik V et al (2007) Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transact Image Process 16(8):2080–2095MathSciNetCrossRef Dabov K, Foi A, Katkovnik V et al (2007) Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transact Image Process 16(8):2080–2095MathSciNetCrossRef
8.
Zurück zum Zitat Hyun Kim T, Ahn B, Mu Lee K (2013) Dynamic scene deblurring. In: Proceedings of the IEEE international conference on computer vision, pp 3160–3167 Hyun Kim T, Ahn B, Mu Lee K (2013) Dynamic scene deblurring. In: Proceedings of the IEEE international conference on computer vision, pp 3160–3167
9.
Zurück zum Zitat Xu L, Zheng S, Jia J (2013) Unnatural l0 sparse representation for natural image deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1107–1114 Xu L, Zheng S, Jia J (2013) Unnatural l0 sparse representation for natural image deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1107–1114
10.
Zurück zum Zitat Pan J, Hu Z, Su Z et al (2016) Soft-segmentation guided object motion deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 459–468 Pan J, Hu Z, Su Z et al (2016) Soft-segmentation guided object motion deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 459–468
11.
Zurück zum Zitat He K, Sun J, Tang X (2010) Single image haze removal using dark channel prior. IEEE Transact Pattern Anal Mach Intell 33(12):2341–2353 He K, Sun J, Tang X (2010) Single image haze removal using dark channel prior. IEEE Transact Pattern Anal Mach Intell 33(12):2341–2353
12.
Zurück zum Zitat Gu S, Zhang L, Zuo W et al (2014) Weighted nuclear norm minimization with application to image denoising. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2862–2869 Gu S, Zhang L, Zuo W et al (2014) Weighted nuclear norm minimization with application to image denoising. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2862–2869
13.
Zurück zum Zitat Dong W, Zhang L, Shi G et al (2011) Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transact Image Process 20(7):1838–1857MathSciNetCrossRef Dong W, Zhang L, Shi G et al (2011) Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transact Image Process 20(7):1838–1857MathSciNetCrossRef
14.
Zurück zum Zitat Xie J, Hou G, Wang G et al (2021) A variational framework for underwater image dehazing and deblurring. IEEE Transact Circuits Syst Video Technol 32(6):3514–3526CrossRef Xie J, Hou G, Wang G et al (2021) A variational framework for underwater image dehazing and deblurring. IEEE Transact Circuits Syst Video Technol 32(6):3514–3526CrossRef
15.
He R, Zheng WS, Tan T et al (2013) Half-quadratic-based iterative minimization for robust sparse representation. IEEE Trans Pattern Anal Mach Intell 36(2):261–275
16.
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2(1):183–202
17.
18.
Li J, Tan W, Yan B (2021) Perceptual variousness motion deblurring with light global context refinement. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4116–4125
19.
Zamir SW, Arora A, Khan S et al (2020) Learning enriched features for real image restoration and enhancement. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp 492–511
20.
Gao Z, Li E, Wang Z et al (2021) Object reconstruction based on attentive recurrent network from single and multiple images. Neural Process Lett 53:653–670
21.
Park D, Kang DU, Kim J et al (2020) Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training. In: European conference on computer vision, pp 327–343
22.
Suin M, Purohit K, Rajagopalan AN (2020) Spatially-attentive patch-hierarchical network for adaptive motion deblurring. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3606–3615
23.
Lim S, Kim J, Kim W (2020) Deep spectral-spatial network for single image deblurring. IEEE Signal Process Lett 27:835–839
24.
Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: Computer vision–ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part I 13, pp 818–833
25.
Li X, Wu J, Lin Z et al (2018) Recurrent squeeze-and-excitation context aggregation net for single image deraining. In: Proceedings of the European conference on computer vision (ECCV), pp 254–269
26.
Kupyn O, Martyniuk T, Wu J et al (2019) DeblurGAN-v2: deblurring (orders-of-magnitude) faster and better. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8878–8887
27.
Wang M, Hou S, Li H et al (2019) Generative image deblurring based on multi-scaled residual adversary network driven by composed prior-posterior loss. J Vis Commun Image Represent 65:102648
28.
Jiang G, Chen H, Wang C et al (2022) Transformer network intelligent flight situation awareness assessment based on pilot visual gaze and operation behavior data. Int J Pattern Recogn Artif Intell 36(05):2259015
29.
Liang J, Cao J, Sun G et al (2021) SwinIR: image restoration using Swin Transformer. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1833–1844
30.
Chu X, Tian Z, Wang Y et al (2021) Twins: revisiting the design of spatial attention in vision transformers. Adv Neural Inf Process Syst 34:9355–9366
31.
Yuan L, Hou Q, Jiang Z et al (2022) VOLO: vision outlooker for visual recognition. IEEE Trans Pattern Anal Mach Intell 45(5):6575–6586
32.
Chen H, Wang Y, Guo T et al (2021) Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12299–12310
33.
Liu Z, Lin Y, Cao Y et al (2021) Swin Transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
34.
Morikawa C, Kobayashi M, Satoh M et al (2021) Image and video processing on mobile devices: a survey. Vis Comput 37(12):2931–2949
35.
Schuler CJ, Hirsch M, Harmeling S et al (2015) Learning to deblur. IEEE Trans Pattern Anal Mach Intell 38(7):1439–1451
36.
Tao X, Gao H, Shen X et al (2018) Scale-recurrent network for deep image deblurring. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8174–8182
37.
Kupyn O, Budzan V, Mykhailych M et al (2018) DeblurGAN: blind motion deblurring using conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8183–8192
38.
Mao X, Liu Y, Shen W et al (2021) Deep residual Fourier transformation for single image deblurring. arXiv preprint arXiv:2111.11745
39.
Chi L, Jiang B, Mu Y (2020) Fast Fourier convolution. Adv Neural Inf Process Syst 33:4479–4488
40.
Zamir SW, Arora A, Khan S et al (2021) Multi-stage progressive image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14821–14831
41.
Chen L, Lu X, Zhang J et al (2021) HINet: half instance normalization network for image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 182–192
42.
Tu Z, Talebi H, Zhang H et al (2022) MAXIM: multi-axis MLP for image processing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5769–5780
43.
Dauphin YN, Fan A, Auli M et al (2017) Language modeling with gated convolutional networks. In: International conference on machine learning, pp 933–941
44.
Guo C, Wang Q, Dai HN et al (2023) Multi-stage feature-fusion dense network for motion deblurring. J Vis Commun Image Represent 90:103717
45.
Dosovitskiy A, Beyer L, Kolesnikov A et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
46.
Wang Z, Cun X, Bao J et al (2022) Uformer: a general U-shaped transformer for image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 17683–17693
47.
Zamir SW, Arora A, Khan S et al (2022) Restormer: efficient transformer for high-resolution image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5728–5739
48.
Lee H, Choi H, Sohn K et al (2023) Cross-scale KNN image transformer for image restoration. IEEE Access 11:13013–13027
49.
Wu H, Xiao B, Codella N et al (2021) CvT: introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 22–31
50.
Yuan K, Guo S, Liu Z et al (2021) Incorporating convolution designs into visual transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 579–588
51.
Ho J, Kalchbrenner N, Weissenborn D et al (2019) Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180
52.
Wang H, Zhu Y, Green B et al (2020) Axial-DeepLab: stand-alone axial-attention for panoptic segmentation. In: European conference on computer vision, pp 108–126
53.
Wang Z, Bovik AC, Sheikh HR et al (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
54.
Shen Z, Wang W, Lu X et al (2019) Human-aware motion deblurring. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5572–5581
55.
Rim J, Lee H, Won J et al (2020) Real-world blur dataset for learning and benchmarking deblurring algorithms. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, pp 184–201
56.
Zhang H, Dai Y, Li H et al (2019) Deep stacked hierarchical multi-patch network for image deblurring. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5978–5986
58.
59.
Zhang K, Luo W, Zhong Y et al (2020) Deblurring by realistic blurring. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2737–2746
60.
Cho SJ, Ji SW, Hong JP et al (2021) Rethinking coarse-to-fine approach in single image deblurring. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4641–4650
61.
Mou C, Wang Q, Zhang J (2022) Deep generalized unfolding networks for image restoration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 17399–17410
62.
Chen L, Chu X, Zhang X et al (2022) Simple baselines for image restoration. In: European conference on computer vision, pp 17–33
63.
Kim K, Lee S, Cho S (2022) MSSNet: multi-scale-stage network for single image deblurring. In: European conference on computer vision, pp 524–539
Metadata
Title: Hierarchical Patch Aggregation Transformer for Motion Deblurring
Authors: Yujie Wu, Lei Liang, Siyao Ling, Zhisheng Gao
Publication date: 01.04.2024
Publisher: Springer US
Published in: Neural Processing Letters / Issue 2/2024
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI: https://doi.org/10.1007/s11063-024-11594-0