Published in: Complex & Intelligent Systems 1/2024

Open Access 03-08-2023 | Original Article

Multi-scale attention-based lightweight network with dilated convolutions for infrared and visible image fusion

Authors: Fuquan Li, Yonghui Zhou, YanLi Chen, Jie Li, ZhiCheng Dong, Mian Tan



Abstract

Infrared and visible image fusion aims to generate synthetic images including salient targets and abundant texture details. However, traditional techniques and recent deep learning-based approaches have faced challenges in preserving prominent structures and fine-grained features. In this study, we propose a lightweight infrared and visible image fusion network utilizing multi-scale attention modules and hybrid dilated convolutional blocks to preserve significant structural features and fine-grained textural details. First, we design a hybrid dilated convolutional block with different dilation rates that enable the extraction of prominent structure features by enlarging the receptive field in the fusion network. Compared with other deep learning methods, our method can obtain more high-level semantic information without piling up a large number of convolutional blocks, effectively improving the ability of feature representation. Second, distinct attention modules are designed to integrate into different layers of the network to fully exploit contextual information of the source images, and we leverage the total loss to guide the fusion process to focus on vital regions and compensate for missing information. Extensive qualitative and quantitative experiments demonstrate the superiority of our proposed method over state-of-the-art methods in both visual effects and evaluation metrics. The experimental results on public datasets show that our method can improve the entropy (EN) by 4.80%, standard deviation (SD) by 3.97%, correlation coefficient (CC) by 1.86%, correlations of differences (SCD) by 9.98%, and multi-scale structural similarity (MS_SSIM) by 5.64%, respectively. In addition, experiments with the VIFB dataset further indicate that our approach outperforms other comparable models.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Images collected by a single-modality sensor fail to effectively and comprehensively describe imaging scenes due to theoretical and technical limitations [1]. Infrared sensors capture thermal radiation emitted by objects and can generate infrared images with significant targets, even in adverse conditions such as low brightness, occlusions, or harsh weather. However, infrared images are susceptible to noise and lack textural details. In contrast, visible images offer abundant texture and structural information but are sensitive to imaging conditions. As such, infrared and visible image fusion tasks involve reconstructing a single image with comprehensive information from multimodal data, providing both significant targets and valuable texture information. Motivated by variations in imaging scenes, several excellent fusion algorithms have been proposed for broad applications in various advanced vision tasks, including object detection [2], semantic segmentation [3], pedestrian re-identification [4], and visual tracking [5].
In recent years, the fusion of infrared and visible images has attracted the attention of many scholars and has developed rapidly as a result. Existing technologies can be categorized into two groups: traditional methods [6, 7] and deep learning-based methods [8–12]. Traditional image fusion algorithms are typically implemented using multi-scale transform (MST)-based methods [13], sparse representation (SR)-based methods [14], low-rank representation (LRR)-based methods [15], saliency-based methods [16], subspace-based methods [17], and other methods [18]. Although traditional methods have shown superior fusion performance in some aspects, they are also known to encounter specific challenges. (1) They generally require hand-crafted feature representations and carefully designed fusion rules to generate high-quality fused images; this manual intervention can degrade fusion performance. (2) In the case of SR and LRR techniques, it can be difficult to construct a suitable overcomplete dictionary. The runtime for corresponding fusion algorithms is, thus, not conducive to real-time image fusion. (3) Complex feature extraction and fusion strategies often introduce halos and blurred edges, due to the overlapping of asymmetric feature information.
To address these issues, deep learning-based methods have been introduced for infrared and visible image fusion. These frameworks can typically be divided into three categories: auto-encoder (AE) [8, 9], convolutional neural network (CNN) [19], and generative adversarial network (GAN) based architectures [20]. Deep learning offers several advantages for improved representation capabilities, but with certain limitations. First, to reduce the complexity of the network, introducing a down-sampling operation to reduce the image resolution inevitably results in the loss of important information in the fused image. Moreover, modern convolutional networks are not shift-invariant [21]: small shifts or translations in the input cause substantial changes in the output. Second, some methods use only simple feature fusion rules, such as addition and concatenation, which can cause artifacts or blurred edges in fused images. Third, the existing infrared and visible images used for training and testing are mainly derived from the TNO [22] and RoadScene [23] datasets, which restricts the comprehensive evaluation of a model’s generalization performance. For example, FusionGAN [24] crops the source images into patches with a stride of 14, so it lacks the global information needed to learn long-range dependencies and fails to handle complex scenes.
A novel deep learning architecture for the fusion of infrared and visible images is proposed in this paper to address the issues discussed above.
Inspired by previous traditional multi-scale frameworks, we design an encoder network consisting of hybrid dilated convolutional blocks with different dilation rates, which extract multi-scale deep salient features. It is worth noting that no down-sampling is used during feature extraction, so the resulting feature map is the same size as the source images. In addition, to make full use of multi-scale layer characteristics, we introduce different attention modules for each scale, to ensure the network pays attention to specific features and compensates for information loss. To demonstrate the effectiveness of our approach, a representative fusion sample is shown in Fig. 1 and compared with three other excellent deep learning-based algorithms. Our method not only produces higher image contrast (e.g., the person in the infrared image is brighter using our technique), but also improves visual effects (e.g., smoke is preserved in the visible images, and trees in the background exhibit clearer edges). The primary contributions of our work can be summarized as follows:
  • We propose a lightweight network architecture for infrared and visible image fusion, which can capture fine-grained detailed features with a high semantic level and does not require a down-sampling operation.
  • Both spatial attention and channel attention mechanisms are introduced in the encoder-decoder framework at different scales. The proposed method not only forces the network to focus on foreground targets of the infrared image and the background information in the visible image, but also enhances local and global contextual information and attenuates noise.
  • A total loss function is designed to jointly focus on pixel distribution information and texture details in both infrared and visible images, to preserve essential complementary information in each modality.
  • Extensive experiments demonstrate our method’s superiority over state-of-the-art methods. The experimental results on public datasets reveal that our method achieves significant enhancements in entropy (EN) by 4.80%, standard deviation (SD) by 3.97%, correlation coefficient (CC) by 1.86%, correlations of differences (SCD) by 9.98%, and multi-scale structural similarity (MS_SSIM) by 5.64%.

Traditional image fusion methods

Traditional image fusion algorithms can be divided into three steps: feature extraction, fusion, and reconstruction. The feature extraction and reconstruction steps are typically opposite operations. Several multi-scale techniques such as Gaussian pyramid [7], shearlet [25], and nonsubsampled contourlet [26] transforms have been proposed in the past few decades, some of which are utilized in deep learning-based fusion frameworks. In addition, feature extraction methods based on sparse representations include joint sparse representation [14] and latent low-rank representation [27]. Inspired by the sparse coding mechanism of human visual perception, these methods require an over-complete dictionary, so the computational complexity of sparse representations has always been an issue. Finally, by projecting the original high-dimensional features into low-dimensional subspaces whose components are independent of each other, representative subspace-based feature extraction techniques have been developed, including independent component analysis [28], principal component analysis [29], and non-negative matrix factorization [30].

Deep learning-based fusion methods

Convolutional neural networks can learn prior knowledge from large quantities of images and have been widely used for image fusion and other related tasks. Image fusion methods based on deep learning include AE-based algorithms, convolutional neural networks, and GAN-based image fusion models. Liu et al. [31] first proposed a CNN-based fusion framework. Since the purpose of the network is to generate a decision map, this approach is only suitable for multi-focus images. Li et al. [8] proposed a fusion method with a nest connection-based architecture comprising three parts: an encoder network, a fusion strategy, and a decoder network, which extracts deep features at different scales. The feature fusion is manually supervised by hand-crafted rules, which affects fusion performance to a certain extent. Residual end-to-end auto-encoder fusion networks were later proposed to overcome this issue [9].
In addition, by forcing the network to focus on intensity distribution and texture structures in images, infrared and visible image fusion algorithms based on end-to-end convolutional neural networks provide a solution to this problem. For example, Ma et al. [32] used a saliency mask to force the network to focus on texture details in visible images and salient information in infrared images. However, it can be difficult to provide ground truth data to the network for image fusion tasks. Considering extreme illumination conditions for source images, Tang et al. [33] introduced an illumination-aware sub-network that maintains intensity distributions in salient targets and preserves texture information in the background. Furthermore, to facilitate advanced visual tasks, this group introduced semantic segmentation into the image fusion module to improve the semantic information in the fused images. They also proposed a joint low-level and high-level adaptive training strategy to simultaneously achieve superior performance and close the gap in both image fusion and high-level vision tasks [34].
In 2019, Ma et al. [24] first introduced generative adversarial networks into the field of infrared and visible image fusion. Specifically, content loss and adversarial loss are employed to preserve details of thermal radiation in the fused images generated from concatenated source images. However, a single discriminator cannot focus on both infrared and visible regions. As such, Li et al. [35] not only introduced a dual-discriminator conditional generative adversarial network, but also used a multi-scale attention mechanism to constrain the discriminator and focus more on regions of interest, to balance the data distribution and improve fused image fidelity.

Dilated convolutional and attention mechanism applications

Dilated convolution, inspired by wavelet decomposition, enhances the receptive field of a convolutional kernel by inserting zeros between its pixels. This expansion aids the network in capturing detailed information within the scene. Dilated convolution has been widely applied in image classification, object detection, and semantic segmentation. Yu et al. [36] addressed the issue of gridding artifacts introduced by dilation by designing dilated residual networks, which can be effectively employed in downstream tasks such as object localization and semantic segmentation.
The attention mechanism, motivated by the human visual system, has been successfully incorporated into computer vision systems such as image recognition, object detection, semantic segmentation, and action recognition [37]. Channel attention focuses on important objects by assigning new weights to the channels of the feature map. Hu et al. [38] first proposed the concept of channel attention, known as SENet. The core squeeze-and-excitation (SE) block of SENet effectively captures the channel-wise relationship, thereby enhancing the representation capability of the network model. Qin et al. [39] demonstrated that global average pooling can be viewed as a special case of the discrete cosine transform and designed a multi-spectral channel attention mechanism to further enhance the model’s representation capabilities. Spatial attention, on the other hand, can be seen as the adaptive selection of important spatial regions. Hu et al. [40] designed GENet to capture long-distance spatial contextual information in feature maps, enabling the highlighting of important features while suppressing noise. Building upon the success of self-attention in natural language processing, Wang et al. [41] proposed Non-Local networks that expand the receptive fields of the network, enabling the capture of global information. In the context of image fusion, Ma et al. [42] introduced Swin Transformer and proposed intra-domain and inter-domain fusion units based on self-attention and cross-attention, respectively. This approach achieves the integration of complementary information and captures global long-range dependencies, facilitating the effective fusion of multi-domain images.

Methodology

This section describes the proposed lightweight infrared and visible image fusion network architecture in detail. First, we present the overall network pipeline. Hybrid dilated convolutional (HDC) blocks and multi-scale spatial/channel attention are then introduced. Finally, the proposed loss function is discussed.

Problem formulation

Given a pair of registered infrared \(I_{ir}\in R^{H\times W\times 1}\) and visible images \(I_{vis}\in R^{H\times W\times 3}\), under the guidance of a total loss function, the fused image \(I_{f}\in R^{H\times W\times 3} \) can be generated by feature extraction, feature fusion, and reconstruction. Previous deep learning methods emphasized the influence of feature extraction on the quality of fusion results, which led to the design of complex feature extractors; however, the requirement for real-time image fusion was ignored. To improve the ability of feature representation while ensuring real-time infrared and visible image fusion, we design lightweight HDC blocks and multi-scale attention mechanisms that produce high-quality fused images and prevent artifacts (the network architecture is discussed in Section “Network architecture”). The overall framework for our proposed infrared and visible image fusion algorithm is shown in Fig. 2.
First, a fusion network based on HDC blocks is devised to fully extract the high-level semantic information in source images. More specifically, we apply a feature extraction module \(F_E\) to extract fine-grained feature information from infrared and visible images. This process can be represented as:
$$\begin{aligned}&\left\{ F_{i r}, F_{v i s}\right\} =\left\{ F_E\left( I_{i r}\right) , F_E\left( I_{v i s}\right) \right\} \text{, } \end{aligned}$$
(1)
where \(F_{ir}\) and \(F_{vis}\) represent feature maps for infrared and visible images, respectively. Moreover, HDC blocks are deployed in the feature extraction module to expand the receptive field while ensuring that important coarse-grained and fine-grained feature information is extracted, as shown in Fig. 3. Given the HDC input \(F_{i}\), the corresponding output \(F_{i+1}\) can be represented as:
$$\begin{aligned} F_{i+1}=HDC\left( F_i\right) =\phi \left( DConv^{n} (F_{i} )\right) , \end{aligned}$$
(2)
where \(DConv^{n}\) denotes n cascaded \(3\times 3\) dilated convolutional layers and \(\phi \) represents the LReLU activation function. Information flows through the HDC blocks at their respective hierarchical levels in the pipeline. In this paper, HDC blocks capture local and global information of the source images to effectively enhance feature representation capabilities.
The feature fusion and reconstruction module is responsible for converting the feature maps into the fused image. However, simply reconstructing the fused image using convolution operations may result in information loss. Therefore, we introduce different attention modules at different layers of the extractor to fully exploit contextual information from the source images and alleviate the information loss of the feature maps in reconstruction.
To integrate the abundant fine-grained detailed features in infrared and visible images and reconstruct the fused image, the element-wise addition strategy in [43] is used. The formula for this fusion process is as follows:
$$\begin{aligned}&F_f={\text {Add}}\left( \alpha _i\left( F_{i r}\right) , \alpha _i\left( F_{v i s}\right) \right) \text{, } \end{aligned}$$
(3)
where \(F_{f}\) denotes the fused feature maps, \(Add(\cdot , \cdot )\) represents the element-wise addition strategy, and \(\alpha _i\) denotes the attention mechanism at the corresponding scale. Specifically, \(\alpha _1\) is employed to focus on coarse-grained information from infrared and visible images using a spatial attention mechanism, while \(\alpha _{2}\) and \(\alpha _{3}\) are devoted to strengthening fine-grained feature information using a channel attention mechanism. Finally, the fused image \(I_f\) is reconstructed from \(F_f\) via an image reconstructor \(R_i\) as follows:
$$\begin{aligned}&I_f=R_i\left( F_f\right) . \end{aligned}$$
(4)
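For concreteness, Eqs. (1)–(4) can be written as the following PyTorch-style sketch. Here `encoder_ir`, `encoder_vis`, `attns`, and `reconstructor` are placeholder callables rather than the authors' implementation, and the multi-scale interface (a list of feature maps per encoder) is our assumption:

```python
def fuse(ir, vis, encoder_ir, encoder_vis, attns, reconstructor):
    """Sketch of Eqs. (1)-(4).

    encoder_ir, encoder_vis -- the two parallel extractors F_E, each assumed
                               to return a list of feature maps, one per scale
    attns                   -- [alpha_1, alpha_2, alpha_3], one module per scale
    reconstructor           -- the image reconstructor R_i
    """
    feats_ir = encoder_ir(ir)     # Eq. (1): F_ir
    feats_vis = encoder_vis(vis)  # Eq. (1): F_vis

    # Eq. (3): attention-modulated element-wise addition at each scale
    fused_feats = [a(fi) + a(fv)
                   for a, fi, fv in zip(attns, feats_ir, feats_vis)]

    # Eq. (4): reconstruct the fused image from the fused feature maps
    return reconstructor(fused_feats)
```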

Network architecture

The framework for the proposed lightweight fusion network based on hybrid dilated convolutional blocks (HDCBs), shown in Fig. 2, consists of encoder and decoder networks for feature extraction and image reconstruction, respectively.
The feature extractor utilizes three HDCBs to increase the size of the receptive field in the network and capture more contextual information, while ensuring fine-grained features are extracted from infrared and visible images. In addition, a multi-scale spatial/channel attention module is also proposed to retain valuable information and reduce artifacts in multi-modality images. In the feature extractor, the multi-scale shallow layer of the encoder focuses on elemental features using a spatial attention module, while a channel attention module is used to attend to fine-grained features of the source images in the multi-scale deep layers of the encoder. These multi-scale attention features are added as inputs to the corresponding layer features of the decoder network to reconstruct the fused image. As shown in Fig. 2, two parallel encoder modules are used to extract features from infrared and visible images, each containing three HDCBs with dilation rates of 1, 3, and 5, respectively. The design of the HDCB is shown in Fig. 3. The block mainly varies the dilation rates of ordinary convolutions, which prevents the occurrence of gridding problems. Its main branch consists of three convolutional layers with a kernel size of \(3\times 3\) and a stride of 1, each followed by a batch normalization (BN) layer and an LReLU layer. To preserve more diverse and important contextual information, different attention modules are introduced at each scaling layer of the encoder, as shown in Fig. 4. \(FM_{i}\), the input to the attention module, is taken from the output feature maps of each HDCB in the encoder, while \(FM_{o}\) denotes the output of the attention module. The spatial attention mechanism is applied to the shallow features of the first HDCB, while the channel attention mechanism is exploited in the deep scaling layers.
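As a rough illustration (not the authors' released code), one way to realize such a block in PyTorch is sketched below. The channel widths, the LReLU slope, and the placement of the dilation rates (here one rate per block, applied to all three convolutions inside it) are assumptions on our part; the exact configuration follows Fig. 3.

```python
import torch.nn as nn

class HDCB(nn.Module):
    """Hybrid dilated convolutional block: three cascaded 3x3 dilated
    convolutions with stride 1, each followed by BN and LReLU. Setting
    padding equal to the dilation rate keeps the feature map size unchanged.
    Channel widths are placeholders, not the paper's values."""

    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(3):
            layers += [
                nn.Conv2d(ch, out_ch, kernel_size=3, stride=1,
                          padding=dilation, dilation=dilation),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.2, inplace=True),
            ]
            ch = out_ch
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

# One encoder branch: three HDCBs with dilation rates 1, 3, and 5
encoder = nn.ModuleList([HDCB(1, 16, 1), HDCB(16, 32, 3), HDCB(32, 64, 5)])
```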
Attention maps for infrared and visible images at different scales are then integrated via an element-wise addition strategy, and the results are fed into the decoder network to achieve image reconstruction. The decoder network in the image reconstructor generates fused images using three \(3\times 3\) convolutional layers and three BN layers, all of which are followed by an LReLU activation function. The stride is set to 1 in the fused network with no down-sampling operation, to reduce information loss. As such, fused images are the same size as the source images.
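A matching sketch of the image reconstructor described above, again with placeholder channel widths:

```python
import torch.nn as nn

def build_reconstructor(in_ch=64, out_ch=1):
    """Decoder sketch: three 3x3 convolutions with stride 1 and no
    down-sampling, each followed by BN and an LReLU activation, so the
    fused image keeps the spatial size of the source images."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(32), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(32, 16, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(16), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(16, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2, inplace=True),
    )
```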

Loss function

A total loss function is proposed in this study to retain more comprehensive detail in the resulting images, obtained from salient target information in infrared images and fine-grained features in visible images. This total loss function consists of an intensity loss \(L_{intensity}\) and a detail loss \(L_{detail}\), and is defined as follows:
$$\begin{aligned}&L_{total}=L_{intensity}+\gamma L_{detail}, \end{aligned}$$
(5)
where \(\gamma \) is a weight factor used to balance the intensity loss \(L_{intensity}\) and detail loss \(L_{detail}\).
The intensity loss is designed to constrain intensity similarity between the fused and input images at the pixel level. Therefore, the intensity loss is expressed as:
$$\begin{aligned}&L_{intensity}=\frac{1}{H W}\left\| I_f-\left( p I_{i r}+(1-p) I_{v i s}\right) \right\| _1, \end{aligned}$$
(6)
where W and H represent the width and height of the image, respectively, \(\left\| \cdot \right\| _{1}\) is the \(l_{1}\)-norm, and p denotes the weight of constraints used to integrate the distribution of pixel intensities in infrared and visible images.
However, fused images not only include the pixel intensity distribution of the source images, but also exhibit a fine-grained detail distribution. Hence, a detail loss is introduced to force the fused image to preserve more structure and fine-grained texture information. Detail loss can be expressed as:
$$\begin{aligned}&L_{detail}=\frac{1}{H W}\left\| \left| \nabla I_f\right| -\left( q\left| \nabla I_{i r}\right| +(1-q)\left| \nabla I_{vis}\right| \right) \right\| _1, \end{aligned}$$
(7)
where \(\nabla \) indicates the Sobel gradient operation used to measure the fine-grained information in the source images, q is a weight parameter that constrains the fine-grained features in infrared and visible images, and \(\left| \, \cdot \, \right| \) indicates the absolute value operation.
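A sketch of this total loss in PyTorch, assuming single-channel (grayscale or Y-channel) inputs of shape (B, 1, H, W) and using the hyper-parameter values reported later in “Training details” (\(\gamma = 100\), p = 0.68, q = 0.08); taking \(|\nabla I|\) as the sum of absolute Sobel responses is our choice of gradient magnitude:

```python
import torch
import torch.nn.functional as F

_SOBEL_X = torch.tensor([[-1., 0., 1.],
                         [-2., 0., 2.],
                         [-1., 0., 1.]]).view(1, 1, 3, 3)

def sobel_abs(img):
    """|grad I| for a single-channel batch (B, 1, H, W), taken here as
    |G_x| + |G_y| (one common choice of gradient magnitude)."""
    kx = _SOBEL_X.to(img.device, img.dtype)
    ky = kx.transpose(2, 3).contiguous()
    return F.conv2d(img, kx, padding=1).abs() + F.conv2d(img, ky, padding=1).abs()

def total_loss(fused, ir, vis, p=0.68, q=0.08, gamma=100.0):
    """Eqs. (5)-(7); F.l1_loss averages over all pixels, which plays the
    role of the 1/(HW) normalization in the formulas."""
    l_intensity = F.l1_loss(fused, p * ir + (1 - p) * vis)               # Eq. (6)
    l_detail = F.l1_loss(sobel_abs(fused),
                         q * sobel_abs(ir) + (1 - q) * sobel_abs(vis))   # Eq. (7)
    return l_intensity + gamma * l_detail                                # Eq. (5)
```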
Finally, guided by the total loss function, our proposed fused network based on HDCBs and multi-scale attention provides fused images with a better pixel intensity distribution and larger quantities of detail information, to efficiently generate high-quality images.

Experiments

In this section, we first describe the experimental settings and training details. Then, we conduct both quantitative and qualitative comparative experiments and generalization experiments to fully evaluate the performance of our proposed fusion algorithm. Finally, we introduce ablation experiments to demonstrate the effectiveness of the model design, including detail loss and multi-scale spatial/channel attention.

Experimental settings

We perform extensive quantitative and qualitative experiments using the TNO [22], RoadScene [23], and VIFB [44] datasets to comprehensively evaluate the proposed fusion method. In addition, seven state-of-the-art image fusion algorithms are selected for comparison with our approach, including three typical traditional methods, i.e., IFEVIP [45], GTF [18] and CBF [46], two AE-based models, i.e., MFEIF [47] and NestFuse [8], one CNN-based method IFCNN [19], and one GAN-based method FusionGAN [24]. Implementations of these algorithms are publicly available and corresponding parameters are set in agreement with those in their respective papers.
Nine statistical evaluation indicators are used to quantitatively evaluate our method and the seven other excellent fusion methods: entropy (EN) [48], modified fusion artifacts measure (Nabf) [49], correlations of differences (SCD) [50], spatial frequency (SF) [51], standard deviation (SD) [52], peak signal-to-noise ratio (PSNR) [53], multi-scale structural similarity (MS_SSIM) [54], feature mutual information (FMI), and correlation coefficient (CC). For all metrics except Nabf, larger values indicate better fusion performance.
The EN measures the amount of information contained in a fused image as follows:
$$\begin{aligned}&E N=-\sum _{l=0}^L p_l \log _2 p_l, \end{aligned}$$
(8)
where L and \(p_l\) represent the total number of gray levels and the normalized histogram value of the corresponding gray level in the fused image, respectively. A large EN indicates that a large amount of information is available, representing better fusion performance; note, however, that larger EN values may also be caused by noise.
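For reference, a straightforward NumPy implementation of Eq. (8) for an 8-bit grayscale fused image could look as follows:

```python
import numpy as np

def entropy(img, levels=256):
    """Eq. (8): Shannon entropy of the normalized gray-level histogram."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]                  # drop empty bins so 0*log2(0) is treated as 0
    return float(-np.sum(p * np.log2(p)))
```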
The Nabf, which quantifies the noise or artifacts introduced into the fused image by the fusion process, can be expressed as:
$$\begin{aligned}&N_{m}^{AB/F}=\frac{\sum _{\forall i}\sum _{\forall j}AM_{i,j}\left[ \left( 1-Q_{i,j}^{AF}\right) w_{i,j}^{A}+\left( 1-Q_{i,j}^{BF}\right) w_{i,j}^{B}\right] }{\sum _{\forall i}\sum _{\forall j}\left( w_{i,j}^{A}+w_{i,j}^{B}\right) }, \end{aligned}$$
(9)
$$\begin{aligned}&AM_{i,j}=\left\{ \begin{array}{ll} 1, &{} g_{i,j}^{F}>g_{i,j}^{A}~\text {and}~g_{i,j}^{F}>g_{i,j}^{B}\\ 0, &{} \text {otherwise}\end{array}\right. , \end{aligned}$$
(10)
where \(AM_{i,j}\) indicates locations of fusion artifacts where the fused gradients are stronger than the input gradients, \(Q_{i,j}^{AF}\) and \(Q_{i,j}^{BF}\) denote the gradient information preservation estimates of source images A and B, respectively, \(w_{i,j}^{A}\) and \(w_{i,j}^{B}\) are the perceptual weights of the source images, and \(g_{i,j}^{A}\), \(g_{i,j}^{B}\), and \(g_{i,j}^{F}\) are the edge strengths of A, B, and the fused image F, respectively. A low Nabf value is indicative of superior visual performance in the fused image.
The SCD, which measures the amount of information transmitted from source images to the fused image, can be represented as:
$$\begin{aligned}&SCD=r(D_{1},S_{1})+r(D_{2},S_{2}), \end{aligned}$$
(11)
where \(D_{1}=F-S_{2}\) and \(D_{2}=F-S_{1}\) are the difference images between the fused image F and the source images \(S_{2}\) and \(S_{1}\), respectively, and \(r(\cdot )\) denotes the correlation function.
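A compact NumPy sketch of Eq. (11), following this convention for the difference images (inputs are float arrays):

```python
import numpy as np

def scd(fused, s1, s2):
    """Eq. (11): SCD = r(D1, S1) + r(D2, S2) with D1 = F - S2, D2 = F - S1."""
    def r(a, b):                      # Pearson correlation
        a, b = a - a.mean(), b - b.mean()
        return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum()))
    return r(fused - s2, s1) + r(fused - s1, s2)
```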
The SF metric effectively measures the gradient distribution of images, which reveals the details and texture of images. It can be defined as follows:
$$\begin{aligned}&S F={\sqrt{R F^{2}+C F^{2}}}, \end{aligned}$$
(12)
$$\begin{aligned}&R F={\sqrt{\sum _{i=1}^{M}\sum _{j=1}^{N}\left( F(i,j)-F(i,j-1)\right) ^{2}}}, \end{aligned}$$
(13)
$$\begin{aligned}&CF={\sqrt{\sum _{i=1}^{M}\sum _{j=1}^{N}(F(i,j)-F(i-1,j))^{2}}}, \end{aligned}$$
(14)
where RF and CF are the spatial row and column frequencies, computed from the horizontal and vertical gradients, respectively.
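These formulas translate directly into NumPy; the sketch below omits the 1/(MN) normalization used by some implementations in order to match Eqs. (13) and (14) as written:

```python
import numpy as np

def spatial_frequency(img):
    """Eqs. (12)-(14): SF from horizontal (row) and vertical (column)
    first-order differences of a 2-D float image."""
    rf = np.sqrt(np.sum((img[:, 1:] - img[:, :-1]) ** 2))   # RF, Eq. (13)
    cf = np.sqrt(np.sum((img[1:, :] - img[:-1, :]) ** 2))   # CF, Eq. (14)
    return float(np.sqrt(rf ** 2 + cf ** 2))                # SF, Eq. (12)
```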
The CC metric measures the degree of linear correlation between the fused image and the source images, as defined below:
$$\begin{aligned}&CC={\frac{r_{a f}+r_{b f}}{2}}, \end{aligned}$$
(15)
$$\begin{aligned}&r_{x f} = \frac{\sum _{i=1}^{M}\sum _{j=1}^{N}(x_{i,j}-\mu _{x})(f_{i,j}-\mu _{f})}{\sqrt{\sum _{i=1}^{M}\sum _{j=1}^{N}(x_{i,j}-\mu _{x})^{2}\sum _{i=1}^{M}\sum _{j=1}^{N}(f_{i,j}-\mu _{f})^{2}}}, \end{aligned}$$
(16)
where \(\mu _{x}\) and \(\mu _{f}\) indicate the mean values of the input image x and the fused image f, respectively. A higher value of CC indicates a better correlation and higher image quality for the fused image.
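Equations (15) and (16) amount to averaging two Pearson correlation coefficients, for example:

```python
import numpy as np

def correlation_coefficient(fused, ir, vis):
    """Eqs. (15)-(16): mean of the correlations between the fused image
    and each source image (float arrays)."""
    def r(x, f):
        x, f = x - x.mean(), f - f.mean()
        return float((x * f).sum() / np.sqrt((x ** 2).sum() * (f ** 2).sum()))
    return (r(ir, fused) + r(vis, fused)) / 2
```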
The SD reflects the distribution and contrast of the fused image from a statistical perspective and can be defined mathematically as:
$$\begin{aligned}&S D=\sqrt{\sum _{i=1}^{M}\sum _{j=1}^{N}\left( f(i,j)-\mu \right) ^{2}},&\end{aligned}$$
(17)
where \(\mu \) denotes the mean value of the fused image. A larger SD value indicates higher contrast and generally more favorable visual effects in the fused image.
The MS_SSIM measures the structural similarity between two images across multiple scales. The multi-scale SSIM index is given by:
$$\begin{aligned}&MS\_SSIM({x},{y})=[l_{M}({x},{y})]^{\alpha _{M}}\cdot \prod _{j=1}^{M}[c_{j}({x},{y})]^{\beta _{j}}[s_{j}({x},{y})]^{\gamma _{j}}, \end{aligned}$$
(18)
where M is the highest scale, \(\alpha _{M}\), \(\beta _{j}\) and \(\gamma _{j}\) are used to adjust the relative importance of different components, and \(c_{j}({x},{y})\) and \(s_{j}({x},{y})\) provide a comparison of contrast and structure at the j-th scale image, respectively, while \(l_{M}({x}, {y})\) is only the luminance comparison at scale M.
The PSNR is used to evaluate the ratio of peak signal power to noise power and therefore reflects the amount of distortion during the fusion process. This metric is defined as follows:
$$\begin{aligned}&PSNR=10\log _{10}{\frac{r^{2}}{MSE}}, \end{aligned}$$
(19)
where r indicates the peak value of the fused image and MSE denotes the mean squared error between the fused image and the source images. A higher PSNR value indicates that the fused image is closer to the source images and suffers less distortion.
The FMI is used to measure the amount of feature information transmitted from the source images to the fused image. It is defined as follows:
$$\begin{aligned}&FMI_{F}^{A B}={\frac{1}{n}}\sum _{i=1}^{n}\left( {\frac{I_{i}(A;F)}{H_{i}(A)+H_{i}(F)}}+{\frac{I_{i}(B;F)}{H_{i}(B)+H_{i}(F)}}\right) , \end{aligned}$$
(20)
where \(H_{i}(A)\) and \(H_{i}(B)\) are the entropy of the corresponding windows from the input images, \(I_{i}(A;F)\) and \(I_{i}(B;F)\) indicate the regional mutual information between corresponding windows in the fused image and source images. A larger FMI value commonly implies that a considerable amount of feature information is transferred from the source images to the fused image.

Training details

We train the proposed fusion network on the Multi-Spectral Road Scenarios (MSRS) [33] dataset. The training set includes 1078 pairs of infrared and visible images, while the test set contains 361 image pairs. This dataset is constructed based on MFNet [55] and consists of a large number of nighttime and daytime scenes. Before feeding the training set to the fusion network, all images are normalized to [0, 1] and the parameters are set as follows. The loss hyper-parameters are set to \(\gamma \) = 100, p = 0.68, and q = 0.08. The batch size and number of epochs are set to 8 and 80, respectively. The model parameters are updated by the Adam optimizer with a learning rate of 0.001 and a weight decay of 0.0001. All experiments are performed on an NVIDIA RTX A5000 GPU and a 2.40 GHz Intel(R) Xeon(R) Silver 4214R CPU. Since MSRS contains color visible images, a specific fusion strategy [43] is used to process color image fusion. We first transfer the input visible images from the RGB color space to the YCbCr color space. The Y channel of the visible image is then fused with the infrared image to obtain a new fused Y channel. Finally, the fused Y channel is combined with the Cb and Cr channels of the visible image and converted back to the RGB color space.
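A sketch of this color-handling step using OpenCV; `fuse_y`, which stands for the trained fusion network applied to two single-channel images, is a placeholder:

```python
import cv2

def fuse_color(vis_rgb, ir_gray, fuse_y):
    """Fuse the Y channel of the visible image with the infrared image,
    then recombine with the chrominance channels and convert back to RGB.
    Note that OpenCV stores the channels in Y, Cr, Cb order."""
    y, cr, cb = cv2.split(cv2.cvtColor(vis_rgb, cv2.COLOR_RGB2YCrCb))
    y_fused = fuse_y(y, ir_gray)                       # new fused Y channel
    merged = cv2.merge([y_fused.astype(y.dtype), cr, cb])
    return cv2.cvtColor(merged, cv2.COLOR_YCrCb2RGB)
```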

Results analysis on TNO dataset

We compare the fusion performance for our method with the seven state-of-the-art algorithms applied to 24 image pairs acquired from the TNO dataset. All infrared and visible images display different scenes and are registered before being fed to the network. Samples of these images are shown in Fig. 5.

Qualitative results

Table 1
Average evaluation metric values for all methods applied to 24 image pairs from the TNO dataset

Metrics   GTF      IFEVIP   CBF      FusionGAN  IFCNN    NestFuse  MFEIF    Ours
EN        6.5999   6.6540   6.8784   6.4741     6.6637   6.9888    6.6295   6.9476
SF        0.0373   0.0425   0.0553   0.0259     0.0484   0.0429    0.0290   0.0395
SD        8.8174   8.9208   8.9346   8.2617     8.7769   9.2871    8.8902   9.2430
PSNR      62.7178  62.1595  63.8897  60.9702    64.3641  62.9549   64.5550  64.1134
CC        0.3877   0.4923   0.4377   0.4682     0.5322   0.5226    0.5527   0.5630
SCD       0.9237   1.5413   1.3249   1.2768     1.6169   1.7041    1.7044   1.8574
Nabf      0.0713   0.1189   0.2588   0.0780     0.1779   0.1308    0.0047   0.0993
MS_SSIM   0.8091   0.8443   0.7286   0.7362     0.9022   0.8544    0.8957   0.9462
FMI       0.8953   0.8917   0.8769   0.8788     0.8957   0.8965    0.8983   0.8997

The two best values for each metric are bold and underlined, respectively
For qualitative experiments, fused images produced by existing fusion methods and our proposed method are shown in Figs. 6 and 7. Some representative regions from the fused images are selected and enlarged near the bottom, to more intuitively display and analyze visual effects in the fused results. A significant target is evident in the green box and abundant textural details can be seen in the red box.
Table 2
Average evaluation metric values for all methods applied to 24 image pairs from the RoadScene dataset

Metrics   GTF      IFEVIP   CBF      FusionGAN  IFCNN    NestFuse  MFEIF    Ours
EN        7.524    7.0617   7.4704   7.1238     7.2134   7.5156    7.1476   7.3405
SF        0.0399   0.0555   0.0658   0.0358     0.0630   0.0558    0.0392   0.0545
SD        10.2173  9.8298   10.2358  9.9576     10.0158  10.3017   10.1937  10.3180
PSNR      62.6298  61.5192  63.493   60.5857    64.1579  62.6796   64.1637  63.9071
CC        0.5326   0.6244   0.5673   0.5974     0.6614   0.6628    0.6871   0.6912
SCD       0.9901   1.3149   1.1595   1.0931     1.3921   1.6465    1.5420   1.6868
Nabf      0.0634   0.1593   0.2576   0.1019     0.1796   0.1309    0.0086   0.0786
MS_SSIM   0.7861   0.8361   0.7985   0.7578     0.8991   0.8627    0.8813   0.9088
FMI       0.8628   0.8503   0.8516   0.8486     0.8592   0.8631    0.8627   0.8531

The two best values for each metric are bold and underlined, respectively
As shown, nearly all methods generate some meaningless information due to thermal radiation contamination in the background. However, our method not only highlights the target but also preserves detail information. The region in the green box indicates that although the CBF results include a bright target, the pixel distribution in this area suffers heavily from noise compared to the proposed method. Also, the IFEVIP, GTF, and FusionGAN models severely weaken significant targets in the fused images. In the case of NestFuse, IFCNN, and MFEIF, the fused images indicate that while some of the target edges are highlighted, other salient features and textural details in the fused images are blurred. In contrast, our fusion method produces more realistic contrast and successfully preserves the intensity of significant areas and the texture detail of visible images, compared with other methods. For example, the proposed scheme keeps the internal contours and details of the cars and clouds intact in Fig. 7. This improvement demonstrates one of the primary advantages of our method.

Quantitative results

Quantitative evaluation experiments are conducted using the TNO dataset, employing nine metrics to comprehensively compare our method with seven state-of-the-art methods. Average values for the compared fusion methods and the proposed algorithm are shown in Table 1 across nine metrics, where the two best values for each metric are bold and underlined, respectively. As demonstrated by the statistical results, the proposed fusion method achieves the largest average values in four of the metrics, including CC, SCD, MS_SSIM, and FMI. It also achieves reasonable performance in EN and SD, producing the second largest average values. Our method also achieves the best performance for SCD, indicating that the correlation between our fused images and the source images is the highest. In addition, the largest average values for CC and MS_SSIM indicate that our fused images transfer more considerable information while preserving structural information in the input images. The values for FMI also prove that our method well preserves feature information from the source images to the fused images. These results indicate that our method can transfer more meaningful information from the source images, especially the richest fine-grained details and significant structural information.

Results analysis on RoadScene dataset

Qualitative results

An additional 24 image pairs showing different day and night scenes are selected from the RoadScene dataset, including cars, streetlights, roads, pedestrians, bicycles, trees, and houses. The fused results produced by different fusion methods are shown in Figs. 8 and 9. It is evident that undesirable artifacts appear in the CBF results, while the GTF and IFEVIP fused images do not retain details from the infrared image. This results in significant information loss, particularly in the red box region. In addition, FusionGAN produces under-exposed results and fails to retain sharp target edges. In contrast, NestFuse, IFCNN, MFEIF, and the proposed method obtain better fusion performance in subjective evaluations compared with the remaining fusion methods. Among these, the fused images obtained by the proposed method exhibit more reasonable luminance information.
Table 3
Quantitative comparisons of ablation studies using the TNO dataset

Method               EN      SF      SD      PSNR     CC      SCD     Nabf    MS_SSIM  FMI
Without attention    6.7770  0.0300  9.1252  63.9204  0.5477  1.6720  0.0543  0.8728   0.8888
Without detail loss  6.8592  0.0277  9.2555  63.9672  0.5472  1.7239  0.1110  0.8774   0.8870
Ours                 6.9476  0.0395  9.2430  64.1134  0.5630  1.8574  0.0993  0.9462   0.8997

Bold text indicates the best result

Quantitative results

The results of quantitative comparisons between our method and other state-of-the-art algorithms are provided in Table 2. It shows that our method achieves the largest average values across four metrics: SD, CC, SCD, and MS_SSIM. Our proposed method presents the best SD value, indicating the fused images exhibit the highest contrast. In addition, our algorithm produces the highest CC and MS_SSIM values, suggesting the fused results share strong correlation and structural information with the source images. The highest SCD value further implies that our fused images have less pseudo-information and the strongest correlation with source images.
In summary, both qualitative and quantitative results demonstrate that our proposed method achieves excellent performance in transferring more considerable information and highlighting significant contrast, which has remarkable advantages over other methods.

Ablation studies

Multi-scale attention analysis

The multi-scale attention module plays a critical role in our fusion network, as it enhances the contextual representation of the network on both local and global features. Therefore, we implement an ablation study in which the multi-scale attention module is excluded from the network; the results are shown in Fig. 10. It is evident that the resulting fused images still preserve the texture details of the source images, but with low contrast, and some of the visualized results exhibit a few artifacts.

Detail loss analysis

Ablation experiments are also included to determine the role of the detail loss. More specifically, we train a network without the detail loss term, the results of which are shown in Fig. 10. Notice that when the detail loss is removed, the fusion network fails to preserve useful information from the source images, specifically texture detail in background regions as well as pixel intensity and contours for salient targets. In addition, the results of quantitative comparisons are provided in Table 3, where all metrics except SD degrade when the detail loss is removed. These experimental results demonstrate the importance of the detail loss, which preserves the texture details in the fused images.
Table 4
Quantitative comparisons of 21 image pairs from the extended VIFB dataset

Metrics   GTF      IFEVIP   CBF      FusionGAN  IFCNN    NestFuse  MFEIF    Ours
EN        6.5061   6.9566   7.3149   6.3727     6.9083   6.9131    6.8695   7.0286
SF        0.0572   0.0612   0.0789   0.0361     0.0726   0.0579    0.0452   0.0529
SD        9.0553   9.3865   9.7274   8.3456     9.3688   9.5795    9.5951   9.7190
PSNR      61.7115  61.4836  62.4867  62.1179    63.3259  62.7005   63.7930  63.7914
CC        0.4845   0.5546   0.5144   0.5730     0.5942   0.5992    0.6216   0.6323
SCD       0.7584   1.2620   1.0509   0.8902     1.3786   1.4510    1.4925   1.5522
Nabf      0.0936   0.1528   0.3442   0.1092     0.1891   0.0925    0.0209   0.1075
MS_SSIM   0.7630   0.8481   0.7566   0.6757     0.9087   0.8585    0.8991   0.9354
FMI       0.8823   0.8908   0.8841   0.8807     0.8956   0.8949    0.8974   0.8923

The two best values for each metric are bold and underlined, respectively

Efficiency comparisons

To verify the computational efficiency of the fusion algorithms, the traditional methods are tested on the CPU, while the others are implemented on the GPU. As can be seen in Table 5, the average running time of the image fusion algorithms varies widely, and the running times of traditional methods are longer than those of deep learning-based methods, which benefit from GPU acceleration. Specifically, IFCNN, with its simple network architecture, is the fastest algorithm on all datasets. Our proposed fusion algorithm focuses on features at different scales and compensates for missing comprehensive information via attention modules; as such, its running time trails only IFCNN. Nevertheless, the experiments show that our fusion algorithm has an efficiency advantage over most other methods and is thus feasible for real-time applications.
Table 5
Average running time for all methods across three datasets (unit: second)

Method     TNO     RoadScene  VIFB
GTF        6.152   8.448      8.504
IFEVIP     0.078   0.089      0.096
CBF        17.342  26.056     32.184
FusionGAN  0.697   0.440      0.535
IFCNN      0.056   0.056      0.068
NestFuse   3.764   2.175      2.744
MFEIF      0.084   0.068      0.079
Ours       0.061   0.063      0.073

Bold text indicates the best results and underlined text represents the second best results

Extension to the VIFB dataset

To further verify the generalization ability of the proposed method, experiments are also conducted using the VIFB dataset, which includes 21 pairs of registered visible and infrared images. These samples not only cover a wide range of environments and working conditions (e.g., indoor, outdoor, low illumination, and over-exposure), but also include various image resolutions, such as \(320\times 240\), \(630\times 460\), \(512\times 184\), and \(452\times 332\).
Fused results for the VIFB dataset are shown in Figs. 11 and 12, where it is evident that GTF, FusionGAN, and NestFuse lose vital information. CBF is also seen to suffer from noise interference and other undesirable artifacts. In addition, IFEVIP fails to display significant targets due to overexposure in the visible images. In contrast, MFEIF, IFCNN, and the proposed method preserve detail information and highlighted targets from the source images. Quantitative results for the VIFB dataset are provided in Table 4, where it is evident that our method achieves the largest average values across three metrics: CC, SCD, and MS_SSIM. These metrics indicate that the fused results exhibit meaningful structure and texture information transferred from the source images. The proposed method trails CBF on the EN metric, but this is because the fused images generated by CBF contain additional noise.

Conclusion

In this paper, a novel lightweight deep learning fusion network based on multi-scale attention and hybrid dilated convolutional blocks is proposed to effectively improve the fusion of infrared and visible images. By designing hybrid dilated convolution blocks, the feature extraction module with a larger receptive field efficiently extracts more contextual information and fine-grained details without changing the size of the feature maps. The use of a unique total loss allows our proposed fusion network to simultaneously preserve texture features and salient target intensity from both infrared and visible images. In addition, the spatial/channel attention modules at different scales are designed to focus on shallow local and deep global detail features, which compensate for missing detail in the fusion process and improve the contrast of fused images. Experiments performed on two public infrared and visible image datasets demonstrate that our fused images not only include large amounts of detailed textural features but also reduce noise and artifacts. In addition, these experiments are extended to the VIFB dataset and further verify the generalizability of our proposed model.

Acknowledgements

This work is supported by the National Key Research and Development Program of China (Grant No. 2020YFC0833406), the National Natural Science Foundation of China (NSFC) under Grant No. 62102112, the Basic Research Plan of Guizhou Province (Grant No. Qiankehejichu-ZK[2021]Yiban310), the Guizhou Provincial Science and Technology Projects (Nos. QKHJCZK2022YB195 and QKHJCZK2023YB143), and the Youth Science and Technology Talents Cultivating Object of Guizhou Province (No. QJHKY2021104).

Declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Literature

21. Zhang R (2019) Making convolutional networks shift-invariant again. In: ICML
28. Huang Y, Yao K (2020) Multi-exposure image fusion method based on independent component analysis. In: Proceedings of the 2020 international conference on pattern recognition and intelligent systems (PRIS 2020). Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3415048.3416099
40. Hu J, Shen L, Albanie S, Sun G, Vedaldi A (2018) Gather-excite: exploiting feature context in convolutional neural networks. In: Advances in Neural Information Processing Systems (NeurIPS)
55. Ha Q, Watanabe K, Karasawa T, Ushiku Y, Harada T (2017) MFNet: towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp 5108–5115. IEEE, Vancouver, BC. https://doi.org/10.1109/IROS.2017.8206396