Unifying Dual-Attention and Siamese Transformer Network for Full-Reference Image Quality Assessment

Published: 12 July 2023

Abstract

Image Quality Assessment (IQA) is a critical task in computer vision. Most Full-Reference (FR) IQA methods have limitations in accurately predicting the perceptual quality of both traditionally distorted images and images distorted by Generative Adversarial Network (GAN)-based methods. To address this issue, we propose a novel method by Unifying Dual-Attention and Siamese Transformer Network (UniDASTN) for FR-IQA. An important contribution is the spatial attention module composed of a Siamese Transformer network and a feature fusion block. It can focus on significant regions and effectively map the perceptual differences between the reference and distorted images to a latent distance for distortion evaluation. Another contribution is the dual-attention strategy that exploits channel attention and spatial attention to aggregate features and enhance distortion sensitivity. In addition, a novel loss function is designed by jointly exploiting the Mean Squared Error (MSE), bidirectional Kullback–Leibler divergence, and the rank order of quality scores. The designed loss function offers stable training and thus enables the proposed UniDASTN to effectively learn visual perceptual image quality. Extensive experiments on standard IQA databases are conducted to validate the effectiveness of the proposed UniDASTN. The results demonstrate that the proposed UniDASTN outperforms some state-of-the-art FR-IQA methods on the LIVE, CSIQ, TID2013, and PIPAL databases.

1 INTRODUCTION

With the popularity of smartphones and the rapid development of image processing and communication technology, digital images are easily acquired and have been widely used in many systems [1]. For an image communication system, visual quality distortion may be introduced in every stage of the system, such as acquisition, compression, transmission, and display. As a result, efficient Image Quality Assessment (IQA) is in demand. In fact, IQA is a critical task of computer vision [2] and has attracted much attention in recent years. There are two types of IQA: subjective IQA and objective IQA. Since human observers are the final receiver of visual signals in most visual communication systems [3], subjective assessment by humans is a valid and effective method to gauge perceptual image quality, which is often expressed by the Mean Opinion Score (MOS) of the gathered subjective assessments. In general, subjective assessment is expensive, time-consuming, and labor-intensive. As a result, there is an urgent need to conceive easy and reliable objective IQA methods, which can be extensively employed in the optimization of parameters of various image processing methods [4, 5].

According to the degree of availability of the reference image information, objective IQA methods can be divided into three categories: Full-Reference (FR) IQA methods [6, 7, 8], Reduced-Reference (RR) IQA methods [9, 10, 11], and No-Reference (NR) IQA methods [12, 13, 14]. In the development of FR-IQA, the Mean Squared Error (MSE) and the Peak Signal-to-Noise Ratio (PSNR) have been two widely used methods. However, the MSE and PSNR are not well aligned with the perception of the Human Visual System (HVS) [15].

To design efficient IQA methods with high prediction accuracy, many researchers conducted extensive explorations to collect image features from the spatial domain or the transform domain, and then performed distance calculations on image features for distortion evaluation. Since the HVS characteristics and structural distortion are the two significant factors affecting the quality of visual perception, they are well considered in developing FR-IQA methods based on traditional hand-crafted features [16, 17, 18, 19, 20, 21, 22, 23].

With the development of deep learning, Convolutional Neural Networks (CNNs) have shown excellent performance in many image applications, such as segmentation and classification. Motivated by the success of CNNs, some researchers have used deep learning techniques to design various IQA methods. An early learning-based method is presented in [24]. This method combines feature learning and regression as a complete optimization process, and can well predict the quality of images. Nowadays, most learning-based methods extract visual sensitivity maps from IQA databases to predict image quality scores [25, 26, 27]. To consider the characteristics of HVS, some researchers designed a visual sensitivity map extraction method that matches human perception and achieved good results [28, 29]. Some other learning-based methods have achieved excellent performance in FR-IQA by focusing on the perceptual error of inter-image information [30, 31, 32]. These research results also show that the “perceptual loss” between images is related to human visual perception and plays a crucial role in quality assessment.

Recently, there has been rapid development in image restoration based on Generative Adversarial Networks (GANs) [33]. GAN-based image restoration methods can generate realistic-looking images that nevertheless contain unrealistic texture-like noise [34]. However, existing objective FR-IQA methods, such as the PSNR and Structural SIMilarity (SSIM) [35], have limitations in predicting the quality of GAN-based distorted images. Therefore, effective IQA methods for measuring the visual quality of images produced by GAN-based image restoration methods are needed.

In the past several years, the success of the Transformer network [36] in natural language processing has motivated and inspired various computer vision tasks [37]. Research results demonstrate that the Transformer network can significantly improve the performance of recognition tasks [38, 39, 40] and low-level vision [41]. For the IQA task, several Transformer-based studies [42, 43, 44, 45] have also been investigated and have shown advanced performance. However, Transformer-based IQA methods do not yet reach ideal performance. Therefore, more effort in developing IQA methods based on the Transformer network is required, especially concerning the various structural distortions of images and the characteristics of human visual attention. Furthermore, most deep learning-based IQA methods solely use the MSE as the loss function for quality score regression. It is difficult to produce an appropriate quality assessment using this regression approach alone, since it does not take into account the relative ranking relationship of distorted image quality scores.

To address the issue of accurate quality prediction, we propose a novel method by Unifying Dual-Attention and Siamese Transformer Network (UniDASTN) for FR-IQA. Since the proposed UniDASTN incorporates HVS characteristics and exploits the distortion information between the reference image and the distorted image, it can provide an accurate quality assessment of significant regions of human visual attention and obtain better results correlated with human visual perception. The main contributions of our proposed UniDASTN are highlighted below.

We propose a novel spatial attention module. This spatial attention module is composed of a novel Siamese Transformer network and a new feature fusion block. It can focus on significant regions in line with human visual perception and thus effectively map the perceptual differences between the reference and distorted images to a latent distance for distortion evaluation.

We propose a novel dual-attention strategy that exploits channel attention and spatial attention to adaptively extract much more significant features for accurate IQA. Specifically, we use a squeeze-and-excitation network to model channel attention and exploit the proposed spatial attention module to learn spatial attention for enhancing the distortion sensitivity of the extracted ViT feature representations.

We design a new joint loss function. The designed loss function combines MSE, Bidirectional Kullback–Leibler (KL) divergence, and rank order of quality scores. It can offer stable training and thus enables the proposed UniDASTN to learn visual perceptual image quality effectively.

The rest of this article is organized as follows. Section 2 reviews the related work. Section 3 explains the proposed UniDASTN in detail. Section 4 presents the experimental results of the proposed UniDASTN and the comparison with some state-of-the-art FR-IQA methods. Section 5 concludes our work.

2 RELATED WORK

In general, a reliable FR-IQA method should make accurate quality predictions for all types of distortion. In the literature, many researchers have designed meaningful FR-IQA methods in pursuit of high prediction performance. According to the techniques of image feature extraction, existing FR-IQA methods can be divided into the following three categories.

2.1 Traditional IQA Methods

Over the past few decades, a large body of research has been devoted to FR-IQA [2]. The aim of FR-IQA is to predict the quality of a target image with full access to its reference image. Many FR-IQA methods follow a two-stage framework, which first calculates the distance between the distorted and reference image features and then translates the distance into quality ratings. The most commonly used FR-IQA methods are the MSE and PSNR [46], which have been successfully used in many visual applications due to their simplicity and ease of use [47]. However, both methods measure image differences at the pixel level, which is inconsistent with the perceptual behavior of the HVS [15].

The HVS can extract structural information from a scene in a highly adaptive manner. To conduct IQA based on structural similarity, Wang et al. [35] proposed a useful technique called SSIM to evaluate the quality of a distorted image in terms of luminance, structure, and contrast. Then, an improved method called Multi-Scale SSIM (MS-SSIM) [8] was proposed by extending SSIM to multiple scales. Subsequently, many FR-IQA methods extract the structural information of images in the spatial domain, such as PSIM [16] and GMSD [21], which both extract gradient features in the spatial domain to describe structural changes. However, FR-IQA methods designed from the perspective of image structure are less effective in predicting the quality of specific distortion types, such as blurring and compression. Based on these observations, researchers started to investigate IQA methods based on low-level features, such as color, edge, and texture. For example, Zhang et al. [23] designed an efficient method called Feature SIMilarity (FSIM) to understand image scenes through low-level features. In another work, Zhang et al. [22] proposed a visual evaluation model to calculate the Visual Saliency-induced Index (VSI). Subsequently, some other useful IQA methods were proposed, such as Most Apparent Distortion (MAD) [18] and the Normalized Laplacian PyramiD (NLPD) [17].

Furthermore, some IQA methods try to extract features from transform domain because image quality degradation can be also shown in the transform domain. For example, Sheikh et al. investigated the HVS with the Information Fidelity Criterion (IFC) index [20] and proposed the Visual Information Fidelity (VIF) [19] method for quantifying the information errors between the distorted image and the reference image.

2.2 CNN-Based IQA Methods

There have been some attempts to address FR-IQA tasks with CNN techniques due to the success of CNNs in many vision applications. For example, Kang et al. [24] were the first to apply a CNN to the IQA task without the use of any existing features. Similar to other computer vision tasks, this method combines feature learning and regression into a complete optimization scheme. Specifically, several patches are cropped out of the image, and each patch is given a label corresponding to the visual quality score of the image for training. Later, visual sensitivity information of images was combined with CNNs to predict image quality scores. In another work, Kim and Lee [27] exploited a CNN to design a new FR-IQA method that learns human visual sensitivity to predict subjective scores by feeding distorted images and their objective error maps into the network.

Recently, Zhang et al. [32] designed a new FR-IQA method called LPIPS (Learned Perceptual Image Patch Similarity). The trained deep features of LPIPS optimized by the Euclidean distance between the reference image and the distorted image are effective in FR-IQA. In another work, Prashnani et al. [31] presented a new FR-IQA framework called PieAPP, which is based on pairwise learning. The results of PieAPP show that pairwise learning helps to improve image quality prediction. Subsequently, Ding et al. [30] introduced a novel FR-IQA method that unifies structure and texture similarity. This method shows good performance in predicting visual qualities of textural images and natural images.

2.3 Transformer-Based IQA Methods

The Vision Transformer (ViT) [38] is a seminal application of the Transformer to computer vision. This work demonstrated that the Transformer may "fully" replace the convolutions regularly used in deep neural networks. In general, a pure Transformer architecture includes three important components, i.e., Multi-Head Attention (MHA), a Multi-Layer Perceptron (MLP), and Layer Normalization (LN). The ViT exploits a pure Transformer to conduct image classification by treating an image as a sequence of patches. The ViT can learn with minimal inductive bias. Compared with CNNs, the relational representation learned by the Transformer is more general and robust than the local patterns of convolutional modules [48].

The ViT is an efficient technique based on self-attention. Since the self-attention layer collects global information from the whole input sequence, it can measure “perceptual loss” between the distorted image and the reference image when predicting image quality. This design might be helpful in the FR-IQA task because recent studies [42, 43, 44, 45] have proved the usefulness of deep features in the IQA task.

Inspired by ViT, You et al. [45] proposed the first Transformer-based IQA model, named TRIQ. TRIQ uses a CNN to extract feature maps, which are sent to a shallow Transformer encoder to solve blind IQA tasks. By combining the inductive capability of the CNN architecture with the aggregated attention representation of the Transformer encoder, TRIQ achieves outstanding performance. Subsequently, Cheon et al. [42] designed a new Transformer model named IQT for FR-IQA. In IQT, an encoder-decoder architecture is employed to measure similarity. In another work, Ke et al. [44] used the Transformer to build a new method called MUSIQ (MUlti-Scale Image Quality), which can handle images with different resolutions by mapping the input image to a multi-scale representation.

3 PROPOSED METHOD

In this section, we introduce the overall architecture of the proposed UniDASTN. As shown in Figure 1, the proposed UniDASTN takes pairs of reference images and distorted images as input, and it consists of five key components: a feature extraction module, a channel attention module, a feature embedding module, a spatial attention module, and a prediction head. Sections 3.1–3.5 present the five components in detail. Section 3.6 explains our loss function.

Fig. 1. Overview of the proposed UniDASTN. The proposed UniDASTN takes pairs of distorted images and reference images as input and then produces feature maps using a pre-trained ViT. The feature maps are input to a Squeeze-and-Excitation network to focus on the most informative channel features before being converted to fixed-size vectors, and then fed into a spatial attention module for training. The final output is a score for assessing image quality.

3.1 Feature Extraction Module

As depicted in Figure 1, the front part of the feature extraction module is a two-branch pre-trained ViT, which acts as a feature extraction backbone that extracts low-level features such as color, texture, and shape, as well as high-level semantic features. We crop out M patches from the input reference and distorted images separately. Let \(p_{i}^{ref}\) and \(p_{i}^{dist}\) be the \({i}\)th patches of the input reference and distorted images, where \(i\in \lbrace 1,2,\ldots , M \rbrace\). The \({i}\)th input patch \(p_{i}^{ref}\) of the reference image and the \({i}\)th input patch \(p_{i}^{dist}\) of the distorted image are fed to the feature extractor to obtain the corresponding feature representations, which are extracted from four intermediate layers of the ViT, i.e., \(layer \in \lbrace 3,5,8,11\rbrace\). Each of these feature representations \(f_{layer} \in \mathbb {R}^{ c\times w\times h}\) has the same shape. Finally, we concatenate the four feature maps along the channel dimension to obtain the overall feature maps \(f_{ref} \in \mathbb {R}^{C\times w\times h }\) and \(f_{dist} \in \mathbb {R}^{C\times w\times h }\), where \(C=c\times 4\).

To construct perceptual differences between the reference and distorted images, a simple subtraction operation between the feature maps \(f_{ref}\) and \(f_{dist}\) is conducted to get the difference map \(f_{diff}\) as follows: (1) \(\begin{align} f_{diff} = f_{ref} - f_{dist}. \end{align}\)
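For concreteness, the following is a minimal PyTorch sketch of this extraction-and-subtraction step. It assumes a timm ViT-B/8 backbone ("vit_base_patch8_224") and uses forward hooks to tap intermediate blocks; the 0-based layer indices, the hook mechanics, and the token-to-grid reshaping are illustrative assumptions rather than the exact implementation.

```python
import torch
import timm

# Minimal sketch of the feature extraction step, assuming a timm ViT-B/8 backbone.
# Layer indices follow the paper; hook mechanics and reshaping are assumptions.
vit = timm.create_model("vit_base_patch8_224", pretrained=True).eval()
layers = [3, 5, 8, 11]              # intermediate Transformer blocks to tap
captured = {}

def make_hook(idx):
    def hook(module, inputs, output):
        captured[idx] = output       # (B, 1 + N, c) token sequence
    return hook

for idx in layers:
    vit.blocks[idx].register_forward_hook(make_hook(idx))

def extract_features(patches):
    """patches: (B, 3, 224, 224) -> concatenated feature map (B, 4c, h, w)."""
    captured.clear()
    with torch.no_grad():
        vit(patches)
    maps = []
    for idx in layers:
        tokens = captured[idx][:, 1:, :]          # drop the class token
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)                     # 28 x 28 for ViT-B/8 at 224
        maps.append(tokens.transpose(1, 2).reshape(b, c, h, w))
    return torch.cat(maps, dim=1)                 # channel dim: C = 4c

# Perceptual difference map (Equation (1)): simple subtraction.
p_ref, p_dist = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
f_ref, f_dist = extract_features(p_ref), extract_features(p_dist)
f_diff = f_ref - f_dist
```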

To illustrate the effectiveness of the difference map for FR-IQA, some visual examples are presented in Figure 2. In this figure, the first row shows a reference image and its distorted images with different distortions introduced by additive white Gaussian noise. The second row shows their corresponding feature maps, and the third row shows the difference maps. It can be seen that as more distortions are introduced to the image, there are significant changes in the difference map. This means that the difference map can be used for FR-IQA.

Fig. 2. A reference image, its distorted images, their feature maps, and difference maps.

After that, these feature maps are sent to a 1 × 1 convolutional layer to aggregate the information of each channel and reduce the channel dimension.

3.2 Channel Attention Module

The channel attention module is made up of a Squeeze-and-Excitation network. Firstly, the feature maps generated by the feature extraction module are squeezed to acquire global features at the channel level. The global features are then subjected to an excitation operation to learn the channel relationships and acquire the channel weights. Finally, the channel weights are multiplied by the input feature maps to generate the final features.

In summary, the channel attention module performs channel re-weighting. This attention mechanism enables the network to prioritize relevant channel features while suppressing unimportant ones. Note that the maps \(f_{ref}\), \(f_{dist}\), and \(f_{diff}\) are separately fed to the channel attention module. Taking \(f_{diff}\) as an example, the whole process can be formally described as follows: (2) \(\begin{align} z_k &= F_{sq}(f_{diff})=\frac{1}{h\times w} \sum _{i=1}^{h} \sum _{j=1}^{w} f_{diff}(k,i,j), \quad (k=1, 2, \dots , C), \end{align}\) (3) \(\begin{align} z &= [z_1, z_2, \dots , z_C], \end{align}\) (4) \(\begin{align} s &= F_{ex}(z)=\mathrm{Sigmoid}(\mathrm{FC}(\mathrm{ReLU}(\mathrm{FC}(z)))), \end{align}\) (5) \(\begin{align} \tilde{f_{diff}} &= (f_{diff}\cdot s)+f_{diff}. \end{align}\)
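A minimal PyTorch sketch of Equations (2)–(5) is given below. The reduction ratio r of the excitation MLP is an assumption (the paper does not state it); the residual form follows Equation (5).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation channel attention sketch for Equations (2)-(5).
    The reduction ratio r is an assumption; the residual (f*s + f) follows Eq. (5)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:   # f: (B, C, h, w)
        z = f.mean(dim=(2, 3))                  # squeeze: global average pooling, Eq. (2)-(3)
        s = self.fc(z)                          # excitation, Eq. (4)
        s = s.view(f.size(0), -1, 1, 1)
        return f * s + f                        # channel re-weighting with residual, Eq. (5)
```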

3.3 Feature Embedding Module

To feed these three feature maps into the spatial attention module for training, we need to perform feature embedding operations on them.

Since our Siamese Transformer network utilizes constant latent vectors of size D in all layers, each feature map is flattened along its spatial dimensions into N tokens without destroying the pixel information, where \(N=h \times w\). Thus, the intermediate embedding features can be determined by the following equation. (6) \( \begin{equation} z^{diff} = \left[ \tilde{f_{diff}^{1}}; \tilde{f_{diff}^{2}}; \cdots ; \tilde{f_{diff}^{N}} \right] \in \mathbb {R}^{N \times D }. \end{equation} \)

Similar to typical ViT [38] models, we prepend a learnable embedding called the "quality token" \(f_{qua}\) to the sequence \(z^{diff}\); its state at the output of the network is viewed as the predicted quality. Here the function of the "quality token" is similar to that of the "classification token" in the classical ViT. In addition, a position embedding \(f_{pos}^{diff} \in \mathbb {R}^{\left(N+1 \right)\times D }\) is also added to \(z^{diff}\) to maintain positional information. Consequently, an embedding feature with positional information is obtained as follows: (7) \(\begin{equation} z_{0}^{diff} = \left[ f_{qua }; \tilde{f_{diff}^{1}}; \cdots ; \tilde{f_{diff}^{N}} \right] + f_{pos}^{diff}. \end{equation}\)

Similarly, the embedding features of \(f_{ref}\) and \(f_{dist}\) can be determined by the Equations (8) and (9), respectively. (8) \(\begin{align} &z_{0}^{ref} =\left[ f_{qua}; \tilde{f_{ref}^{1}};\cdots ; \tilde{f_{ref}^{N}} \right] + f_{pos}^{ref}, \end{align}\) (9) \(\begin{align} &z_{0}^{dist} =\left[ f_{qua}; \tilde{f_{dist}^{1}};\cdots ; \tilde{f_{dist}^{N}} \right] + f_{pos}^{dist}. \end{align}\)

A visual example of feature embedding is provided in Figure 3 to facilitate comprehension. The inputs of the feature embedding module are feature maps of size \(28 \times 28 \times 256\) output from the channel attention module. We reshape each feature map along its spatial dimensions to produce intermediate embedding features of size \(784 \times 256\). Finally, the final embedding features of size \(785 \times 256\) are obtained by adding the quality token and the position embedding. Next, these embedded features are sent to the spatial attention module for training.
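A minimal sketch of this embedding step, under the shapes of Figure 3, is shown below; the zero initialization of the learnable token and position embedding is an assumption.

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Sketch of the feature embedding step: flatten the spatial grid into tokens,
    prepend a learnable quality token, and add a learnable position embedding.
    Shapes follow Figure 3 (28 x 28 x 256 -> 785 x 256); initialization is assumed."""
    def __init__(self, dim: int = 256, num_tokens: int = 28 * 28):
        super().__init__()
        self.quality_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, dim))

    def forward(self, f: torch.Tensor) -> torch.Tensor:    # f: (B, D, h, w)
        b = f.size(0)
        tokens = f.flatten(2).transpose(1, 2)               # (B, h*w, D), Eq. (6)
        qua = self.quality_token.expand(b, -1, -1)
        tokens = torch.cat([qua, tokens], dim=1)             # prepend the quality token
        return tokens + self.pos_embed                       # Eq. (7)-(9)
```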

Fig. 3. Detailed process of feature embedding module.

3.4 Spatial Attention Module

The spatial attention module consists of a novel Siamese Transformer network and a new feature fusion block. It has three inputs, i.e., embedding features of the distorted image, embedding features of the reference image, and embedding features of the perceptual differences between distorted and reference images. Unlike conventional FR-IQA methods that only perform distance calculation between the reference and distorted images, the spatial attention module employs a Siamese Transformer network to learn information from perceptual differences between the distorted image features and the reference image features. This spatial attention module exploits the latent distance to calculate the perceptual differences between the reference and distorted images for quality prediction.

As shown in Figure 3, we connect each pixel in the feature map by channel dimension and use the concatenated vector as the token input to the siamese transformer network. As a result, the MHA employed in the transformer encoder and the transformer decoder will act as special spatial attention. Its role is to assign various weights to different pixels and thus allow the network to focus on significant features while suppressing irrelevant ones. This strategy can better learn perceptual image features.

In summary, the calculations of the spatial attention module can be formulated as Equations (10)–(23), where L denotes the number of the module layers. (10) \(\begin{align} &q_{l}= k_{l}= v_{l}=z_{l-1}^{diff}, \end{align}\) (11) \(\begin{align} &z_{l}^{\ast } = \mathrm{LN}\left(\mathrm{MSA}\left(q_{l},k_{l},v_{l} \right) + z_{l-1}^{diff} \right), \end{align}\) (12) \(\begin{align} &z_{l}^{diff} = \mathrm{LN}\left(\mathrm{MLP}\left(z_{l}^{\ast } \right) + z_{l}^{\ast } \right) ,l=1, 2, \ldots , L , \end{align}\) (13) \(\begin{align} &z_{L}^{diff}=\left[ z_{E_{qua}},z_{E_{0}},z_{E_{1}},\ldots ,z_{E_{N}} \right], \end{align}\) where MSA(\(\cdot\)) is the operation of Multi-head Self-Attention (MSA).

The inputs of one transformer decoder are \(z_{0}^{dist}\), and \(z_{L}^{diff}\) which is the output of the transformer encoder. Similarly, the inputs of the other transformer decoder are \(z_{0}^{ref}\) and \(z_{L}^{diff}\). The calculations of one single branch of the process of spatial attention module can be formulated as Equations (14)–(18). (14) \(\begin{align} &q_{l}= k_{l}= v_{l}=z_{l-1}^{ref}, \end{align}\) (15) \(\begin{align} &z_{l}^{\ast } = \mathrm{LN}\left(\mathrm{MSA}\left(q_{l},k_{l},v_{l} \right) + z_{l-1}^{ref} \right), \end{align}\) (16) \(\begin{align} &\hat{q_{l} } = z_{l}^{\ast },\hat{k_{l} }=\hat{v_{l} }=z_{L}^{diff}, \end{align}\) (17) \(\begin{align} &z_{l}^{\ast \ast }=\mathrm{LN}\left(\mathrm{MCA}\left(\hat{q_{l} }, \hat{k_{l} },\hat{v_{l} }\right) + z_{l}^{\ast } \right), \end{align}\) (18) \(\begin{align} &z_{l}^{ref }=\mathrm{LN}\left(\mathrm{MLP}\left(z_{l}^{\ast \ast } \right) +z_{l}^{\ast \ast } \right) ,l=1, 2, \ldots , L, \end{align}\) where MCA(\(\cdot\)) is the operation of Multi-head Cross-Attention (MCA).

Note that the outputs of the siamese transformer network are \(z_{L}^{ref\sim diff}\) and \(z_{L}^{dist\sim diff}\), which are defined by Equations (19)–(20). (19) \(\begin{align} &z_{L}^{ref\sim diff} = [ z_{R_{qua}},z_{R_{0}},z_{R_{1}},\ldots ,z_{R_{N}}], \end{align}\) (20) \(\begin{align} &z_{L}^{dist\sim diff} = [ z_{D_{qua}},z_{D_{0}},z_{D_{1}},\ldots ,z_{D_{N}}]. \end{align}\)
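The following PyTorch sketch illustrates one encoder layer and one decoder layer matching Equations (10)–(18). `nn.MultiheadAttention` stands in for MSA/MCA; the GELU activation in the MLP and the sharing of decoder weights across the two branches are assumptions, not details stated in the paper.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Post-norm Transformer encoder layer following Equations (10)-(12)."""
    def __init__(self, dim: int = 256, heads: int = 8, mlp_dim: int = 1024):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (B, N+1, D) tokens
        x = self.ln1(self.msa(x, x, x, need_weights=False)[0] + x)            # Eq. (10)-(11)
        return self.ln2(self.mlp(x) + x)                                       # Eq. (12)

class DecoderLayer(nn.Module):
    """Decoder layer with self-attention plus cross-attention to the encoded
    difference tokens, following Equations (14)-(18)."""
    def __init__(self, dim: int = 256, heads: int = 8, mlp_dim: int = 1024):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mca = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.ln3 = nn.LayerNorm(dim)

    def forward(self, x, memory):                # x: branch tokens, memory: z_L^diff
        x = self.ln1(self.msa(x, x, x, need_weights=False)[0] + x)             # Eq. (14)-(15)
        x = self.ln2(self.mca(x, memory, memory, need_weights=False)[0] + x)   # Eq. (16)-(17)
        return self.ln3(self.mlp(x) + x)                                        # Eq. (18)

# Siamese behaviour: the same decoder stack is applied to both branches, e.g.
#   z_ref_out  = decoder(z0_ref,  z_diff_encoded)
#   z_dist_out = decoder(z0_dist, z_diff_encoded)
```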

Obviously, the output features \(z_{L}^{ref\sim diff}\) and \(z_{L}^{dist\sim diff}\) are complementary. Therefore, it is necessary to aggregate the features of both branches. To do so, an additional feature fusion block is designed. Figure 4 depicts the diagram of our proposed feature fusion block. The proposed feature fusion block can efficiently fuse information from both branches of the siamese transformer network. It makes the proposed UniDASTN more effective in predicting image quality.

Fig. 4. Diagram of the proposed feature fusion block.

Specifically, a sequence of tokens called \(\tilde{z_{i}}\) is constructed by concatenating all tokens in \(z_{L}^{ref\sim diff}\) and \(z_{L}^{dist\sim diff}\) except for the “quality token”. And then a new “quality token” \(\tilde{z_{qua}}\) is determined by the summation of \(z_{R_{qua}}\) and \(z_{D_{qua}}\). Consequently, a new sequence of tokens called \(\tilde{z}\) can be obtained by using \(\tilde{z_{qua}}\) and \(\tilde{z_{i}}\). Formal descriptions of these calculations are defined in Equations (21)–(23). Next, \(\tilde{z}\) is delivered to a transformer encoder for learning information from every token to help to improve the prediction performance. (21) \(\begin{align} &\tilde{z_{i} } = \mathrm{concatenate}\left(z_{R_{0} },\ldots ,z_{R_{N} },z_{D_{0} },\ldots ,z_{D_{N} } \right) , \end{align}\) (22) \(\begin{align} &\tilde{z_{qua}} = z_{R_{qua} } + z_{D_{qua} } , \end{align}\) (23) \(\begin{align} &\tilde{z} = \left[ \tilde{z_{qua}},\tilde{z_{i} } \right]. \end{align}\)
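A minimal sketch of the token manipulation in Equations (21)–(23) is given below; it assumes the quality token occupies position 0 of each branch output, as in the embeddings of Equations (7)–(9).

```python
import torch

def fuse_branches(z_ref_out: torch.Tensor, z_dist_out: torch.Tensor) -> torch.Tensor:
    """Sketch of the feature fusion block (Equations (21)-(23)): sum the two quality
    tokens and concatenate the remaining tokens of both branches. The fused sequence
    is then passed to a further Transformer encoder."""
    qua = z_ref_out[:, :1, :] + z_dist_out[:, :1, :]                          # Eq. (22)
    rest = torch.cat([z_ref_out[:, 1:, :], z_dist_out[:, 1:, :]], dim=1)      # Eq. (21)
    return torch.cat([qua, rest], dim=1)                                      # Eq. (23)
```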

3.5 Prediction Head

The quality score is finally determined in the prediction head. The “quality token” \(\tilde{z_{qua}} \in \mathbb {R}^{1\times D }\) output by the spatial attention module contains the quality information. It is input to the prediction head. The prediction head is achieved by MLP which is composed of two fully connected layers. In the MLP, the first layer is activated by the ReLU, and the second layer only outputs one channel to predict a score, i.e., the final predicted score.
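A short sketch of such a two-layer MLP head is shown below; the hidden width of 512 follows the value of Dhead reported in Section 4.1, while the class structure itself is illustrative.

```python
import torch.nn as nn

class PredictionHead(nn.Module):
    """Two-layer MLP prediction head: ReLU after the first layer, a single
    output channel for the quality score (hidden width Dhead = 512, Section 4.1)."""
    def __init__(self, dim: int = 256, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(hidden, 1))

    def forward(self, quality_token):            # quality_token: (B, D)
        return self.mlp(quality_token).squeeze(-1)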

3.6 Loss Function

The novel loss function is designed to evaluate the predicted image scores and optimize the proposed model. Note that the MSE loss is sensitive to outliers. If just the MSE loss is utilized as the loss function for training, the training process is subject to substantial gradient changes caused by outliers, making loss convergence unstable.

To obtain more accurate image prediction scores, we design a novel joint loss function to optimize the proposed UniDASTN for stable training. Our loss function consists of three components, including the MSE loss \(L_{MSE}\), the bidirectional KL divergence loss \(L_{BKL}\), and the rank loss \(L_{RANK}\). The MSE loss \(L_{MSE}\) is defined as (24) \(\begin{equation} L_{MSE} = \sum _{i=1}^{K} MSE(p_i,\hat{q_i}) , \end{equation}\) where \(p_i\) and \(\hat{q_i}\) are the predicted score and the MOS of the ith image, \(MSE(\cdot ,\cdot)\) is the standard MSE, and K is the image number.

We select KL divergence as one component of the loss function because we want to measure the distance between the distributions of the predicted scores and the MOS values. However, the standard KL divergence is asymmetric: it assesses the information lost when one distribution is used to approximate the other, rather than the proximity of the two distributions. We therefore modify the original KL divergence into a bidirectional KL divergence for the IQA task to measure the differences between predicted scores and MOS values.

The bidirectional KL divergence loss \(L_{BKL}\) is defined as (25) \(\begin{equation} \begin{aligned}L_{BKL} &= KL\left(Q||\frac{1}{2}(Q+\hat{Q})\right)+KL\left(\hat{Q}||\frac{1}{2}(Q+\hat{Q})\right)\\ &=\sum _{i=1}^{K}Q_i \log \frac{Q_i}{\frac{1}{2}(Q_i+\hat{Q}_i)} +\sum _{i=1}^{K}\hat{Q}_i \log \frac{\hat{Q}_i}{\frac{1}{2}(Q_i+\hat{Q}_i)}, \end{aligned} \end{equation}\) where \(\hat{Q}_i\) and \(Q_i\) are the outputs of softmax regression defined as follows: (26) \(\begin{equation} \hat{Q}_i = \frac{e^{\hat{q}_i}}{\sum _{i=1}^{K}e^{\hat{q}_i}} , \end{equation}\) (27) \(\begin{equation} Q_i = \frac{e^{p_i} }{\sum _{i=1}^{K}e^{p_i} } . \end{equation}\)
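A minimal sketch of Equations (25)–(27), assuming the predicted scores and MOS values of the K images in a batch are given as 1-D tensors:

```python
import torch
import torch.nn.functional as F

def bidirectional_kl_loss(pred_scores: torch.Tensor, mos: torch.Tensor) -> torch.Tensor:
    """Bidirectional KL loss sketch for Equations (25)-(27): both score sets are
    turned into distributions with softmax and compared against their mixture."""
    q = F.softmax(pred_scores, dim=0)            # Q, Eq. (27)
    q_hat = F.softmax(mos, dim=0)                # Q_hat, Eq. (26)
    m = 0.5 * (q + q_hat)
    kl_q_m = torch.sum(q * torch.log(q / m))
    kl_qhat_m = torch.sum(q_hat * torch.log(q_hat / m))
    return kl_q_m + kl_qhat_m                    # Eq. (25)
```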

Furthermore, only considering the regression ability of the network on the quality scores is not consistent with the perception of human vision. This is because, from a regression point of view, predictions with the same magnitude of error are considered to achieve the same performance. However, the same regression error does not guarantee that the scores predicted by the network for the distorted images preserve the rank order of the MOS values.

Motivated by the concept of “rank loss”, we consider that the rank order of the prediction scores should be consistent with the ground truth for reflecting the rank order among the degraded images in the design of the loss function of the network. In a previous study [31], the network is trained by considering the probability of preference for each image pair with different distortions. In another study [29], the authors think that when the rank of the output scores does not agree with the rank of the ground truth scores regardless of their distortion types, the network should be penalized. Inspired by [29], we devise the following pairwise rank loss \(L_{r}(I_i,I_j)\) to determine the rank order of a pair of distorted images \(I_i\) and \(I_j\) as follows: (28) \(\begin{align} L_{r}(I_i,I_j) &= max\left(0,\frac{-(\hat{M}_i -\hat{M}_j)(M_i -M_j)}{|\hat{M}_i -\hat{M}_j |+ \varepsilon } \right)\times \mu , \end{align}\) (29) \(\begin{align} \mu &= (M_i -\hat{M}_i)^{2}+(M_j -\hat{M}_j)^{2}, \end{align}\) where \(\varepsilon\) is a small stability term and \(\mu\) is a penalty term. Note that \(L_{r}(I_i,I_j)\) is always 0 if the rank order of the predicted scores exactly matches that of MOS values. Otherwise, it is a positive number that penalizes the network. Therefore, we can obtain the total rank loss \(L_{RANK}\) by considering all pair rank losses as follows: (30) \(\begin{equation} \begin{aligned}L_{RANK} &= L_{r}(I_1,I_2) + L_{r}(I_1,I_3) + \dots + L_{r}(I_1,I_K)\\ &\quad +L_{r}(I_2,I_3) + L_{r}(I_2,I_4) + \dots + L_{r}(I_2,I_K)\\ &\quad +\dots \\ &\quad +L_{r}(I_i,I_{i+1}) + L_{r}(I_i,I_{i+2}) + \dots + L_{r}(I_i,I_K)\\ &\quad +\dots \\ &\quad +L_{r}(I_{K-1},I_K) \end{aligned} \end{equation}\)

Consequently, the total loss can be assessed as a weighted sum of these three loss items as follows: (31) \(\begin{equation} L_{total} = \alpha L_{MSE}(M,\hat{M}) + \beta L_{BKL}(Q,\hat{Q}) + \gamma L_{RANK} , \end{equation}\) where \(\alpha\), \(\beta ,\) and \(\gamma\) are the hyper-parameters for adjusting the weights.
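The sketch below illustrates the pairwise rank loss of Equations (28)–(30) and the weighted total loss of Equation (31), with the weights reported in Section 4.1. It reuses the bidirectional_kl_loss sketch given after Equation (27); the assignment of predicted scores versus MOS values to \(\hat{M}\) and \(M\) in the denominator follows one reading of Equation (28) and is an assumption.

```python
import torch

def rank_loss(pred: torch.Tensor, mos: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Pairwise rank loss sketch for Equations (28)-(30), summed over all image
    pairs in a batch. pred and mos are 1-D tensors of length K."""
    total = pred.new_tensor(0.0)
    k = pred.shape[0]
    for i in range(k):
        for j in range(i + 1, k):
            num = -(pred[i] - pred[j]) * (mos[i] - mos[j])
            hinge = torch.clamp(num / (torch.abs(pred[i] - pred[j]) + eps), min=0.0)  # Eq. (28)
            mu = (mos[i] - pred[i]) ** 2 + (mos[j] - pred[j]) ** 2                    # Eq. (29)
            total = total + hinge * mu
    return total

def total_loss(pred, mos, alpha=1.0, beta=0.5, gamma=0.1):
    """Weighted sum of Equation (31) with the weights reported in Section 4.1."""
    mse = torch.sum((pred - mos) ** 2)                                                # Eq. (24)
    # bidirectional_kl_loss is the sketch given after Equation (27) above.
    return alpha * mse + beta * bidirectional_kl_loss(pred, mos) + gamma * rank_loss(pred, mos)
```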

4 EXPERIMENTAL RESULTS

4.1 Implementation Details of Our UniDASTN

The implementation details of the proposed UniDASTN are provided here. In the training phase, the hyper-parameters of the loss function are \(\alpha =1.0\), \(\beta =0.5,\) and \(\gamma =0.1\). The data augmentation strategies we use include horizontal flipping, random rotation, and random cropping. Specifically, we apply a horizontal flip with a probability of 0.5 and, with equal probability of 0.25, a rotation of 90, 180, or 270 degrees or no rotation. Next, the images are randomly cropped into \(224\times 224\times 3\) patches for the feature extraction backbone. The training is carried out with a batch size of 8 and employs the Adam optimizer for 300 epochs using the designed joint loss function. The initial learning rate is set to 1e-4, and a weight decay of 1e-5 is used. The learning rate of each parameter is scheduled with cosine annealing, where \(\eta _{max}\) is set to the initial learning rate, \(\eta _{min}\) is set to 0, and the cosine period \(T_{cur}\) is set to 50. The value of \(\varepsilon\) used in the loss function is set to 1e-8.
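A minimal sketch of this optimizer and scheduler configuration in PyTorch is shown below; the placeholder model stands in for the UniDASTN network, and the training-step body is omitted.

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 1)    # placeholder; stands in for the UniDASTN network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=0)

for epoch in range(300):
    # ... one pass over the training batches (batch size 8) with the joint loss ...
    optimizer.step()
    scheduler.step()         # cosine annealing of the learning rate
```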

In the testing phase, we perform test-time augmentation on the images by cropping into M overlapping patches of size \(224\times 224\times 3\) and predict the quality score by averaging the scores of all crops, where M is set to 20.

The hyper-parameters of the proposed UniDASTN are as follows. (1) The number of layers in the spatial attention module is 2, i.e., \(L=2\). (2) The head numbers of MSA and MCA are both 8. (3) The dimension of per head is 32. (4) The dimension of Transformer is 256, i.e., \(D=256\). (5) The dimension of the MLP in Transformer is 1024. (6) The dimension of the first layer in prediction head is 512, i.e., \(Dhead=512\).

4.2 Evaluation Criteria

In the following experiments, three popular IQA metrics are employed to quantify performance. These three metrics are: the Pearson Linear Correlation Coefficient (PLCC) [49], the Spearman Rank-order Correlation Coefficient (SRCC) [50], and the Kendall Rank-order Correlation Coefficient (KRCC) [51]. The PLCC is used to assess the ability of predicting subjective scores with low error. The SRCC and KRCC are used to measure the prediction monotonicity of an FR-IQA method. The ranges of PLCC, SRCC, and KRCC are all \([-1, 1]\). A positive value indicates a positive correlation between the two sets of data, and a higher value indicates better performance.
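These three coefficients can be computed with SciPy as sketched below. Note that, in practice, the PLCC is often computed after a nonlinear mapping of the objective scores onto the MOS scale (as done for the fitted curves in Section 4.4); the snippet shows only the raw correlations.

```python
from scipy import stats

def iqa_metrics(pred_scores, mos):
    """PLCC, SRCC, and KRCC between objective predictions and subjective MOS values."""
    plcc, _ = stats.pearsonr(pred_scores, mos)
    srcc, _ = stats.spearmanr(pred_scores, mos)
    krcc, _ = stats.kendalltau(pred_scores, mos)
    return plcc, srcc, krcc
```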

4.3 Datasets

We use five open databases to conduct extensive experiments. Specifically, the used databases include the TID2013 [52], the Categorical Subjective Image Quality (CSIQ) database [18], the LIVE IQA database [53], KADID-10k [54], and PIPAL [34]. These five databases are commonly used in IQA research. Table 1 summarizes the basic information of these databases. Traditional IQA databases are built for general IQA tasks, and they include some common types of distortions, such as JPEG compression, JPEG2000 compression, white noise injection and blurring.

Database   | Ref Imgs | Dist Imgs | Distortion Type            | Num of Types | Score Type | Env
KADID-10k  | 81       | 10.1k     | traditional                | 25           | DMOS       | crowdsourcing
LIVE       | 29       | 779       | traditional                | 5            | DMOS       | lab
CSIQ       | 30       | 866       | traditional                | 6            | DMOS       | lab
TID2013    | 25       | 3000      | traditional                | 24           | MOS        | lab
PIPAL      | 200      | 23.2k     | traditional + alg. outputs | 40           | MOS        | crowdsourcing

Table 1. IQA Databases for Model Training and Performance Evaluation

4.4 Performance Comparison

To demonstrate the advantage, we conduct two types of experiments to compare our UniDASTN and some state-of-the-art FR-IQA methods. In the comparisons, conventional FR-IQA methods and deep learning-based FR-IQA methods are both selected. The first type is to validate the quality prediction performance on the same dataset. This evaluation experiment is to test the accuracy of the quality prediction of each objective IQA method. The other type is to validate each FR-IQA method across datasets. Such evaluation experiment is to verify the generalization ability of each objective FR-IQA method.

Evaluation on the same dataset. Following the suggestions in [25, 29, 59], we randomly divide each dataset into a training set (80% of the images) and a test set (20% of the images) based on the reference images. During network training, only the data from the training set are visible to the FR-IQA methods. We select the model weights based on training and use the test set to evaluate the final performance. We repeat the experiment five times with different seeds, and the results are averaged for a fair comparison.
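A minimal sketch of such a content-independent split is given below; the function name and the use of Python's random module are illustrative assumptions.

```python
import random

def split_by_reference(ref_ids, train_ratio=0.8, seed=0):
    """Sketch of the content-independent split: 80% of the reference images (and
    all of their distorted versions) go to training, the rest to testing."""
    unique_refs = sorted(set(ref_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_refs)
    cut = int(train_ratio * len(unique_refs))
    train_refs = set(unique_refs[:cut])
    train_idx = [i for i, r in enumerate(ref_ids) if r in train_refs]
    test_idx = [i for i, r in enumerate(ref_ids) if r not in train_refs]
    return train_idx, test_idx
```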

We compare our UniDASTN method with some state-of-the-art FR-IQA methods on the traditional distorted types of datasets, i.e., LIVE, CSIQ, and TID2013. In this experiment, our UniDASTN is compared with a total of 17 FR-IQA methods, including 10 non-deep-learning-based methods and 7 deep-learning-based methods. The 10 non-deep-learning-based methods are PSNR [46], SSIM [35], MS-SSIM [8], FSIMc [23], VSI [22], MAD [18], VIF [19], NLPD [17], GMSD [21], and SCQI [55]. The seven deep-learning-based methods are DOG-SSIMc [65], DeepQA [27], DualCNN [66], WaDiQaM-FR [25], PieAPP [31], JND-SalCAR [29], and AHIQ [59].

The PLCC and SRCC results of these methods are shown in Table 2. Note that the results of the compared methods are taken from the corresponding original papers and [30]. In Table 2, the best results are in bold and the missing results are marked with “–”. It can be seen that our UniDASTN consistently outperforms all state-of-the-art FR-IQA methods in terms of PLCC and SRCC on three IQA datasets. Furthermore, Table 2 shows that our UniDASTN made significant progress in CSIQ and TID2013 when compared to other state-of-the-art FR-IQA methods. Although some existing methods have achieved high levels of visual quality score assessment for the LIVE dataset, our UniDASTN still makes some progress on the SRCC metric. These results illustrate that our UniDASTN achieves solid improvements over the previous methods and can cope well with traditional distortion types of images.

Method           | LIVE (PLCC/SRCC) | CSIQ (PLCC/SRCC) | TID2013 (PLCC/SRCC)
PSNR [46]        | 0.865/0.873      | 0.819/0.810      | 0.677/0.687
SSIM [35]        | 0.937/0.948      | 0.852/0.865      | 0.777/0.727
MS-SSIM [8]      | 0.940/0.951      | 0.889/0.906      | 0.830/0.786
FSIMc [23]       | 0.961/0.965      | 0.919/0.931      | 0.877/0.851
VSI [22]         | 0.948/0.952      | 0.928/0.942      | 0.900/0.897
MAD [18]         | 0.968/0.967      | 0.950/0.947      | 0.827/0.781
VIF [19]         | 0.960/0.964      | 0.913/0.911      | 0.771/0.677
NLPD [17]        | 0.932/0.937      | 0.923/0.932      | 0.839/0.800
GMSD [21]        | 0.957/0.960      | 0.945/0.950      | 0.855/0.804
SCQI [55]        | 0.937/0.948      | 0.927/0.943      | 0.907/0.905
DOG-SSIMc [65]   | 0.966/0.963      | 0.943/0.954      | 0.934/0.926
DeepQA [27]      | 0.982/0.981      | 0.965/0.961      | 0.947/0.939
DualCNN [66]     | –/–              | –/–              | 0.924/0.926
WaDiQaM-FR [25]  | 0.980/0.970      | –/–              | 0.946/0.940
PieAPP [31]      | 0.986/0.977      | 0.975/0.973      | 0.946/0.945
JND-SalCAR [29]  | 0.987/0.984      | 0.977/0.976      | 0.956/0.949
AHIQ [59]        | 0.989/0.984      | 0.978/0.975      | 0.968/0.962
Our UniDASTN     | 0.989/0.987      | 0.987/0.983      | 0.977/0.974

Table 2. Performance Comparison on Same Dataset Experiment

Furthermore, to validate the effectiveness of the proposed UniDASTN in predicting the quality of images generated by GAN-based image restoration methods, we also compare our UniDASTN method with some state-of-the-art FR-IQA methods on the GAN-based synthetic image dataset, i.e., the PIPAL dataset. The PIPAL dataset contains many traditional distortion types of images and some distorted images generated by GAN-based image restoration methods. In this experiment, our UniDASTN is compared with a total of 19 FR-IQA methods, including 14 non-deep-learning-based methods and 5 deep-learning-based methods. The 14 non-deep-learning-based FR-IQA methods are PSNR [46], NQM [57], UQI [69], SSIM [35], MS-SSIM [8], RFSIM [68], GSM [61], SRSIM [60], FSIM [23], VSI [22], NIQE [63], MA [62], PI [56], and Brisque [64]. The five deep-learning-based methods are LPIPS-Alex [32], LPIPS-VGG [32], DISTS [30], IQT [42], and AHIQ [59]. All deep-learning-based methods are trained on the PIPAL training set, and the images from the validation set of PIPAL are used for the test.

As can be seen from Table 3, the best results are in bold, and our UniDASTN achieves the second-best result in the SRCC performance comparison and the third-best result in the PLCC performance comparison. Although our UniDASTN shows a slightly lower performance on the PIPAL dataset, our UniDASTN still outperforms most compared methods. This experiment demonstrates that our UniDASTN is effective in assessing the quality of GAN-based synthetic images.

Method           | PLCC  | SRCC
PSNR [46]        | 0.269 | 0.234
NQM [57]         | 0.364 | 0.302
UQI [69]         | 0.505 | 0.461
SSIM [35]        | 0.377 | 0.319
MS-SSIM [8]      | 0.119 | 0.338
RFSIM [68]       | 0.285 | 0.254
GSM [61]         | 0.450 | 0.379
SRSIM [60]       | 0.626 | 0.529
FSIM [23]        | 0.553 | 0.452
VSI [22]         | 0.493 | 0.411
NIQE [63]        | 0.129 | 0.012
MA [62]          | 0.097 | 0.099
PI [56]          | 0.134 | 0.064
Brisque [64]     | 0.052 | 0.008
LPIPS-Alex [32]  | 0.606 | 0.569
LPIPS-VGG [32]   | 0.611 | 0.551
DISTS [30]       | 0.634 | 0.608
IQT [42]         | 0.840 | 0.820
AHIQ [59]        | 0.845 | 0.835
Our UniDASTN     | 0.831 | 0.829

Table 3. Performance Comparison on PIPAL Validation Dataset

Evaluation on the cross-dataset. We conduct a cross-dataset experiment to evaluate the generality of our UniDASTN. In this experiment, the FR-IQA methods are trained on a single dataset and then tested on several other databases to validate their generalization ability in IQA.

As recommended by [30, 31, 32, 42], we use the KADID-10k dataset for training the deep-learning-based FR-IQA methods, and evaluate their performances on the full set of the CSIQ, the LIVE, and the TID2013. For non-deep-learning-based FR-IQA methods, we directly assess their performance.

In this experiment, our UniDASTN is compared with a total of 14 FR-IQA methods, including 9 non-deep-learning-based methods and 5 deep-learning-based methods. The nine non-deep-learning-based FR-IQA methods are PSNR [46], SSIM [35], MS-SSIM [8], VSI [22], MAD [18], VIF [19], FSIMc [23], NLPD [17], and GMSD [21]. The five deep-learning-based methods are DISTS [30], IQT [42], PieAPP [31], LPIPS-Alex [32], and LPIPS-VGG [32].

The PLCC, SRCC, and KRCC results of these methods are shown in Table 4, where the best results are in bold. In this cross-dataset experiment, it can be seen that our UniDASTN consistently outperforms all compared FR-IQA methods in terms of PLCC, SRCC, and KRCC. In particular, for TID2013, our UniDASTN shows a significant performance improvement compared to some state-of-the-art FR-IQA methods. This experiment illustrates that our UniDASTN achieves satisfactory generalization ability.

Method           | CSIQ (PLCC/SRCC/KRCC) | LIVE (PLCC/SRCC/KRCC) | TID2013 (PLCC/SRCC/KRCC)
PSNR [46]        | 0.796/0.806/0.608     | 0.872/0.876/0.687     | 0.702/0.639/0.470
SSIM [35]        | 0.861/0.876/0.690     | 0.945/0.948/0.796     | 0.790/0.742/0.559
MS-SSIM [8]      | 0.899/0.913/0.739     | 0.949/0.951/0.804     | 0.832/0.786/0.605
VSI [22]         | 0.928/0.942/0.786     | 0.948/0.952/0.805     | 0.900/0.897/0.718
MAD [18]         | 0.950/0.951/0.802     | 0.960/0.961/0.824     | 0.796/0.773/0.587
VIF [19]         | 0.928/0.920/0.754     | 0.960/0.964/0.828     | 0.772/0.677/0.515
FSIMc [23]       | 0.919/0.931/0.769     | 0.961/0.965/0.836     | 0.877/0.851/0.667
NLPD [17]        | 0.923/0.932/0.769     | 0.932/0.937/0.778     | 0.839/0.800/0.625
GMSD [21]        | 0.954/0.957/0.812     | 0.960/0.960/0.827     | 0.859/0.804/0.634
DISTS [30]       | 0.936/0.939/0.780     | 0.951/0.954/0.811     | 0.868/0.830/0.639
IQT [42]         | 0.943/0.939/0.799     | 0.970/0.968/0.849     | 0.899/0.891/0.717
PieAPP [31]      | 0.890/0.897/0.712     | 0.910/0.918/0.749     | 0.831/0.844/0.657
LPIPS-Alex [32]  | 0.893/0.928/0.757     | 0.900/0.922/0.751     | 0.787/0.769/0.568
LPIPS-VGG [32]   | 0.901/0.884/0.698     | 0.938/0.941/0.780     | 0.743/0.679/0.505
Our UniDASTN     | 0.969/0.967/0.827     | 0.976/0.973/0.861     | 0.917/0.905/0.724

Table 4. Performance Comparison on Cross-Dataset Experiment

In addition, it can be seen that the performances of some compared methods are not stable across different databases. For example, VSI [22] achieves good results in all metrics on the TID2013 database, but its performance on the CSIQ and LIVE databases degrades significantly. Similarly, the deep-learning-based method IQT [42] reaches good results for all metrics on the LIVE database, but its performance on the other two databases degrades slightly. In contrast, our UniDASTN achieves consistent performance across the three databases. This also illustrates the effectiveness of our UniDASTN in evaluating the quality of various distorted images.

Furthermore, Figure 5 displays scatter plots of subjective MOS values versus objective prediction scores for our UniDASTN and the 14 FR-IQA methods on the TID2013 database, as well as the fitted curves. In this figure, the x-axis is the objective prediction score generated by a specific FR-IQA method, the y-axis is the MOS value, and each point on the plot represents one image in the database. The straight line is fitted by simple linear regression. If the correlation is strong, the points in the graph are tightly clustered around the curve. If the correlation is weak, the points are dispersed around the curve. As can be seen from the Figure 5, the points of our UniDASTN are much closer to the fitted curve than those of the compared FR-IQA methods, implying a stronger consistency with the subjective scores.

Fig. 5. Scatter plots of different FR-IQA methods on the TID2013 database.

To analyze the statistical significance of our UniDASTN relative to the compared methods, we employ the left-tailed F-test [21, 53] to decide whether one FR-IQA method is statistically superior to another. We perform a series of hypothesis tests on the residuals between the subjective quality scores and the objective quality scores obtained from each FR-IQA method after nonlinear regression. Figures 6(a)–(c) show the statistical significance results on the LIVE, CSIQ, and TID2013 databases, respectively. In this figure, a value of H = 1 for the left-tailed F-test indicates that the FR-IQA method in that row is significantly superior to the method in the corresponding column at the 0.05 significance level. A value of H = 0 means that the FR-IQA method in that row is not significantly superior to the FR-IQA method in the corresponding column. The symbol "-" means that the two FR-IQA methods in the corresponding row and column are statistically indistinguishable. From the statistical significance results on the LIVE, CSIQ, and TID2013 databases, it can be seen that our UniDASTN significantly outperforms some state-of-the-art FR-IQA methods and achieves the best performance on the three databases.
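Below is a sketch of one common way to implement such a left-tailed variance-ratio F-test on the regression residuals; the exact procedure of [21, 53] may differ, so this is an assumption-labelled illustration rather than the authors' implementation.

```python
import numpy as np
from scipy import stats

def left_tailed_f_test(residuals_a, residuals_b, alpha=0.05):
    """Compare the residual variances of two FR-IQA methods (after nonlinear
    regression onto MOS) with a left-tailed F-test. Returns 1 if method A has a
    significantly smaller residual variance than method B at level alpha, else 0."""
    residuals_a = np.asarray(residuals_a)
    residuals_b = np.asarray(residuals_b)
    f_stat = np.var(residuals_a, ddof=1) / np.var(residuals_b, ddof=1)
    p_left = stats.f.cdf(f_stat, len(residuals_a) - 1, len(residuals_b) - 1)
    return int(p_left < alpha)
```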

Fig. 6. Statistical significance tests on different databases.

4.5 Ablation Study

To assess the effectiveness of the network components used in the UniDASTN, we perform a detailed performance analysis of networks with various network component combinations.

Effectiveness of our channel attention module. To assess the effectiveness of the proposed channel attention module, we perform an ablation study on the CSIQ and TID2013 datasets. The results are shown in Table 5, where the best results are in bold, and N1 and N2 represent our networks with different usage strategies of the channel attention module. Specifically, the N1 network does not use the channel attention module, and the N2 network uses it. From Table 5, we can see that the N2 network produces the best PLCC and SRCC outcomes. These experimental results demonstrate that the channel attention module plays a significant role in our UniDASTN, and its use results in more accurate quality prediction.

Network | Strategy | CSIQ (PLCC/SRCC) | TID2013 (PLCC/SRCC)
N1      | ×        | 0.983/0.979      | 0.975/0.971
N2      | √        | 0.987/0.983      | 0.977/0.974

Table 5. Comparison of the Use of Channel Attention Module

Effectiveness of our loss function. To assess the effectiveness of the proposed joint loss function, we perform an ablation study on the CSIQ and TID2013 datasets. The results are shown in Table 6, where the best results are in bold, and M1–M5 represent our networks trained with different loss functions. In this experiment, the M5 network with the MSE loss, the bidirectional KL divergence loss, and the rank loss produces the best PLCC and SRCC outcomes. These experimental results demonstrate that both the bidirectional KL divergence loss and the rank loss contribute to the performance improvement of our UniDASTN.

Network | MSE | BKL | Rank | CSIQ (PLCC/SRCC) | TID2013 (PLCC/SRCC)
M1      | √   |     |      | 0.979/0.972      | 0.971/0.967
M2      |     | √   |      | 0.973/0.974      | 0.968/0.967
M3      | √   | √   |      | 0.981/0.975      | 0.972/0.968
M4      | √   |     | √    | 0.975/0.977      | 0.970/0.969
M5      | √   | √   | √    | 0.987/0.983      | 0.977/0.974

Table 6. Comparison of Different Loss Functions

In addition, the typical visual results of PSNR, SSIM, DISTS, and our UniDASTN are demonstrated in Figure 7. It can be seen that the quality rank of the distorted image predicted by our UniDASTN is consistent with the MOS result. This verifies that our UniDASTN can distinguish distortion types with similar scores.

Fig. 7. For each distorted image, MOS represents the subjective rating, and the scores corresponding to each FR-IQA method represent its predicted results. The number in parentheses indicates the quality rank of the distorted image.

Effectiveness of our feature extraction backbone. To assess the effectiveness of different feature extraction backbones, we perform an ablation study on the CSIQ and TID2013 datasets. The results are shown in Table 7, where the best results are in bold, and M6–M8 represent our networks using different backbones. Specifically, the M6 network uses ResNet50 as the feature extraction backbone, the M7 network uses Inception-ResNet-V2, and the M8 network uses ViT-B\(/\)8. It can be seen that the M8 network achieves the best performance on both datasets. These experimental results demonstrate that the selection of ViT-B\(/\)8 is beneficial and effective for FR-IQA.

Network | Backbone            | CSIQ (PLCC/SRCC) | TID2013 (PLCC/SRCC)
M6      | ResNet50            | 0.844/0.839      | 0.772/0.769
M7      | Inception-ResNet-V2 | 0.966/0.961      | 0.958/0.950
M8      | ViT-B/8             | 0.987/0.983      | 0.977/0.974

Table 7. Comparison of Different Feature Extraction Backbones

Effectiveness of our feature fusion technique. To assess the effectiveness of our feature fusion technique, we perform an ablation study on the CSIQ and TID2013 datasets using five different feature fusion techniques. Figure 8 shows the five techniques used for feature fusion, where Scheme (a) is the fusion technique reported in [43, 67], Scheme (b) is the fusion technique used in [58], Scheme (c) is the proposed feature fusion block, Scheme (d) is derived from Scheme (c) by adding the average pooling operation, and Scheme (e) is derived from Scheme (c) by replacing the operation of “concatenate” with “add”. Note that for the fusion techniques presented in Scheme (a) and Scheme (b), their outputs are the quality scores. These experimental results are provided in Table 8.

Fig. 8. Different fusion techniques, where (a) and (b) are the reported techniques, (c) is the proposed feature fusion module, (d) is derived from (c) by adding the pooling operation, and (e) is derived from (c) by replacing the operation of “concatenate” with “add”.

Scheme     | CSIQ (PLCC/SRCC) | TID2013 (PLCC/SRCC)
Scheme (a) | 0.984/0.980      | 0.973/0.971
Scheme (b) | 0.980/0.975      | 0.970/0.966
Scheme (c) | 0.987/0.983      | 0.977/0.974
Scheme (d) | 0.983/0.980      | 0.972/0.969
Scheme (e) | 0.968/0.966      | 0.949/0.953

Table 8. Comparison of Different Fusion Techniques

Table 8 shows that the fusion technique of Scheme (c) achieves the best PLCC and SRCC results. Comparing Scheme (a) and Scheme (c), an additional Transformer encoder used to fuse the extracted features improves the prediction performance. This is because the additional Transformer encoder can act as an aggregator that aggregates the information of every token. Moreover, comparing Scheme (c) with Scheme (d) and with Scheme (e), if the pooling operation is added or the "concatenate" operation is replaced with "add", the information of these tokens is inevitably lost and the FR-IQA performance decreases. These experimental results demonstrate that our feature fusion technique results in more accurate quality prediction.

Effectiveness of our Siamese Transformer network. To assess the effectiveness of our Siamese Transformer network, we perform an ablation study on the CSIQ and TID2013 datasets. The results are shown in Table 9, where the best results are in bold, and Schemes 1–3 represent different input combinations used in the spatial attention module. There are three types of inputs used in our Siamese Transformer network, namely \(z_{0}^{ref}, z_{0}^{dist},\) and \(z_{0}^{diff}.\) If the number of inputs decreases to two (i.e., only \(z_{0}^{ref}\) and \(z_{0}^{diff}\), or \(z_{0}^{dist}\) and \(z_{0}^{diff}\), are input to the spatial attention module), there are two possible schemes. Specifically, Scheme 1 selects \(z_{0}^{ref}\) and \(z_{0}^{diff}\) as the inputs of the spatial attention module, and Scheme 2 uses \(z_{0}^{dist}\) and \(z_{0}^{diff}\). This means that, for Schemes 1 and 2, only one branch of the Siamese Transformer network is used in the spatial attention module. For comparison, the proposed scheme is denoted as Scheme 3, in which \(z_{0}^{ref}, z_{0}^{dist},\) and \(z_{0}^{diff}\) are all selected as the inputs of the spatial attention module. From Table 9, it can be seen that Scheme 3 achieves the best results. These experimental results demonstrate that the features extracted by the two branches of our Siamese Transformer network are highly complementary: Schemes 1 and 2 use the features of only one branch, whereas Scheme 3 uses the features of both branches, which helps to achieve good overall IQA performance.

Scheme | \(z_{0}^{ref}\) | \(z_{0}^{dist}\) | \(z_{0}^{diff}\) | CSIQ (PLCC/SRCC) | TID2013 (PLCC/SRCC)
1      | √               |                  | √                | 0.983/0.978      | 0.963/0.965
2      |                 | √                | √                | 0.986/0.981      | 0.964/0.963
3      | √               | √                | √                | 0.987/0.983      | 0.977/0.974

Table 9. Comparison of Different Input Numbers of Spatial Attention Module

To facilitate understanding of the attention mechanism of the Transformer [70], some visual examples of our Siamese Transformer network are shown in Figure 9, where three distorted images and their corresponding attention maps of different heads are presented. In this experiment, the attention map of an image is obtained by averaging all attention weights in the self-attention module and resizing the result to the original image size. It can be found that different heads have different preferences in their attention maps. Through the collaboration of these heads, our UniDASTN can simulate the HVS to make an accurate quality prediction. Furthermore, visual examples of the two branches \(dist\sim diff\) (upper branch) and \(ref\sim diff\) (lower branch) of our Siamese Transformer network are also illustrated. Figure 10 presents the attention maps of Head 5, showing the attention map of the upper branch, the attention map of the lower branch, and the combined attention map of the two branches. It can be observed that the attentions of the two branches are spatially placed in different regions. This illustrates that the use of both branches can help in perceiving image quality.

Fig. 9. Visualization of attention maps.

Fig. 10. Attention maps of different branches.

Effectiveness of the inputs of the Siamese Transformer network. To assess the effectiveness of our strategy for the input types, we perform an ablation study on the CSIQ and TID2013 datasets. The results are shown in Table 10, where the best results are in bold. Note that three embedding features are fed into the Siamese Transformer network, namely \(z_{0}^{ref}, z_{0}^{dist},\) and \(z_{0}^{diff}\). Therefore, there are three possible input schemes in total. Specifically, for Scheme 4, \(z_{0}^{diff}\) is input to the Transformer encoder, and \(z_{0}^{dist}\) and \(z_{0}^{ref}\) are input to the Transformer decoders. For Scheme 5, \(z_{0}^{ref}\) is input to the Transformer encoder, and \(z_{0}^{dist}\) and \(z_{0}^{diff}\) are input to the Transformer decoders. For Scheme 6, \(z_{0}^{dist}\) is input to the Transformer encoder, and \(z_{0}^{ref}\) and \(z_{0}^{diff}\) are input to the Transformer decoders. It can be seen that Scheme 4 achieves three best results and one second-best result, and is therefore better than Scheme 5 and Scheme 6. These experimental results demonstrate that, as the input of the Transformer encoder, the embedding features of the "perceptual differences" are preferable to the reference or distortion embedding features. Aggregating the embedding features of the "perceptual differences" with the Transformer encoder helps to improve IQA performance.

Scheme | Encoder input | Decoder inputs | CSIQ (PLCC/SRCC) | TID2013 (PLCC/SRCC)
4 | \(z_{0}^{diff}\) | \(z_{0}^{dist}\), \(z_{0}^{ref}\) | 0.987/0.983 | 0.977/0.974
5 | \(z_{0}^{ref}\) | \(z_{0}^{dist}\), \(z_{0}^{diff}\) | 0.987/0.984 | 0.970/0.960
6 | \(z_{0}^{dist}\) | \(z_{0}^{ref}\), \(z_{0}^{diff}\) | 0.986/0.981 | 0.972/0.969

Table 10. Comparison of Different Input Types of Spatial Attention Module
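
To make the Scheme 4 arrangement concrete, the following is a minimal sketch in which the difference embeddings are aggregated by a transformer encoder and two weight-sharing decoder branches, driven by the distortion and reference embeddings, attend to the encoded differences (the \(dist\sim diff\) and \(ref\sim diff\) branches described above). It is built on the standard PyTorch transformer modules; the class name SiameseTransformerSketch, the layer counts, and the averaging fusion and linear regression head at the end are assumptions for illustration and do not reproduce the exact UniDASTN configuration.

```python
import torch
import torch.nn as nn

class SiameseTransformerSketch(nn.Module):
    """Scheme-4 style arrangement: z_diff -> encoder; z_dist and z_ref -> two decoder branches."""
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        # Siamese idea: a single decoder with shared weights serves both branches.
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=layers)
        self.head = nn.Linear(dim, 1)  # regress a quality score (illustrative)

    def forward(self, z_ref, z_dist, z_diff):
        memory = self.encoder(z_diff)        # aggregate the "perceptual difference" embeddings
        up = self.decoder(z_dist, memory)    # dist ~ diff branch
        down = self.decoder(z_ref, memory)   # ref ~ diff branch
        # Illustrative fusion: average the token features of both branches.
        fused = torch.cat([up, down], dim=1).mean(dim=1)
        return self.head(fused).squeeze(-1)

# Example with random patch embeddings (batch of 2, 196 tokens, 256-dim).
model = SiameseTransformerSketch()
z_ref, z_dist, z_diff = (torch.randn(2, 196, 256) for _ in range(3))
print(model(z_ref, z_dist, z_diff).shape)  # torch.Size([2])
```

Swapping which embedding feeds the encoder reproduces Schemes 5 and 6 of Table 10.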


5 CONCLUSIONS

We have proposed UniDASTN, a novel method for FR-IQA. A key contribution is the spatial attention module composed of a Siamese transformer network and a feature fusion block. It can focus on significant regions and thus effectively maps the perceptual differences between the reference and distorted images to a latent distance for distortion evaluation. The second contribution is the dual-attention strategy, which exploits a squeeze-and-excitation network and the proposed spatial attention module to extract more significant features for accurate IQA. The third contribution is the new joint loss function that combines the MSE, bidirectional KL divergence, and rank order of quality scores. It provides stable training and thus enables the proposed UniDASTN to learn visual perceptual image quality effectively. Extensive experiments on standard IQA databases have been conducted to verify the IQA performance of the proposed UniDASTN. The comparison results demonstrate that the proposed UniDASTN outperforms some state-of-the-art FR-IQA methods on the LIVE, CSIQ, TID2013, and PIPAL databases according to three well-known IQA metrics.
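
As a rough illustration of how the three named loss components could be combined, the sketch below sums an MSE term, a bidirectional (symmetric) KL divergence between softmax-normalized score distributions, and a pairwise rank hinge. The function name joint_iqa_loss, the equal weighting, the softmax normalization, and the hinge formulation are assumptions for illustration; the exact definitions and weights used by UniDASTN may differ.

```python
import torch
import torch.nn.functional as F

def joint_iqa_loss(pred, target, w_kl=1.0, w_rank=1.0, margin=0.0):
    """Illustrative joint objective: MSE + bidirectional KL + pairwise rank loss.

    pred, target: 1-D tensors of predicted and ground-truth quality scores for a batch.
    """
    # 1) MSE between predicted and ground-truth scores.
    mse = F.mse_loss(pred, target)

    # 2) Bidirectional KL divergence between softmax-normalized score distributions.
    p = F.log_softmax(pred, dim=0)
    q = F.log_softmax(target, dim=0)
    kl = F.kl_div(p, q, log_target=True, reduction="batchmean") + \
         F.kl_div(q, p, log_target=True, reduction="batchmean")

    # 3) Pairwise rank hinge: penalize pairs whose predicted order disagrees with the target order.
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)      # predicted pairwise differences
    dt = target.unsqueeze(0) - target.unsqueeze(1)  # ground-truth pairwise differences
    rank = F.relu(margin - dp * torch.sign(dt)).mean()

    return mse + w_kl * kl + w_rank * rank

# Example with a random batch of eight scores.
pred, target = torch.rand(8), torch.rand(8)
print(joint_iqa_loss(pred, target).item())
```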


ACKNOWLEDGMENTS

Many thanks to the reviewers for their helpful suggestions.

REFERENCES

[1] Feng Ping and Tang Zhenjun. 2023. A survey of visual neural networks: Current trends, challenges and opportunities. Multimedia Systems 29, 2 (2023), 693–724.
[2] Zhai Guangtao and Min Xiongkuo. 2020. Perceptual image quality assessment: A survey. Science China Information Sciences 63, 11 (2020), 1–52.
[3] Henderson John M. 2003. Human gaze control during real-world scene perception. Trends in Cognitive Sciences 7, 11 (2003), 498–504.
[4] Zhai Guangtao, Sun Wei, Min Xiongkuo, and Zhou Jiantao. 2021. Perceptual quality assessment of low-light image enhancement. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4 (2021), Article 130, 1–24.
[5] Shi Ran, Ma Jing, Ngan King Ngi, Xiong Jian, and Qiao Tong. 2022. Objective object segmentation visual quality evaluation: Quality measure and pooling method. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 3 (2022), Article 73, 1–19.
[6] Chen Chenglizhao, Zhao Hongmeng, Yang Huan, Yu Teng, Peng Chong, and Qin Hong. 2021. Full-reference screen content image quality assessment by fusing multilevel structure similarity. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 3 (2021), Article 94, 1–21.
[7] Liang Zhiyuan, Chen Yihua, Chen Xiaoping, and Tang Zhenjun. 2022. Multi-level feature aggregation network for full-reference image quality assessment. In Proceedings of the IEEE 34th International Conference on Tools with Artificial Intelligence. 861–867.
[8] Wang Zhou, Simoncelli Eero P., and Bovik Alan Conrad. 2003. Multiscale structural similarity for image quality assessment. In Proceedings of the 37th Asilomar Conference on Signals, Systems & Computers, Vol. 2. 1398–1402.
[9] Huang Ziqing and Liu Shiguang. 2021. Perceptual hashing with visual content understanding for reduced-reference screen content image quality assessment. IEEE Transactions on Circuits and Systems for Video Technology 31, 7 (2021), 2808–2823.
[10] Tang Zhenjun, Huang Ziqing, Yao Heng, Zhang Xianquan, Chen Lv, and Yu Chunqiang. 2018. Perceptual image hashing with weighted DWT features for reduced-reference image quality assessment. Computer Journal 61, 11 (2018), 1695–1709.
[11] Yu Mengzhu, Tang Zhenjun, Zhang Xianquan, Zhong Bineng, and Zhang Xinpeng. 2022. Perceptual hashing with complementary color wavelet transform and compressed sensing for reduced-reference image quality assessment. IEEE Transactions on Circuits and Systems for Video Technology 32, 11 (2022), 7559–7574.
[12] Liu Yutao, Gu Ke, Li Xiu, and Zhang Yongbing. 2020. Blind image quality assessment by natural scene statistics and perceptual characteristics. ACM Transactions on Multimedia Computing, Communications, and Applications 16, 3 (2020), Article 91, 1–20.
[13] Yan Chenggang, Teng Tong, Liu Yutao, Zhang Yongbing, Wang Haoqian, and Ji Xiangyang. 2021. Precise no-reference image quality evaluation based on distortion identification. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 3 (2021), Article 110, 1–21.
[14] Zhang Chaofan, Huang Ziqing, Liu Shiguang, and Xiao Jian. 2022. Dual-channel multi-task CNN for no-reference screen content image quality assessment. IEEE Transactions on Circuits and Systems for Video Technology 32, 8 (2022), 5011–5025.
[15] Wang Zhou and Bovik Alan Conrad. 2009. Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine 26, 1 (2009), 98–117.
[16] Gu Ke, Li Leida, Lu Hong, Min Xiongkuo, and Lin Weisi. 2017. A fast reliable image quality predictor by fusing micro- and macro-structures. IEEE Transactions on Industrial Electronics 64, 5 (2017), 3903–3912.
[17] Laparra Valero, Balle Johannes, Berardino Alexander, and Simoncelli Eero P. 2016. Perceptual image quality assessment using a normalized Laplacian pyramid. In Proceedings of the Human Vision and Electronic Imaging. 43–48.
[18] Larson Eric Cooper and Chandler Damon Michael. 2010. Most apparent distortion: Full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging 19, 1 (2010), Article 011006.
[19] Sheikh Hamid Rahim and Bovik Alan Conrad. 2006. Image information and visual quality. IEEE Transactions on Image Processing 15, 2 (2006), 430–444.
[20] Sheikh Hamid Rahim, Bovik Alan Conrad, and Veciana Gustavo De. 2005. An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Transactions on Image Processing 14, 12 (2005), 2117–2128.
[21] Xue Wufeng, Zhang Lei, Mou Xuanqin, and Bovik Alan Conrad. 2013. Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Transactions on Image Processing 23, 2 (2013), 684–695.
[22] Zhang Lin, Shen Ying, and Li Hongyu. 2014. VSI: A visual saliency-induced index for perceptual image quality assessment. IEEE Transactions on Image Processing 23, 10 (2014), 4270–4281.
[23] Zhang Lin, Zhang Lei, Mou Xuanqin, and Zhang David. 2011. FSIM: A feature similarity index for image quality assessment. IEEE Transactions on Image Processing 20, 8 (2011), 2378–2386.
[24] Kang Le, Ye Peng, Li Yi, and Doermann David. 2014. Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1733–1740.
[25] Bosse Sebastian, Maniry Dominique, Müller Klaus-Robert, Wiegand Thomas, and Samek Wojciech. 2017. Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing 27, 1 (2017), 206–219.
[26] Bosse Sebastian, Maniry Dominique, Wiegand Thomas, and Samek Wojciech. 2016. A deep neural network for image quality assessment. In Proceedings of the IEEE International Conference on Image Processing. 3773–3777.
[27] Kim Jongyoo and Lee Sanghoon. 2017. Deep learning of human visual sensitivity in image quality assessment framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1676–1684.
[28] Ahn Sewoong, Choi Yeji, and Yoon Kwangjin. 2021. Deep learning-based distortion sensitivity prediction for full-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 344–353.
[29] Seo Soomin, Ki Sehwan, and Kim Munchurl. 2021. A novel just-noticeable-difference-based saliency-channel attention residual network for full-reference image quality predictions. IEEE Transactions on Circuits and Systems for Video Technology 31, 7 (2021), 2602–2616.
[30] Ding Keyan, Ma Kede, Wang Shiqi, and Simoncelli Eero P. 2022. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 5 (2022), 2567–2581.
[31] Prashnani Ekta, Cai Hong, Mostofi Yasamin, and Sen Pradeep. 2018. PieAPP: Perceptual image-error assessment through pairwise preference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1808–1817.
[32] Zhang Richard, Isola Phillip, Efros Alexei A., Shechtman Eli, and Wang Oliver. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 586–595.
[33] Goodfellow Ian J., Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, and Bengio Yoshua. 2014. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Vol. 2. 2672–2680.
[34] Gu Jinjin, Cai Haoming, Chen Haoyu, Ye Xiaoxing, Ren Jimmy S., and Dong Chao. 2020. PIPAL: A large-scale image quality assessment dataset for perceptual image restoration. In Proceedings of the European Conference on Computer Vision. 633–651.
[35] Wang Zhou, Bovik Alan Conrad, Sheikh Hamid Rahim, and Simoncelli Eero P. 2004. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612.
[36] Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Łukasz, and Polosukhin Illia. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6000–6010.
[37] Khan Salman, Naseer Muzammal, Hayat Munawar, Zamir Syed Waqas, Khan Fahad Shahbaz, and Shah Mubarak. 2022. Transformers in vision: A survey. ACM Computing Surveys 54, 10 (2022), Article 200, 1–41.
[38] Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, Uszkoreit Jakob, and Houlsby Neil. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations.
[39] He Kaiming, Chen Xinlei, Xie Saining, Li Yanghao, Dollár Piotr, and Girshick Ross. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16000–16009.
[40] Liu Ze, Hu Han, Lin Yutong, Yao Zhuliang, Xie Zhenda, Wei Yixuan, Ning Jia, Cao Yue, Zhang Zheng, Dong Li, Wei Furu, and Guo Baining. 2022. Swin transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12009–12019.
[41] Wang Zhendong, Cun Xiaodong, Bao Jianmin, Zhou Wengang, Liu Jianzhuang, and Li Houqiang. 2022. Uformer: A general U-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17683–17693.
[42] Cheon Manri, Yoon Sung-Jun, Kang Byungyeon, and Lee Junwoo. 2021. Perceptual image quality assessment with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 433–442.
[43] Golestaneh S. Alireza, Dadsetan Saba, and Kitani Kris M. 2022. No-reference image quality assessment via transformers, relative ranking, and self-consistency. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 3989–3999.
[44] Ke Junjie, Wang Qifei, Wang Yilin, Milanfar Peyman, and Yang Feng. 2021. MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5148–5157.
[45] You Junyong and Korhonen Jari. 2021. Transformer for image quality assessment. In Proceedings of the IEEE International Conference on Image Processing. 1389–1393.
[46] Avcibas Ismail, Sankur Bulent, and Sayood Khalid. 2002. Statistical evaluation of image quality measures. Journal of Electronic Imaging 11, 2 (2002), 206–223.
[47] Ma Siwei, Gao Wen, and Lu Yan. 2005. Rate-distortion analysis for H.264/AVC video coding and its application to rate control. IEEE Transactions on Circuits and Systems for Video Technology 15, 12 (2005), 1533–1544.
[48] Liang Jian, Hu Dapeng, Feng Jiashi, and He Ran. 2022. DINE: Domain adaptation from single and multiple black-box predictors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7993–8003.
[49] Benesty Jacob, Chen Jingdong, Huang Yiteng, and Cohen Israel. 2009. Pearson Correlation Coefficient. Springer, 1–4.
[50] Zar Jerrold H. 2005. Spearman rank correlation. In Encyclopedia of Biostatistics, Peter Armitage and Theodore Colton (Eds.). Vol. 7, Wiley, Hoboken, NJ.
[51] Abdi Hervé. 2007. The Kendall rank correlation coefficient. In Encyclopedia of Measurement and Statistics, Neil J. Salkind (Ed.). Sage, 508–510.
[52] Ponomarenko Nikolay, Jin Lina, Ieremeiev Oleg, Lukin Vladimir, Egiazarian Karen, Astola Jaakko, Vozel Benoit, Chehdi Kacem, Carli Marco, Battisti Federica, and Kuo Chung-Chieh Jay. 2015. Image database TID2013: Peculiarities, results and perspectives. Signal Processing: Image Communication 30 (2015), 57–77.
[53] Sheikh Hamid Rahim, Sabir Muhammad F., and Bovik Alan Conrad. 2006. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing 15, 11 (2006), 3440–3451.
[54] Lin Hanhe, Hosu Vlad, and Saupe Dietmar. 2019. KADID-10k: A large-scale artificially distorted IQA database. In Proceedings of the 11th International Conference on Quality of Multimedia Experience. 1–3.
[55] Bae Sung Ho and Kim Munchurl. 2016. A novel image quality assessment with globally and locally consilient visual quality perception. IEEE Transactions on Image Processing 25, 5 (2016), 2392–2406.
[56] Blau Yochai and Michaeli Tomer. 2018. The perception-distortion tradeoff. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6228–6237.
[57] Damera-Venkata Niranjan, Kite Thomas D., Geisler Wilson S., Evans Brian L., and Bovik Alan Conrad. 2000. Image quality assessment based on a degradation model. IEEE Transactions on Image Processing 9, 4 (2000), 636–650.
[58] Guo Haiyang, Bin Yi, Hou Yuqing, Zhang Qing, and Luo Hengliang. 2021. IQMA network: Image quality multi-scale assessment network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 443–452.
[59] Lao Shanshan, Gong Yuan, Shi Shuwei, Yang Sidi, Wu Tianhe, Wang Jiahao, Xia Weihao, and Yang Yujiu. 2022. Attentions help CNNs see better: Attention-based hybrid image quality assessment network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1140–1149.
[60] Zhang Lin and Li Hongyu. 2012. SR-SIM: A fast and high performance IQA index based on spectral residual. In Proceedings of the 19th IEEE International Conference on Image Processing. 1473–1476.
[61] Liu Anmin, Lin Weisi, and Narwaria Manish. 2012. Image quality assessment based on gradient similarity. IEEE Transactions on Image Processing 21, 4 (2012), 1500–1512.
[62] Ma Chao, Yang Chih-Yuan, Yang Xiaokang, and Yang Ming-Hsuan. 2017. Learning a no-reference quality metric for single-image super-resolution. Computer Vision and Image Understanding 158 (2017), 1–16. https://www.sciencedirect.com/science/article/abs/pii/S107731421630203X.
[63] Mittal Anish, Soundararajan Rajiv, and Bovik Alan Conrad. 2013. Making a "Completely Blind" image quality analyzer. IEEE Signal Processing Letters 20, 3 (2013), 209–212.
[64] Mittal Anish, Moorthy Anush K., and Bovik Alan Conrad. 2011. Blind/referenceless image spatial quality evaluator. In Proceedings of the 45th Asilomar Conference on Signals, Systems and Computers. 723–727.
[65] Pei Soo-Chang and Chen Li-Heng. 2015. Image quality assessment using human visual DOG model fused with random forest. IEEE Transactions on Image Processing 24, 11 (2015), 3282–3292.
[66] Varga Domonkos. 2020. Composition-preserving deep approach to full-reference image quality assessment. Signal, Image and Video Processing 14, 6 (2020), 1265–1272.
[67] Zhang Hainan, Meng Fang, and Han Yawen. No-reference image quality assessment based on a multi-feature extraction network. In Proceedings of the 2nd International Conference on Image, Video and Signal Processing. 81–85.
[68] Zhang Lin, Zhang Lei, and Mou Xuanqin. 2010. RFSIM: A feature based image quality assessment metric using Riesz transforms. In Proceedings of the IEEE International Conference on Image Processing. 321–324.
[69] Wang Zhou and Bovik Alan Conrad. 2002. A universal image quality index. IEEE Signal Processing Letters 9, 3 (2002), 81–84.
[70] He Yangji, Liang Weihan, Zhao Dongyang, Zhou Hong-Yu, Ge Weifeng, Yu Yizhou, and Zhang Wenqiang. 2022. Attribute surrogates learning and spectral tokens pooling in transformers for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9119–9129.
