
Open Access 01-02-2024

RII-GAN: Multi-scaled Aligning-Based Reversed Image Interaction Network for Text-to-Image Synthesis

Authors: Haofei Yuan, Hongqing Zhu, Suyi Yang, Ziying Wang, Nan Wang

Published in: Neural Processing Letters | Issue 1/2024

Abstract

The text-to-image (T2I) model based on a single-stage generative adversarial network (GAN) has achieved significant success in recent years. However, GAN-based generation models have two disadvantages. First, the generator does not introduce any image feature manifold structure, which makes it challenging to align image and text features. Second, the diversity of images and the abstraction of text prevent the model from learning the actual image distribution. This paper proposes a reversed image interaction generative adversarial network (RII-GAN), which consists of four components: a text encoder, a reversed image interaction network (RIIN), an adaptive affine-based generator, and a dual-channel feature alignment discriminator (DFAD). RIIN indirectly introduces the actual image distribution into the generation network, thus overcoming the network's lack of learning of the actual image feature manifold structure and generating the distribution of text-matching images. Each adaptive affine block (AAB) in the proposed affine-based generator adaptively enhances text information, establishing an updated relation between originally independent fusion blocks and the image feature. Moreover, this study designs a DFAD to capture important feature information of images and text in two channels. Such a dual-channel backbone improves semantic consistency by utilizing a synchronized bi-modal information extraction structure. Experiments on publicly available datasets demonstrate the effectiveness of our model.

1 Introduction

Text-to-image (T2I) synthesis is an essential task involving natural language processing and computer vision, which aims to generate semantically consistent images from given text descriptions. Most existing T2I frameworks utilize generative adversarial networks (GANs) [5] to synthesize images. Although GANs have been widely recognized for generating visually realistic images, GAN-based T2I models still face several challenges.
Firstly, the training instability of GAN leads to mode collapse and a lack of diversity in the generated images. To address these issues, recent approaches have developed stacked GANs that generate multi-scale images in a low-to-high resolution manner to supervise image synthesis, such as AttnGAN [32], DM-GAN [41], and SD-GAN [35]. These methods have proven effective in synthesizing high-resolution images and preventing mode collapse. However, in such a multi-stage structure, each generator refines the previous stage's results by improving the image resolution and enriching details. Hence, the final output depends heavily on the image generated in the initial step. To this end, Tao et al. [29] explored a simplified single-stage end-to-end generative model to optimize the training trajectory of GANs.
However, efficiently aligning features between image and text presents a significant challenge. Some methods, such as MirrorGAN [19] and SuperGAN [3], utilize an extra inverse inference model to caption generated images. This architecture enhances the generator's capacity to generate semantically consistent images by effectively capturing visual-textual consistency information throughout training. Although this strategy is straightforward, it has certain drawbacks. Notably, it necessitates pre-training an image-to-text (I2T) model, whose significant size and complexity make it challenging to train compared with end-to-end models.
Another challenge is effectively implementing cross-modal transformation in the presence of a large semantic gap between the text and image domains. Recently, some researchers have considered two strategies, cross-modal attention [32] and affine transformation [29], to alleviate this problem. For the former, cross-modal attention calculation usually occurs between the word embedding and the current layer feature map. Therefore, expensive computing costs hinder the model from generating higher-resolution images and performing multi-scale deep fusion. For the latter, the affine transformation is a conditional batch normalization (CBN) [30] layer, which was first introduced to the T2I task in SD-GAN [35]. However, it did not perform multi-scale deep fusion due to the limitations of the stacked network architecture. Based on this work, Tao et al. [29] used affine transformation from sentence embedding to image to achieve multi-scale deep fusion, in which each fusion module is merged independently at different scales. As we know, too many independent modules will inevitably lead to training conflicts. In addition, the fusion module may cause overlapping or missing input text information because it does not accept the forward-computed features from the generator and lacks self-adaptability when selecting input text information.
The last challenge is that if the T2I model lacks an understanding of the image feature manifold structure and relies only on the game between the generator and the discriminator to learn the real image distribution, it may converge slowly.
This paper proposes a novel end-to-end GAN-based T2I model called reversed image interaction GAN (RII-GAN), which has a single-stage generation structure. RII-GAN indirectly introduces the authentic image feature manifold structure into the generator using feature alignment techniques in the proposed reversed image interaction network (RIIN). Specifically, the architecture of RIIN is equivalent to the reversed feature transformation of the generator, and the feature alignment constraint is applied at multiple scales. Figure 1 shows the framework of a conventional GAN-based T2I network and the proposed RII-GAN with the reversed network. RIIN is a symmetrical counterpart to the generator's low-to-high resolution hierarchical structure, and it is both lightweight and adaptable. Discarding the resource-consuming strategy of feature alignment within the individual text or image domains, it operates within the intermediate domains between text and image throughout the generation process. This reversed interaction mechanism reduces the generator's reliance on text descriptions and provides it with a supplementary pathway to access image features during training. Meanwhile, RIIN, which begins with a weaker capacity and progressively strengthens alongside the generator and discriminator, is optimized mutually with the generator, adapts to the intermediate mixed domain, and dynamically evolves throughout the adversarial training process. Moreover, we propose a novel adaptive affine-based generator composed of adaptive affine blocks (AABs) at different resolutions as text-image fusion blocks. An AAB enhances text information before the affine transformation. It includes an adaptive block that adapts the input text embedding by driving the information exchange between the text feature information and the feature maps from the generator's current forward computation, avoiding overlapping and missing text information. AAB further alleviates the semantic gap between the text and image domains by rebuilding the text embeddings. Finally, we correspondingly enhance the discriminator structure to facilitate adversarial training for the generator, whose generation capability is strengthened by the introduced reversed interaction scheme. Therefore, this study proposes the dual-channel feature alignment discriminator (DFAD). It achieves simultaneous refinement of text and image features through bi-modal feature extraction in two channels and guides the discriminator to extract semantically consistent bi-modal features through feature alignment operations. Distinct from mainstream conditional discriminators that concatenate features, the proposed DFAD significantly enhances the processing capability of bi-modal features. Consequently, it better aligns image and text features, substantially improving semantic consistency.
The main contributions of this work are summarized as follows.
  • We propose a reversed image interaction network to help the generator learn the authentic image feature manifold structure through multi-scaled feature alignment constraints. This alleviates the reliance of the generator on input text description and provides an alternative generation scheme for GAN-based multimodal generation tasks.
  • We explore an adaptive affine-based generator whose adaptive affine blocks (AABs) introduce an adaptive block and establish an adaptive updating scheme for text-image fusion.
  • We propose a dual-channel feature alignment discriminator, which allows simultaneous extraction of textual and visual features in each modality and implements an advanced feature alignment strategy to improve semantic consistency.

2 Related Work

2.1 GAN-Based Text-to-Image Synthesis

Reed et al. [22] first demonstrated that conditional GAN (cGAN) [15] can synthesize credible images from text descriptions. To generate \(256\times 256\) resolution images, Zhang et al. [36] developed a stacked-structure GAN (StackGAN) with several generator-discriminator pairs. Xu et al. [32] extended it and introduced cross-modal spatial attention to calculate the attention matrix for text-to-image fusion. Zhu et al. [41] explored a dynamic memory scheme for improving the T2I fusion process. Li et al. [10] introduced channel attention to explore the relationship between image feature channels and word context vectors. To improve the semantic consistency between the generated images and text, Yin et al. [35] and Tan et al. [27] utilized Siamese structures with contrastive learning losses. MirrorGAN [19] leverages an additional inverse inference model to caption the generated images, enhancing the generator's ability to create semantically consistent images by capturing visual-textual consistency. However, it requires pre-training of the I2T model; due to its substantial size and complexity, the method is relatively challenging to train and potentially less efficient than end-to-end models. To improve the generation quality, Yang et al. [33] explored the semantic commonality between different sentences describing the same image by utilizing multiple single-sentence generation and multi-sentence discrimination (SGMD) modules. Considering the properties of image complexity and text generality, Tan et al. [28] designed a regularized GAN framework to distinguish the critical and unimportant information of generated images, making the network more inclined to focus on the necessary semantic parts of the feature map.
Recently, some works have simplified the stacked structure into a single generator-discriminator pair. For example, Zhang et al. [38] chose a one-stage generative structure as the backbone for XCM-GAN, which employs five contrastive learning pairs between image, sentence, and word embeddings. Tao et al. [29] also chose a single-stage GAN-based T2I model to achieve text-image fusion; they employed sentence embedding as conditional input, making fusion more effective. To make the model pay more attention to semantically focused areas, Liao et al. [12], inspired by [29], introduced weakly-supervised mask prediction into the fusion module. In [11], Li et al. provided the image feature manifold structure directly to the generator via a memory bank that stores authentic image feature maps in advance. Unlike them, the proposed RII-GAN improves the quality of generated images by exploiting a combination of a simple single-stage generation structure and a new reversed image interaction mechanism.

2.2 Text-to-Image Fusion

Early T2I models achieved text-to-image fusion by directly concatenating text and image feature maps. Subsequently, works such as AttnGAN [32] and StackGAN++ [37] used cross-modal attention mechanisms to improve the fusion effect. However, the introduction of an attention module inevitably increases computational consumption. To solve this problem, Yin et al. [35] introduced conditional batch normalization (CBN) into text-image fusion, which fuses text and image features by disentangling the text embedding and performing an affine transformation. Afterward, Tao et al. [29] proposed a deep fusion module, which achieves excellent results by adopting sentence embeddings for multi-scale fusion. Liao et al. [12] used masks to pay more attention to the generation of crucial feature regions, thus improving the fusion effect. Ye et al. [34] found that mutually independent fusion modules conflict with each other during training. To address this issue, they use a bidirectional long short-term memory network (BiLSTM) [25] to establish long-term dependency among the independent fusion blocks, which helps to reduce the difficulty of model training. Our framework introduces an adaptive affine transformation, which also establishes long-term dependency among fusion blocks while augmenting textual information by adaptively considering the feature maps computed forward by the current generator.

2.3 Feature Alignment in Text and Image

Image-text and image-image feature alignment schemes are a research focus of T2I generation tasks. Xu et al. [32] utilized a BiLSTM [25] and the Inception-v3 model [26] to build the deep attentional multimodal similarity model (DAMSM), which maps text embeddings and image features to the same dimension and performs feature alignment to improve image-text consistency. To reduce the randomness of the generated images, Li et al. [10] developed a perceptual loss to align the features of the generated and authentic images. Yin et al. [35] found it challenging for GAN-based T2I models to distinguish the subtle differences between different representations of the same image. Therefore, they use a Siamese structure to align different generated images of the same reference object with their corresponding text descriptions. Zhang et al. [38] improved their work using five pairs of contrastive learning. Afterward, Tan et al. [28] argued that the key to generating high-quality images is whether the generator can identify critical features in the forward computation process. Therefore, they used a semantic disentangling module (SDM) to disentangle the generated foreground image features and align them with real image features. We propose a novel RIIN for multi-scale image feature alignment with the generator. It can provide rich image distribution information to the generator while enhancing semantic consistency and mitigating mode collapse.

3 Methodology

The proposed reversed interaction scheme, in which RIIN and the generator mutually constrain and optimize each other, greatly enhances the generation ability of the network. To facilitate adversarial training of the enhanced generator, we correspondingly design a more powerful discriminator. This study adopts a single-stage generative structure as the basic framework illustrated in Fig. 2, which contains a text encoder, an adaptive affine-based generator, a dual-channel feature alignment discriminator, and a reversed image interaction network. The text encoder is a pre-trained BiLSTM [25] for text representation learning, responsible for extracting sentence vectors. It is trained with the help of DAMSM [32]. The DAMSM module maps images and sentences to a shared semantic space, measuring image-text similarity at an identical dimension. This enhances the precision of the alignment between the BiLSTM features and the generated image features during training. As for the generator, we select six UP Blocks and one UP Block0 to obtain \(256\times 256\) images. Each UP Block contains two AABs, which will be introduced in Sect. 3.2. The UP Block0 accepts the noise vector sampled from the normal distribution. The proposed RIIN contains five Down Blocks. It reversely transmits the feature information of the actual image to the generator through feature alignment of the inversely transformed feature maps, thus avoiding mode collapse. The proposed DFAD conducts adversarial training with the generator and RIIN.
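To make the overall data flow concrete, the following minimal PyTorch-style skeleton sketches how the four components could be wired together during training. All class names and signatures, and the assumption that the generator and RIIN expose their multi-scale intermediate feature maps, are illustrative inferences from Fig. 2 rather than the authors' implementation.

```python
# Minimal structural sketch of RII-GAN's four components (illustrative names only).
import torch
import torch.nn as nn

class RIIGAN(nn.Module):
    def __init__(self, text_encoder, generator, riin, discriminator):
        super().__init__()
        self.text_encoder = text_encoder    # pre-trained BiLSTM, frozen in phase 2
        self.generator = generator          # UP Block0 + six UP Blocks with AABs
        self.riin = riin                    # five Down Blocks (reversed interaction)
        self.discriminator = discriminator  # dual-channel feature alignment (DFAD)

    def forward(self, tokens, z, real_images):
        s = self.text_encoder(tokens)              # sentence vector, e.g. (N, 256)
        fake, gen_feats = self.generator(z, s)     # image + 8/32/128 intermediate maps
        riin_feats = self.riin(real_images, s, z)  # reversed 128 -> 32 -> 8 features
        return fake, gen_feats, riin_feats
```

The later sketches assume this interface, i.e., that `generator(z, s)` also returns the intermediate feature maps that are aligned with RIIN.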

3.1 Reversed Image Interaction Network: RIIN

The optimization goal of GAN is to achieve the Nash equilibrium between the generator and discriminator, which often leads to mode collapse due to unstable training. Although some methods [1, 2, 20] attempt to improve the stability of GAN training, few reports focus on modifying the network architecture. To make the most of the prior knowledge in text-image training pairs and free T2I models from the dilemma of lacking opportunities to learn the image feature manifold structure, we propose the reversed image interaction network in this work. This study changes the basic generator-discriminator structure of traditional GANs by embedding the proposed RIIN so that RIIN can indirectly provide prior knowledge of actual images. RIIN incorporates a custom-designed convolutional network whose feature maps are shaped to match the generator's outputs during the forward computation process. This feature allows modification of the output values of the intermediate feature maps via affine transformation. The alignment feature in our model is domain-agnostic, operating in the intermediate domains between text and image throughout the image generation process. RIIN updates the generator through reversed transformation in cooperation with the GAN mechanism. Thus, during training, RIIN can provide a stable image manifold structure based on the capabilities of the improved generator and discriminator. If the only purpose of RIIN were to provide image feature manifolds, pre-trained models with fixed parameters migrated from large datasets, such as ResNet-101 [6] or ViT-L/16 [4], could improve the ability to extract image feature manifolds at the early stage of training. However, such large pre-trained models are built on a single image modality; if the generation network features are forced to align with those of a pre-trained model, the results may not be satisfactory. Since the T2I generator involves cross-modal transformation, there must be an intermediate domain in the transformation process from the text domain to the image domain. The intermediate domain is a latent spatial domain that contains both text and image manifolds but does not favor either side. In the training stage, the intermediate domain cannot be controlled through external supervision; it is part of the latent space learned by GAN. This is the main reason many works use the GAN mechanism for implicit learning of image distributions. Monitoring the learning results of the intermediate domain is challenging because valid labels and metrics are lacking. Therefore, traditional GANs are prone to mode collapse, vanishing gradients, etc. Fortunately, in the training phase, we can easily access three information sources related to the intermediate domain: the random noise distribution, the text distribution, and the actual image distribution. With the help of the GAN mechanism, we can obtain weakly-supervised labels from the proposed RIIN for intermediate domain learning by effectively using these three information sources. We develop the CBN-based Affine Block in RIIN as shown in Fig. 2. It aims to make the network adaptively approach the intermediate domain representation and provide image information of feature maps from large to small. Moreover, because this is a cross-modal task, the information contained in the generator's feature maps at different resolutions should differ: a larger feature map (\(128\times 128\)) carries more image information, while a smaller feature map (\(8\times 8\)) carries more text and noise information. The generator and RIIN are mutually constrained and optimized for each other.
In the proposed RIIN, the actual image \(\hat{I}\) (\(256\times 256\)) is sent to a convolution layer to obtain a \(256\times 256\) feature map \(h_0=\mathrm{conv}(\hat{I})\). \(h_0\) is then conveyed to five Down Blocks in turn. To quickly approach the intermediate domain features in the generator and make better use of the real image prior, as shown in the lower right corner of Fig. 2, each Down Block has four inputs: the output feature \(h_r\) from the upper Down Block, the real image \(\hat{I}^{M \times M}\), the sentence vector s, and the noise distribution z. Because we want to approach the implicit space with text features, image features, and the input random noise distribution, this study incorporates all three information-source feature components and one upper-level feature into the Down Block. The affine layer is a CBN framework inspired by [29], which takes the sentence vector s as input and fuses the textual information into the feature map. Several parameters of CBN are defined as follows.
$$\begin{aligned} {\gamma _r} = \mathrm{{MLP1}}(s,z),\quad {\beta _r} = \mathrm{{MLP2}}(s,z),\quad {{\hat{h}}_r} = {\gamma _r} \times {h'_r} + {\beta _r}, \end{aligned}$$
(1)
where \( {\gamma _r} \in {{ {\mathbb {R}} }^{{N \times C_{in}}}} \) is the channel-wise scaling parameter, \({\beta _r} \in {{ {\mathbb {R}} }^{{N \times C_{in}}}}\) is the shifting parameter. \( {h'_r} \in {{ {\mathbb {R}} }^{N \times {C_{in}} \times H \times W}} \) and \( {{\hat{h}}_r} \in {{ {\mathbb {R}} }^{N \times {C_{in}} \times H \times W}} \) are the input and output of the affine layer. \( {\hat{h}}_r \) will be used to concatenate with the real image feature map to obtain the final Down Block output feature \( {h_r} \in {{ {\mathbb {R}} }^{N \times {C_{out}} \times H \times W}} \). The dimensions of RIIN output features are respectively \(8 \times 8\), \(32\times 32\) and \(128\times 128 \), and these features will align with generator features of the same size. The feature alignment loss computes the mean square error between features for shortening the feature distance. This loss is used to establish a bi-directional constraint between the generator and RIIN, which is defined by
$$\begin{aligned} \begin{aligned} {\mathcal{L}_{{G_f}}} = \Vert h_r^{8 \times 8} - h_g^{8 \times 8}{\Vert _2}+{\lambda _1}\Vert h_r^{32 \times 32}- h_g^{32 \times 32}{\Vert _2} + {\lambda _2}\Vert h_r^{128 \times 128} - h_g^{128 \times 128}{\Vert _2}, \end{aligned} \end{aligned}$$
(2)
where \( h_r^{M \times M} \) and \( h_g^{M \times M} \) represent the \( M \times M \) feature map of the RIIN and the generator, respectively. \({\Vert \cdot \Vert _2}\) represents the mean square error loss. Later experiments will discuss the impact of two hyper-parameters \(\lambda _1 \) and \(\lambda _2\).
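The following is a hedged PyTorch sketch of Eqs. (1) and (2): a CBN-style affine layer whose scale and shift are predicted from the sentence vector and noise, and the multi-scale alignment loss between RIIN and generator features. The hidden sizes, the two-layer MLPs, and the dictionary-keyed feature interface are our assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalAffine(nn.Module):
    """CBN-style affine layer of Eq. (1): channel-wise scale/shift predicted from (s, z)."""
    def __init__(self, cond_dim, num_channels, hidden=256):
        super().__init__()
        self.mlp_gamma = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, num_channels))
        self.mlp_beta = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, num_channels))

    def forward(self, h, s, z):
        cond = torch.cat([s, z], dim=1)                           # (N, cond_dim)
        gamma = self.mlp_gamma(cond).unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1)
        beta = self.mlp_beta(cond).unsqueeze(-1).unsqueeze(-1)
        return gamma * h + beta                                   # Eq. (1)

def alignment_loss(riin_feats, gen_feats, lambda1=2.0, lambda2=3.0):
    """Multi-scale feature alignment of Eq. (2); the dicts are keyed by spatial size."""
    return (F.mse_loss(gen_feats[8], riin_feats[8])
            + lambda1 * F.mse_loss(gen_feats[32], riin_feats[32])
            + lambda2 * F.mse_loss(gen_feats[128], riin_feats[128]))
```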

3.2 Adaptive Affine-Based Generator

According to [34], isolated fusion modules conflict with each other, increasing the training difficulty. Therefore, this study designs an adaptive affine-based generator consisting of six UP Blocks and one UP Block0, with two AABs in each UP Block. When the proposed adaptive affine-based generator performs forward computation, the text embedding is updated by the Adaptive Block in each UP Block (Fig. 2) and input into the subsequent affine layers. Figure 3 shows the architecture of the Adaptive Block. This adaptive structure links isolated affine layers to the network backbone. It provides the affine transformation with a path to obtain the current layer feature and establishes long-term dependence with other network modules to reduce the training difficulty. The information enhancement of the AAB is mainly completed in the Adaptive Block. As shown in Fig. 3, the Adaptive Block has two inputs: the current layer feature map \( h_{g}^{{'M} \times M} \) and the sentence vector s. The convolution kernel size is the same as the spatial dimension of the feature map, and \(h_{g}^{'1} \in {{{\mathbb {R}} }^{N\times 100}} \) is the output of the depth-wise convolution. The semantic condition \({s_{con}} \in {{ {\mathbb {R}} }^{N \times 356}} \), the concatenation of \( h_{g}^{'1} \) and s, is conveyed into a multilayer perceptron (MLP). We use Tanh as the final activation function to compute an attention map \(s_{attn}\) with values in \((-1, 1)\). Finally, we adopt a residual structure to acquire the adaptively information-enhanced sentence vector \(s'\) using
$$\begin{aligned} \begin{aligned} s' = s \times {s_{attn}} + s. \end{aligned} \end{aligned}$$
(3)
\(s'\) is then used to conduct the affine transformation. The channel-wise scaling parameter \({\gamma _g} \in {{ {\mathbb {R}} }^{{N\times C_{in}}}}\) and the shifting parameter \({\beta _g} \in {{ {\mathbb {R}} }^{{N\times C_{in}}}}\) are defined as follows
$$\begin{aligned} {\gamma _g} = \mathrm{{MLP1}}(s',z), \quad {\beta _g} = \mathrm{{MLP2}}(s',z),\quad {{\hat{h}}_g} = {\gamma _g} \times {h'_g} + {\beta _g} \end{aligned}$$
(4)
where \( {h'_g} \in {{ {\mathbb {R}} }^{N \times {C_{in}} \times H \times W}} \) and \( {{\hat{h}}_g} \in {{ {\mathbb {R}} }^{N \times {C_{in}} \times H \times W}} \) are the input and output of affine layer, respectively.
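A hedged sketch of the Adaptive Block (Eq. (3)) follows; the adaptive affine of Eq. (4) can then reuse the conditional affine layer sketched in Sect. 3.1 with \(s'\) in place of s. The 100-dimensional depth-wise output and 356-dimensional condition follow the text, but the pooling-based approximation of the full-size depth-wise convolution and the MLP layout are our assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveBlock(nn.Module):
    """Sketch of the Adaptive Block (Eq. (3)): re-weights the sentence vector using the
    current feature map. The pooling + linear projection stands in for the full-size
    depth-wise convolution described in the text (an assumption)."""
    def __init__(self, in_channels, sent_dim=256, feat_dim=100):
        super().__init__()
        self.squeeze = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(in_channels, feat_dim))
        self.mlp = nn.Sequential(nn.Linear(feat_dim + sent_dim, sent_dim), nn.ReLU(),
                                 nn.Linear(sent_dim, sent_dim), nn.Tanh())

    def forward(self, h_g, s):
        h1 = self.squeeze(h_g)              # (N, 100) summary of the current layer
        s_con = torch.cat([h1, s], dim=1)   # (N, 356) semantic condition
        s_attn = self.mlp(s_con)            # attention values in (-1, 1)
        return s * s_attn + s               # Eq. (3): residual text enhancement
```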

3.3 Dual-Channel Feature Alignment Discriminator: DFAD

This study proposes a dual-channel feature alignment discriminator to enhance the semantic consistency between generated images and text descriptions. Since the major role of the discriminator in the T2I task is to capture the degree of matching between text information and the image feature map, the model should perform internal disentangling and alignment operations on image and text features. Extracting the matching information of the two modalities is inherently coupled: extracting matched text information with fixed image features inevitably introduces the independent semantic part (redundant data) of the image features, and extracting image-related features with a fixed text embedding has a similar problem. Therefore, the matching information of the two modalities should be extracted jointly, with each modality considering the other. Yin et al. [35] utilized a Siamese structure and contrastive loss to forcibly improve the matching degree of image features and text embedding features, but their strategy consumes enormous computing resources. Therefore, this study designs a new discriminator framework to improve the disentangling capability for the bi-modal information of image and text. In the proposed discriminator, we need image and text feature information that can match each other. Different objects require different key information to be matched, so without a feature selection mechanism the discriminator may obtain unmatched features during feature disentangling; feature alignment is therefore required to guide feature extraction. Each of the two channels consists of the disentangled feature of one modality (text/image) and the original feature of the other (image/text). To improve the model's ability to select correctly matched text and image features, we perform feature alignment constraints on the final disentangled fused features of the two channels, as shown in Fig. 4.
The structure of the proposed DFAD is shown in Fig. 4. First, the initial image feature map P is obtained through a series of Residual Down blocks. P and the extended sentence embedding \( S^{8 \times 8} \) are concatenated and sent to the multilayer perceptrons of two different channels (MLP1 and MLP2) to obtain the upper channel's sentence attention map \(M_S\) and the lower channel's image spatial attention map \(M_I\), respectively. We utilize \(M_S\) and \(M_I\) to obtain two disentangled feature maps \( {\hat{h}}_{d\_image}^{8 \times 8} \) and \( {\hat{h}}_{d\_text}^{8 \times 8} \) using
$$\begin{aligned} {M_S} = \mathrm{{MLP1}}(h_{d\_in}^{8 \times 8}),\quad {M_I} = \mathrm{{MLP2}}(h_{d\_in}^{8 \times 8}),\\ {\hat{h}}_{d\_image}^{8 \times 8} = {M_S} \times {S^{8 \times 8}}, \quad {\hat{h}}_{d\_text}^{8 \times 8} = {M_I} \times P \end{aligned}$$
(5)
Then, we send the concatenation of \( {\hat{h}}_{d\_image}^{8 \times 8} \) with \( S^{8 \times 8} \) and of \( {\hat{h}}_{d\_text}^{8 \times 8} \) with P to two different residual blocks to obtain the feature maps \( h_{d\_{f_1}}^{4 \times 4} \) and \( h_{d\_{f_2}}^{4 \times 4} \), respectively. These feature maps are used for feature alignment. This study uses the mean square error loss \( {\mathcal{L}_{{D_f}}} \) to represent the feature distance.
$$\begin{aligned} \begin{aligned} {\mathcal{L}_{{D_f}}} = {\left\| {h_{d\_{f_1}}^{4 \times 4} - h_{d\_{f_2}}^{4 \times 4}} \right\| _2} \end{aligned} \end{aligned}$$
(6)
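Below is a hedged sketch of the dual-channel head described by Eqs. (5)-(6). Implementing the two MLPs as 1x1 convolutions, injecting the two residual down blocks from outside, and the exact channel sizes are assumptions; only the attention-weighted feature selection and the MSE alignment follow the equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualChannelHead(nn.Module):
    """Sketch of the DFAD dual-channel selection (Eq. (5)) and alignment loss (Eq. (6)).
    res_block_1/2 are externally supplied 8x8 -> 4x4 residual blocks that must produce
    feature maps of identical shape."""
    def __init__(self, img_channels, sent_dim, res_block_1, res_block_2):
        super().__init__()
        in_ch = img_channels + sent_dim
        self.mlp1 = nn.Sequential(nn.Conv2d(in_ch, sent_dim, 1), nn.LeakyReLU(0.2),
                                  nn.Conv2d(sent_dim, sent_dim, 1))
        self.mlp2 = nn.Sequential(nn.Conv2d(in_ch, img_channels, 1), nn.LeakyReLU(0.2),
                                  nn.Conv2d(img_channels, img_channels, 1))
        self.res1, self.res2 = res_block_1, res_block_2

    def forward(self, P, s):
        S = s[:, :, None, None].expand(-1, -1, P.size(2), P.size(3))  # replicate s over 8x8
        h_in = torch.cat([P, S], dim=1)
        M_S, M_I = self.mlp1(h_in), self.mlp2(h_in)        # Eq. (5): attention maps
        h_image = M_S * S                                  # \hat{h}_{d_image} of Eq. (5)
        h_text = M_I * P                                   # \hat{h}_{d_text} of Eq. (5)
        f1 = self.res1(torch.cat([h_image, S], dim=1))     # upper channel, 4x4
        f2 = self.res2(torch.cat([h_text, P], dim=1))      # lower channel, 4x4
        return f1, f2, F.mse_loss(f1, f2)                  # Eq. (6): L_Df
```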
Since the discriminator adopts a dual-channel structure, the result of the forward computation is the average of the dual-channel output, as shown in Fig. 4. This study uses the adversarial hinge loss \( {\mathcal{L}_{{D_{adv}}}} \) [13] and MA-GP loss \( {\mathcal{L}_{{D_{MA - GP}}}} \) [29] for the proposed discriminator.
$$\begin{aligned} {\mathcal{L}_{{D_{adv}}}} =&- {{\mathbb {E}}_{{\hat{I}} \sim {P_r}}}[\min (0, - 1 + D({\hat{I}},s))] - {{\mathbb {E}}_{G(z) \sim {P_g}}}[\min (0, - 1 - D(G(z),s))]\\&- {{\mathbb {E}}_{{\hat{I}} \sim {P_{mis}}}}[\min (0, - 1 - D({\hat{I}},s))] \end{aligned}$$
(7)
$$\begin{aligned} {\mathcal{L}_{{D_{MA - GP}}}} = {{\mathbb {E}}_{{\hat{I}} \sim {P_r}}}[{({\Vert {{\nabla _{{\hat{I}}}}D({\hat{I}},s)} \Vert _2} + {\Vert {{\nabla _s}D({\hat{I}},s)} \Vert _2})^p}] \end{aligned}$$
(8)
The generator adversarial loss \( {\mathcal{L}_{{G_{adv}}}} \) is defined by
$$\begin{aligned} \begin{aligned} {\mathcal{L}_{{G_{adv}}}} = - {{\mathbb {E}}_{G(z) \sim {\textrm{P}_g}}}[D(G(z),s)] \end{aligned} \end{aligned}$$
(9)
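The three adversarial terms of Eqs. (7)-(9) could be realized as sketched below; the MA-GP exponent p and weighting coefficient are assumptions, since the paper does not state their values here.

```python
import torch

def d_hinge_loss(d_real, d_fake, d_mismatch):
    """Eq. (7): hinge loss over matching real, generated, and mismatched pairs."""
    return (torch.relu(1.0 - d_real).mean()
            + torch.relu(1.0 + d_fake).mean()
            + torch.relu(1.0 + d_mismatch).mean())

def ma_gp_loss(D, real_images, sent, p=6, coeff=2.0):
    """Eq. (8): gradient penalty on (real image, matching text) pairs.
    The exponent p and weight coeff are assumptions."""
    real_images = real_images.detach().requires_grad_(True)
    sent = sent.detach().requires_grad_(True)
    out = D(real_images, sent)
    grad_img, grad_sent = torch.autograd.grad(outputs=out.sum(),
                                              inputs=(real_images, sent),
                                              create_graph=True)
    grad_norm = grad_img.flatten(1).norm(dim=1) + grad_sent.flatten(1).norm(dim=1)
    return coeff * (grad_norm ** p).mean()

def g_adv_loss(d_fake):
    """Eq. (9): generator adversarial loss."""
    return -d_fake.mean()
```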
Finally, we obtain the loss functions of discriminator, generator, and RIIN as follows:
$$\begin{aligned} {\mathcal{L}_D} = {\mathcal{L}_{{D_{adv}}}} + {\mathcal{L}_{{D_{MA - GP}}}} + {\mathcal{L}_{{D_f}}} \end{aligned}$$
(10)
$$\begin{aligned} {\mathcal{L}_G} = {\mathcal{L}_{{G_{adv}}}} + {\mathcal{L}_{{G_f}}},\quad \mathcal{L}_{RIIN}=\mathcal{L}_G. \end{aligned}$$
(11)
To better understand the training process, we present the implementation details of the proposed RII-GAN framework with the following pseudo-code.
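The algorithm block referred to above is not reproduced in this extracted text, so the following is a hedged reconstruction of one training step, reusing the loss sketches from Sects. 3.1-3.3. The discriminator is assumed to return a realism score together with the dual-channel alignment loss of Eq. (6); all other names are illustrative.

```python
# Hedged reconstruction of one RII-GAN training step (illustrative names; the
# discriminator is assumed to return (score, L_Df) for a given image-text pair).
import torch

def train_step(real, tokens, mismatched, text_encoder, generator, riin, discriminator,
               opt_D, opt_G, noise_dim=100, lambda1=2.0, lambda2=3.0):
    s = text_encoder(tokens).detach()                    # frozen text encoder (phase 2)
    z = torch.randn(real.size(0), noise_dim, device=real.device)

    # Discriminator update, Eq. (10): hinge + MA-GP + dual-channel alignment
    fake, _ = generator(z, s)
    d_real, align_real = discriminator(real, s)
    d_fake, _ = discriminator(fake.detach(), s)
    d_mis, _ = discriminator(mismatched, s)
    gp = ma_gp_loss(lambda img, txt: discriminator(img, txt)[0], real, s)
    loss_D = d_hinge_loss(d_real, d_fake, d_mis) + gp + align_real
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator + RIIN update, Eq. (11): adversarial + multi-scale alignment (Eq. (2))
    fake, gen_feats = generator(z, s)                    # 8/32/128 intermediate maps
    riin_feats = riin(real, s, z)                        # reversed image interaction
    d_fake, _ = discriminator(fake, s)
    loss_G = g_adv_loss(d_fake) + alignment_loss(riin_feats, gen_feats, lambda1, lambda2)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()   # opt_G covers both G and RIIN
    return loss_D.item(), loss_G.item()
```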

4 Experimental Results

In this section, we first introduce the two benchmark datasets, the evaluation metrics, and the implementation details. Then, we compare the proposed RII-GAN with several state-of-the-art GAN-based T2I algorithms, including PCCM-GAN [18], DF-GAN [29], DM-GAN [41], DTGAN [39], SAM-GAN [16], MirrorGAN [19], DR-GAN [28], DiverGAN [40], KD-GAN [17], SSA-GAN [12], DAE-GAN [23], AttnGAN [32], and StackGAN++ [37]. Sections 4.4 and 4.9 discuss the parameter selection and a series of ablation studies.

4.1 Datasets

We conduct all experiments on two standard T2I datasets: CUB-Bird [31] and MS-COCO [14]. The former has 8,855 training images (150 categories) and 2,933 test images (50 categories), and each bird image is described by ten English sentences covering fine-grained visual details. The latter consists of 82,783 training images and 40,504 test images, each with five sentence annotations. Compared with the CUB-Bird dataset, images in MS-COCO present complex visual scenes, which makes the T2I generation task more challenging.

4.2 Evaluation Metrics

We use three widely adopted metrics, CLIPScore (CS) [7], Inception Score (IS) [24], and Fréchet Inception Distance (FID) [8], to quantify the performance of all approaches in terms of image-text alignment and image quality.
Image-text alignment: CS adopts a pre-trained CLIP model [21] to map image and text into the same feature space and calculates the cosine similarity between the text feature and image feature. A larger CS indicates that the generated image has a more significant semantic similarity to the text.
Image quality: IS evaluates image quality by calculating the KL divergence between the conditional label distribution and the marginal label distribution of the generated images. A larger IS indicates that the generated images are of higher quality in terms of authenticity and diversity. FID measures the distribution consistency between generated and authentic images; a lower FID means the generated images are closer to the authentic ones.
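For reference, CS can be computed from pre-extracted CLIP embeddings roughly as sketched below. The 2.5 rescaling weight follows Hessel et al. [7]; the 0-100 scale of the scores reported in Table 1 suggests a different scaling convention may have been used, so treat the exact scaling as an assumption.

```python
import torch
import torch.nn.functional as F

def clip_score(image_emb, text_emb, w=2.5):
    """CS on pre-extracted CLIP embeddings: w * max(cos(image, text), 0), averaged."""
    image_emb = F.normalize(image_emb, dim=-1)   # (N, D) CLIP image features
    text_emb = F.normalize(text_emb, dim=-1)     # (N, D) CLIP text features
    cos = (image_emb * text_emb).sum(dim=-1)     # per-pair cosine similarity
    return (w * cos.clamp(min=0)).mean()
```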

4.3 Implementation Details

Our proposed model RII-GAN is implemented using the PyTorch toolbox. The model training consists of two distinct phases. In the initial phase, two independent BiLSTMs [25] are trained as text encoders, leveraging the DAMSM approach for computing image-text similarity [2]. Specifically, we employ DAMSM to train a pair of text and image encoders that align with each other, thereby obtaining an efficient text encoder. Each BiLSTM is trained on the CUB-Bird and MS-COCO datasets using a single RTX 3090 GPU. The BiLSTM outputs global sentence vectors with 256 dimensions, which are used as text embeddings for the subsequent generator, RIIN, and discriminator. In the second phase, we freeze the text encoder and train the GAN. We use a batch size of 32 for both the CUB-Bird and MS-COCO datasets. The generator and RIIN are trained simultaneously using the Adam optimizer [9], with parameters \(\beta _1\) and \(\beta _2\) set to 0.5 and 0.999, respectively. The optimizer starts with a learning rate of \(2e-4\), which linearly decays to \(5e-5\) over the final one-third of the training epochs. In parallel, the discriminator also employs the Adam optimizer with \(\beta _1\) and \(\beta _2\) set to 0.5 and 0.999, respectively, and a fixed learning rate of \(4e-4\). Throughout the second phase, RIIN is trained synchronously with the generator and alternately with the discriminator. The hyperparameters \({\lambda _1}\) and \({\lambda _2}\) are set to 2 and 3, respectively; the subsequent experimental sections discuss the choice of these parameters. The model is trained for 600 epochs on the CUB-Bird dataset and 200 epochs on the MS-COCO dataset. Only the generator is retained for image generation during the testing phase.
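A hedged sketch of the optimizer and learning-rate setup described above is given below; the exact shape of the linear decay and the per-epoch scheduler stepping are our assumptions.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

def build_optimizers(generator, riin, discriminator, total_epochs=600, decay_start=400):
    # G and RIIN share one Adam optimizer (lr 2e-4), the discriminator uses lr 4e-4.
    opt_G = Adam(list(generator.parameters()) + list(riin.parameters()),
                 lr=2e-4, betas=(0.5, 0.999))
    opt_D = Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.999))

    def lr_lambda(epoch):
        # Keep lr constant, then decay linearly from 2e-4 towards 5e-5 over the
        # final third of training (scale factor ends at 5e-5 / 2e-4 = 0.25).
        if epoch < decay_start:
            return 1.0
        frac = (epoch - decay_start) / max(1, total_epochs - decay_start)
        return 1.0 - frac * (1.0 - 0.25)

    scheduler_G = LambdaLR(opt_G, lr_lambda=lr_lambda)  # call scheduler_G.step() per epoch
    return opt_G, opt_D, scheduler_G
```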

4.4 Parametric Sensitivity Analysis

Table 1
Quantitative comparison of different models on CUB-Bird and MS-COCO datasets

Methods | CUB-Bird FID\(\downarrow \) | CUB-Bird IS\(\uparrow \) | CUB-Bird CS\(\uparrow \) | MS-COCO FID\(\downarrow \) | MS-COCO CS\(\uparrow \)
AttnGAN (CVPR’18) | 23.98 | 4.26±.03 | 70.83 | 35.49 | 52.58
StackGAN++ (TPAMI’18) | 15.30 | 4.04±.06 | – | 81.59 | –
MirrorGAN (CVPR’19) | 18.34 | 4.56±.05 | – | 34.71 | –
DM-GAN (CVPR’19) | 16.09 | 4.75±.07 | 71.91 | 32.64 | 53.23
PCCM-GAN (Neuro’21) | 22.15 | 4.65±.20 | – | 33.59 | –
SAM-GAN (Neural Network’21) | 20.49 | 4.61±.03 | – | 33.41 | –
DTGAN (IJCNN’21) | 16.35 | 4.88±.03 | – | 23.61 | –
DAE-GAN (ICCV’21) | 15.29 | 4.42±.04 | 71.8 | 28.12 | 54.04
KD-GAN (TMM’21) | 13.89 | 4.90±.06 | – | 23.92 | –
SSA-GAN (CVPR’22) | 15.61 | 5.17±.08 | – | 19.37 | –
DR-GAN (TNNLS’22) | 14.96 | 4.90±.05 | – | 27.8 | –
DF-GAN (CVPR’22) | 14.81 | 5.10±.– | 70.63 | 19.32 | 61.91
DiverGAN (Neuro’22) | 15.63 | 4.98±.06 | – | 20.52 | –
RII-GAN (Ours) | 12.94 | 5.41±.02 | 70.85 | 19.01 | 60.28
The italic values denote the best results
In this study, \( \lambda _1 \) and \( \lambda _2 \) are two important hyper-parameters in the feature alignment loss \( \mathcal{L}_{{G_f}} \) of the generator and RIIN. According to (2), these parameters guide the generator and RIIN to pay more attention to the feature scale corresponding to the larger parameter. Figure 5 shows the FID scores in the last 200 training epochs on the CUB-Bird dataset with different \(\lambda _1\) and \(\lambda _2\). The first row evaluates the effect of \(\lambda _1\) when \(\lambda _2\) is fixed at 3, and the second row investigates the impact of \(\lambda _2\) when \(\lambda _1\) is fixed at 2. The best FID score for each parameter setting is marked with red numerals. Empirically, we observe from the first and second rows of Fig. 5 that when \(\lambda _1=1,1.5,2.5,3\) (\(\lambda _2=3\)) or \(\lambda _2=2,2.5,3.5,4\) (\(\lambda _1=2\)), the FID scores are unstable and fluctuate noticeably as the epochs increase. Figure 5j and k summarize the optimal and average FID scores obtained in the last 200 training epochs for different values of \(\lambda _1\) and \(\lambda _2\). We observe that when \(\lambda _1=2\) and \(\lambda _2=3\), our model's average and optimal FID scores reach their minimum, indicating that the network reaches a local minimum. As shown in Fig. 5i, with \(\lambda _1=2\) and \(\lambda _2=3\), the model obtains the minimum FID score with relative stability, especially in the last 100 epochs. Therefore, \(\lambda _1\) and \(\lambda _2\) are set to 2 and 3 in all experiments.

4.5 Quantitative Results

Table 1 shows the quantitative results of our RII-GAN and several advanced T2I models on the CUB-Bird and MS-COCO datasets. The results of these T2I models on the two datasets come from their publicly available codes. On the CUB-Bird dataset, our RII-GAN obtains the third-highest CS, lower than the baseline DM-GAN, which achieves the best CS (71.91) and thus better text-image consistency maintenance. Our RII-GAN achieves the lowest FID score (12.94) and the highest IS (5.41) on CUB-Bird, showing that the images generated by our RII-GAN are closer to the actual image distribution in terms of image quality. On the large-scale MS-COCO dataset, our RII-GAN also achieves competitive results. The statistics in the second-to-last column of Table 1 show that our RII-GAN achieves the lowest FID score of 19.01 among current state-of-the-art methods, which indicates the ability to generate more realistic images. The CS of our RII-GAN is much higher than that of AttnGAN, DM-GAN, and DAE-GAN, and is almost on the same level as the state-of-the-art DF-GAN; relative to the other methods, this gain is larger than on the CUB-Bird dataset. This illustrates that, with the help of RIIN, AAB, and DFAD, our RII-GAN is much superior to AttnGAN, DM-GAN, and DAE-GAN in terms of semantic consistency on large datasets. The three metrics on both datasets demonstrate that our RII-GAN can generate higher-quality images.

4.6 Qualitative Results

In this section, we visualize images synthesized by our RII-GAN and three other advanced models, AttnGAN [32], DM-GAN [41], and DF-GAN [29], on the CUB-Bird and MS-COCO datasets, as shown in Fig. 6. AttnGAN and DM-GAN are two classical multi-stage generation methods, while DF-GAN and our RII-GAN are one-stage generation methods. The images in each column are generated by the different approaches from the text shown at the top of the column; these text descriptions are randomly selected from the datasets. In terms of semantic consistency, our method captures more textual details. As shown in the first column of Fig. 6, our RII-GAN captures the “yellow on the tips of its wings” detail in “green color with yellow on the tips of its wings” and reflects it in the generated image, which the other methods fail to do. In the third column, the images generated by AttnGAN, DM-GAN, and DF-GAN show considerable overlap between the water scene and the bird's body; by contrast, the images generated by our RII-GAN reflect the details and colors of the bird on the lake. For the more challenging MS-COCO dataset, our RII-GAN can still generate images of complex scenes; for example, it successfully focuses on the “clock tower” keyword, whereas AttnGAN, DM-GAN, and DF-GAN only generate objects with inaccurate shapes from the keyword “tower” (the fifth column). In columns 6 and 7, the “bulls” and “zebras” are visible in our method but hard to recognize in the others. As shown in the eighth column, the proposed model better reflects the meaning of the text, such as “a group of people”. Overall, these subjective visual comparisons confirm the effectiveness of the proposed RII-GAN.

4.7 User Evaluation

Since there is no adequate, convincing criterion for judging the image-text semantic consistency of the generated images, we conducted a human test, shown in Fig. 7. The experiment randomly selected 100 text descriptions from the CUB-Bird dataset, and 30 volunteers were invited to compare the results of our RII-GAN and two popular baselines, AttnGAN [32] and DF-GAN [29]. In this test, each user receives three images generated by the three models together with the corresponding text description and ranks the images according to realism and text alignment. As summarized in Fig. 7, our RII-GAN achieved the highest rankings of \(50.46\%\) and \(63.08\%\) for text alignment and realism, respectively. These results show that our RII-GAN obtains better approval in human perception.

4.8 Performance Analysis

Visualization by t-SNE analysis. To investigate the feature distributions of the RIIN and the proposed adaptive affine-based generator, we utilize the t-SNE algorithm to visualize the feature alignment between the generator (blue triangles) and our RIIN (orange circles) on the CUB-Bird dataset. Figure 8a shows the two-dimensional distribution of the noise \(z\sim N(0,1)\) and the real images before forward computation. Figure 8b, c, d show the feature distributions at different scales. The feature distance between the generator and RIIN is large in the initial stage; however, as the forward computation of RIIN goes deeper into the reversed image interaction process (from \( 128 \times 128 \) to \( 8 \times 8 \)), the feature distributions of both modules successfully converge on the same intermediate domain (see Fig. 8c, d).
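The t-SNE visualization can be reproduced along the following lines; flattening the intermediate feature maps, the scikit-learn hyperparameters, and the marker choices are our assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_alignment(gen_feats, riin_feats, title="8x8 features"):
    """Project flattened generator and RIIN feature maps to 2-D with t-SNE."""
    X = np.concatenate([gen_feats, riin_feats], axis=0)      # (2N, C*H*W)
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(X)
    n = gen_feats.shape[0]
    plt.scatter(emb[:n, 0], emb[:n, 1], marker="^", label="generator")  # triangles
    plt.scatter(emb[n:, 0], emb[n:, 1], marker="o", label="RIIN")       # circles
    plt.legend(); plt.title(title); plt.show()
```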
Image diversity. The randomness of the noise z controls the diversity of the generated images; z is sampled from a Gaussian distribution N(0, 1). To explore the influence of z on the generated images, we conducted control-variable experiments on the text input and random noise z of our RII-GAN, as shown in Fig. 9. In the first setting, the random noise z is fixed and the sentence embedding is varied by changing specific words in the text description. From the generated images in Fig. 9, we find that when the color attribute in the description changes, the images generated by RII-GAN keep all other elements (such as the bird's pose and background) unchanged. This demonstrates the high controllability of our method. In the second setting, the text embedding is fixed within the same column and the random noise z is varied; we find that different z significantly impacts the object pose and background of the generated image, and by varying z the generator synthesizes bird images with different angles and backgrounds. Therefore, the results in Fig. 9 demonstrate that our RII-GAN can effectively disentangle the attributes of the input text description and control the modeling of the relevant regions without affecting the generation of non-relevant regions. Figure 10 presents more examples of images synthesized by the proposed RII-GAN on the CUB-Bird dataset.

4.9 Ablation Study

Impact of RIIN, DFAD and AAB. The proposed framework comprises three main modules: AAB, RIIN, and DFAD. We carried out ablation experiments on the CUB-Bird dataset to validate the impact of each module in our RII-GAN. The baseline model is a one-stage GAN-based T2I network, and the evaluation metrics, FID and IS, are summarized in Table 2. The role of AAB is to use the current feature layer to adaptively provide a basis for updating the text embedding. Without (w/o) AAB, the FID and IS scores are 13.25 and 5.37±0.03 on the CUB-Bird dataset, while the FID on the MS-COCO dataset is 21.23; this verifies that adding AAB to the framework helps generate more realistic images. We also investigate our model with (w/) AAB and RIIN but without DFAD: the FID increases from 12.94 to 14.03 and the IS decreases from 5.41 to 5.39 on the CUB-Bird dataset, and the FID increases from 19.01 to 20.03 on the MS-COCO dataset. It can be concluded from the table that the dual-channel feature alignment structure is beneficial for the semantic extraction and matching of the discriminator. Finally, we evaluate the effect of the RIIN module. When RIIN is removed from our model, the FID increases significantly by 9.89% and the IS decreases by 2.96% on the CUB-Bird dataset, while the FID increases substantially by 11.68% on the MS-COCO dataset. The results in Table 2 reflect that the proposed RIIN has a more significant impact on the performance of our framework than AAB and DFAD.
Table 2
The performance of different components of our model on CUB-Bird and MS-COCO datasets, in which “AAB” means adaptive affine block, “DFAD” represents dual-channel feature alignment discriminator, and “RIIN” stands for reversed image interaction network

RIIN | DFAD | AAB | CUB-Bird FID\(\downarrow \) | CUB-Bird IS\(\uparrow \) | MS-COCO FID\(\downarrow \)
w/o | w/o | w/o | 15.89 | 5.12±0.01 | 23.47
w/ | w/ | w/o | 13.25 | 5.37±0.03 | 21.23
w/ | w/o | w/ | 14.03 | 5.39±0.03 | 20.03
w/o | w/ | w/ | 14.22 | 5.25±0.02 | 19.84
w/ | w/ | w/ | 12.94 | 5.41±0.02 | 19.01

The best results are highlighted in italic
In Fig. 11, we show the images generated by the Baseline (Base.), RII-GAN w/o RIIN, RII-GAN w/o DFAD, RII-GAN w/o AAB, and RII-GAN on the CUB-Bird dataset ((a)-(e)) and the MS-COCO dataset ((f)-(j)). As shown in Fig. 11, our RII-GAN performs best in image generation quality and semantic consistency, which is more evident on the CUB-Bird dataset. By ablating each of AAB, DFAD, and RIIN, it can be observed that the model w/o RIIN has the greatest effect on the quality of the generated images (e.g., (b), (c), and (j) of Fig. 11). This is because RIIN directly interacts with the generator to learn the real image feature manifold: it reversely encodes the features of the real image, and these are aligned with the features encoded forward by the generator. Besides, AAB and DFAD also contribute significantly to the model's ability to generate images that are more realistic and richer in semantic detail (“tomatoes, egg” in (g), “forks” in (i), and “lot” in (j) of Fig. 11) and object layout (“water tower” in (h) of Fig. 11). Specifically, AAB adaptively updates the text embedding, which helps the affine module capture the text information acquired by the generator, thereby enriching the semantic detail of the model. Additionally, the dual-channel structure of DFAD enhances the robustness of the discrimination process, as it more effectively identifies the critical matched semantics of text and image and dismisses redundant information.
The dual-channel feature selection scheme of DFAD. To further explore the impact of the dual-channel feature selection and feature alignment of DFAD on the generated images, we conducted the experiments shown in Table 3. When the alignment structure of the discriminator is removed, the FID score increases from 12.94 to 15.92 with no change in IS. This demonstrates that the feature alignment process significantly affects the quality of the generated images. We also found that removing either modal feature channel (image channel or text channel) has a significant impact on the model's performance. As indicated in Table 3, the FID score increases by 0.63 and the IS decreases by 0.11 (compared with our DFAD in the first row) when only the image channel is removed, while the FID score increases by 3.97 and the IS increases by 0.06 after removing the text channel. The experimental results demonstrate that the dual-channel structure and feature alignment work together to achieve better results.
Table 3
Evaluate the dual-channel feature selection scheme in DFAD

Architectures | FID \( \downarrow \) | IS \( \uparrow \)
DFAD | 12.94 | 5.41±0.02
DFAD w/o Aligning | 15.92 | 5.41±0.01
DFAD w/o Image Channel | 13.57 | 5.30±0.03
DFAD w/o Text Channel | 16.91 | 5.47±0.05

The best results are highlighted in italic
Multi-scale feature alignment mechanism of RIIN. In addition, we investigated the effect of the multi-scale feature alignment scheme in RIIN. We carried out different combinations of feature alignment and summarize the results in Table 4. Rows 2-4 correspond to removing one branch from RIIN. According to the comparison results, the optimization benefit of feature alignment increases sequentially with increasing spatial resolution. This phenomenon demonstrates that the reversed interaction process at large resolutions helps the network generate more realistic images.
Table 4
Evaluation of multi-scale feature alignment mechanism in RIIN

Architecture | FID \(\downarrow \) | IS \(\uparrow \)
RIIN | 12.94 | 5.41±0.02
RIIN w/o \(8\times 8\) branch | 13.12 | 5.37±0.04
RIIN w/o \(32\times 32\) branch | 13.45 | 5.32±0.01
RIIN w/o \(128\times 128\) branch | 14.10 | 5.27±0.01

The best results are highlighted in italic
The configuration and quantities of AAB in UP Blocks. As presented in Table 5, we evaluate the impact of different numbers of AABs in RII-GAN on the CUB-Bird dataset. The ‘Stages’ column in Table 5 corresponds to the number of UP Blocks with AABs, where, in this experiment, each UP Block comprises two AABs as shown in Fig. 2. UP Blocks are accumulated in ascending resolution order, and so are the AABs in each UP Block. The generator needs six UP Blocks since images are generated from \(4\times 4\) to \(256\times 256\); hence, we conduct experiments for stages 1 to 6. For example, the sub-experiment for stage 5 has only the first five UP Blocks containing AABs. It can be observed that as the number of UP Blocks containing AABs increases, the model's FID decreases and its IS increases. This trend indicates a continuous enhancement in model performance as the number of AABs increases up to the maximum, which proves their effectiveness.
Table 5
Evaluation of how the performance is affected by different numbers of AABs used in the RII-GAN. The stages indicate the number of UP Blocks that contain AABs in the proposed RII-GAN

Architecture | Stages | FID \(\downarrow \) | IS \(\uparrow \)
AAB | 1 | 13.33 | 5.32±0.04
AAB | 2 | 13.43 | 5.36±0.03
AAB | 3 | 13.02 | 5.29±0.01
AAB | 4 | 13.15 | 5.39±0.04
AAB | 5 | 13.01 | 5.38±0.02
AAB | 6 | 12.94 | 5.41±0.02

The best results are highlighted in italic
Another experiment is conducted to evaluate the number of AABs in each UP Block of the generator, as presented in Table 6. Fixing the total number of affine blocks at four in each UP Block, which already ensures sufficient text-to-image fusion [29, 34], we vary the number of AABs while appropriately allocating the base affine blocks to find the optimal configuration. The experimental models are: (i) one AAB within each UP Block, consisting of four base affine blocks; (ii) two AABs within each UP Block, each consisting of two base affine blocks; (iii) three AABs within each UP Block, where the first two AABs incorporate a single base affine block and the last AAB consists of two base affine blocks; and (iv) four AABs within each UP Block, each incorporating one base affine block. From the experimental results, we observe that two AABs in each UP Block gives the optimal quantitative results for both FID and IS. The FID of the optimal configuration reaches 12.94, slightly lower than the second-best 13.13 obtained with three AABs; the IS scores of the other three models are at a similar level, while that of the chosen model is relatively high, reaching 5.41. This is mainly because this architecture establishes a balance that preserves model depth while avoiding excessive network complexity.
Table 6
Evaluation of how the performance is affected by different numbers of AABs used in each UP Block

Architecture | Numbers | FID \(\downarrow \) | IS \(\uparrow \)
AAB | 1 | 14.10 | 5.12±0.03
AAB | 2 | 12.94 | 5.41±0.02
AAB | 3 | 13.13 | 5.22±0.04
AAB | 4 | 13.72 | 5.19±0.01

The best results are highlighted in italic

5 Limitation and Discussion

Our RII-GAN presents notable advantages over existing methods, but a few limitations may lead to bad cases. To reduce computational overhead, our approach favors global sentence embeddings from the BiLSTM [25] over word embeddings for text-to-image fusion. This, however, may hinder semantic disentangling when generating images for complex scene text. Specifically, in generation tasks involving complex scene text, there are several main causes of bad cases. First, complex scene descriptions may consist of multiple discrete objects, actions, or concepts interacting in varied ways; in this case, the global embedding might not preserve the unique semantics of each component, leading to generated images that are blurred or lack precise representation. In addition, despite theoretically mitigating the vanishing gradient problem, BiLSTM can exhibit a diminished capacity for capturing long-range dependencies in practice, so critical semantic connections between words or phrases situated far apart in the text might not be sufficiently integrated into the global sentence embedding. As shown in Fig. 12, for complicated sentences there can be issues such as omitted words (“orange beak” and “table” in columns 1 and 4), blurred backgrounds (column 2), and artifacts (column 3). These issues occur more often on MS-COCO, where sentences describe more complex scenes. For further improvement, alternative strategies such as attention mechanisms or transformer models, which can more effectively handle complex semantics and long-range dependencies, could be applied.
The proposed auxiliary network module RIIN enhances the generator's training path by processing the actual image distribution. It is designed with a relatively basic architecture in consideration of computational resources. Amplifying RIIN would more precisely capture and encode real image features at various resolutions; the network could then better align these features with the intermediate-layer features of the generator, further integrating true image feature information into the optimization of the generator. Future studies aiming to improve feature alignment and overall model performance may focus on this structure, although it would require additional computational resources.

6 Conclusion

In this work, we propose a novel T2I framework to synthesize realistic images that are semantically consistent with the given text description. The proposed RII-GAN differs from previous methods: it uses RIIN to optimize the training path of the generator and provides the generator with a reversed image feature manifold structure so that the generator has more opportunities to learn the actual image distribution. The AAB modules in the proposed adaptive affine-based generator establish a compelling connection between independent affine modules and improve existing T2I affine transformation methods, which lack interaction with the feature information of the generator. Moreover, we developed a DFAD to capture the key matching semantics of text embeddings and image features. Qualitative and quantitative analyses on two standard datasets demonstrate that our RII-GAN can generate more semantically related and diversified images. Future work may include two directions: (i) synthesizing objects in complex scenes and (ii) cross-linguistic T2I synthesis based on the reversed image interaction process.

Acknowledgements

The authors would like to thank the anonymous reviewers and the associate editor for their insightful comments that significantly improved the quality of this paper. This work was supported by the National Nature Science Foundation of China under Grant 61872143.

Declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Literature
1. Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. In: Proceedings of the 34th international conference on machine learning, pp 214–223
2.
3. Chen Z, Luo Y (2019) Cycle-consistent diverse image synthesis from natural language. In: 2019 IEEE international conference on multimedia & expo workshops (ICMEW), IEEE, pp 459–464
4. Dosovitskiy A, Beyer L, Kolesnikov A, et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
5.
6. He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778
7. Hessel J, Holtzman A, Forbes M, et al (2021) CLIPScore: a reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718
8. Heusel M, Ramsauer H, Unterthiner T, et al (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in neural information processing systems, pp 6626–6637
10. Li B, Qi X, Lukasiewicz T, et al (2019) Controllable text-to-image generation. In: Advances in neural information processing systems, pp 2065–2075
12. Liao W, Hu K, Yang MY, et al (2022) Text to image generation with semantic-spatial aware GAN. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 18166–18175
14. Lin TY, Maire M, Belongie S, et al (2014) Microsoft COCO: common objects in context. In: European conference on computer vision, pp 740–755
16. Peng D, Yang W, Liu C, et al (2021) SAM-GAN: self-attention supporting multi-stage generative adversarial networks for text-to-image synthesis. Neural Netw 138:57–67
17. Peng J, Zhou Y, Sun X, et al (2022) Knowledge-driven generative adversarial network for text-to-image synthesis. IEEE Trans Multimed 24:4356–4366
18. Qi Z, Sun J, Qian J, et al (2021) PCCM-GAN: photographic text-to-image generation with pyramid contrastive consistency model. Neurocomputing 449:330–341
19. Qiao T, Zhang J, Xu D, et al (2019) MirrorGAN: learning text-to-image generation by redescription. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 1505–1514
20. Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434
21. Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: Proceedings of the 38th international conference on machine learning, pp 8748–8763
22. Reed S, Akata Z, Yan X, et al (2016) Generative adversarial text to image synthesis. In: Proceedings of the 33rd international conference on machine learning, pp 1060–1069
23. Ruan S, Zhang Y, Zhang K, et al (2021) DAE-GAN: dynamic aspect-aware GAN for text-to-image synthesis. In: 2021 IEEE/CVF international conference on computer vision (ICCV), pp 13940–13949
24. Salimans T, Goodfellow I, Zaremba W, et al (2016) Improved techniques for training GANs. In: Advances in neural information processing systems, pp 2234–2242
25. Schuster M, Paliwal K (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673–2681
26. Szegedy C, Vanhoucke V, Ioffe S, et al (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 2818–2826
27. Tan H, Liu X, Li X, et al (2019) Semantics-enhanced adversarial nets for text-to-image synthesis. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 10500–10509
28. Tan H, Liu X, Yin B, et al (2022) DR-GAN: distribution regularization for text-to-image generation. IEEE Trans Neural Netw Learn Syst 34:1–15
29. Tao M, Tang H, Wu F, et al (2022) DF-GAN: a simple and effective baseline for text-to-image synthesis. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 16494–16504
30. de Vries H, Strub F, Mary J, et al (2017) Modulating early visual processing by language. In: Advances in neural information processing systems, pp 6594–6604
31. Wah C, Branson S, Welinder P, et al (2011) The Caltech-UCSD Birds-200-2011 dataset
32. Xu T, Zhang P, Huang Q, et al (2018) AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, pp 1316–1324
33. Yang Y, Wang L, Xie D, et al (2021) Multi-sentence auxiliary adversarial networks for fine-grained text-to-image synthesis. IEEE Trans Image Process 30:2798–2809
35. Yin G, Liu B, Sheng L, et al (2019) Semantics disentangling for text-to-image generation. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 2322–2331
36. Zhang H, Xu T, Li H, et al (2017) StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: 2017 IEEE international conference on computer vision (ICCV), pp 5908–5916
37. Zhang H, Xu T, Li H, et al (2019) StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans Pattern Anal Mach Intell 41(8):1947–1962
38. Zhang H, Koh JY, Baldridge J, et al (2021) Cross-modal contrastive learning for text-to-image generation. In: 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 833–842
39. Zhang Z, Schomaker L (2021) DTGAN: dual attention generative adversarial networks for text-to-image generation. In: 2021 International joint conference on neural networks (IJCNN), pp 1–8
40. Zhang Z, Schomaker L (2022) DiverGAN: an efficient and effective single-stage framework for diverse text-to-image generation. Neurocomputing 473:182–198
41. Zhu M, Pan P, Chen W, et al (2019) DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5795–5803