1 Introduction
- We propose a reversed image interaction network that helps the generator learn the manifold structure of authentic image features through multi-scale feature alignment constraints. This reduces the generator's reliance on the input text description and provides an alternative generation scheme for GAN-based multimodal generation tasks.
- We explore an adaptive affine-based (AAB) generator, which introduces an adaptive block and an adaptive updating scheme for text-image fusion.
- We propose a dual-channel feature alignment discriminator, which extracts textual and visual features simultaneously, one channel per modality, and applies a feature alignment strategy to improve semantic consistency.
2 Related Works
2.1 GAN-Based Text-to-Image Synthesis
2.2 Text-to-Image Fusion
2.3 Feature Alignment in Text and Image
3 Methodology
3.1 Reversed Image Interaction Network: RIIN
3.2 Adaptive Affine-Based Generator
3.3 Dual-Channel Feature Alignment Discriminator: DFAD
4 Experimental Results
4.1 Datasets
4.2 Evaluation Metrics
4.3 Implementation Details
4.4 Parametric Sensitivity Analysis

Methods | CUB-Bird FID\(\downarrow \) | CUB-Bird IS\(\uparrow \) | CUB-Bird CS\(\uparrow \) | MS-COCO FID\(\downarrow \) | MS-COCO CS\(\uparrow \) |
---|---|---|---|---|---|
AttnGAN (CVPR’18) | 23.98 | 4.26±.03 | 70.83 | 35.49 | 52.58 |
StackGAN++ (TPAMI’18) | 15.30 | 4.04±.06 | – | 81.59 | – |
MirrorGAN (CVPR’19) | 18.34 | 4.56±.05 | – | 34.71 | – |
DM-GAN (CVPR’19) | 16.09 | 4.75±.07 | 71.91 | 32.64 | 53.23 |
PCCM-GAN (Neuro’21) | 22.15 | 4.65±.20 | – | 33.59 | – |
SAM-GAN (Neural Network’21) | 20.49 | 4.61±.03 | – | 33.41 | – |
DTGAN (IJCNN’21) | 16.35 | 4.88±.03 | – | 23.61 | – |
DAE-GAN (ICCV’21) | 15.29 | 4.42±.04 | 71.8 | 28.12 | 54.04 |
KD-GAN (TMM’21) | 13.89 | 4.90±.06 | – | 23.92 | – |
SSA-GAN (CVPR’22) | 15.61 | 5.17±.08 | – | 19.37 | – |
DR-GAN (TNNLS’22) | 14.96 | 4.90±.05 | – | 27.8 | – |
DF-GAN (CVPR’22) | 14.81 | 5.10±. – | 70.63 | 19.32 | 61.91 |
DiverGAN (Neuro’22) | 15.63 | 4.98±.06 | – | 20.52 | – |
RII-GAN (Ours) | 12.94 | 5.41±.02 | 70.85 | 19.01 | 60.28 |
4.5 Quantitative Results
4.6 Qualitative Results
4.7 User Evaluation
4.8 Performance Analysis
4.9 Ablation Study

RIIN | DFAD | AAB | CUB-Bird FID\(\downarrow \) | CUB-Bird IS\(\uparrow \) | MS-COCO FID\(\downarrow \) |
---|---|---|---|---|---|
w/o | w/o | w/o | 15.89 | 5.12±0.01 | 23.47 |
w/ | w/ | w/o | 13.25 | 5.37±0.03 | 21.23 |
w/ | w/o | w/ | 14.03 | 5.39±0.03 | 20.03 |
w/o | w/ | w/ | 14.22 | 5.25±0.02 | 19.84 |
w/ | w/ | w/ | 12.94 | 5.41±0.02 | 19.01 |
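
As a reading aid for the table above, the contribution of each component can be sketched as the FID increase observed when that single component is removed from the full model. The snippet only reuses numbers from the table; the names `ABLATION`, `COMPONENTS`, and `fid_gap_when_removed` are hypothetical helpers introduced here for illustration and are not part of the authors' code.

```python
# Rows copied from the ablation table.
# Key: (RIIN, DFAD, AAB) enabled flags. Value: (CUB FID, CUB IS, COCO FID).
ABLATION = {
    (False, False, False): (15.89, 5.12, 23.47),
    (True, True, False): (13.25, 5.37, 21.23),
    (True, False, True): (14.03, 5.39, 20.03),
    (False, True, True): (14.22, 5.25, 19.84),
    (True, True, True): (12.94, 5.41, 19.01),
}
COMPONENTS = ("RIIN", "DFAD", "AAB")
FULL = (True, True, True)


def fid_gap_when_removed(component, dataset="cub"):
    """Return FID(model without `component`) minus FID(full model).

    A positive value means that removing the component makes FID worse.
    `dataset` is "cub" for CUB-Bird or "coco" for MS-COCO.
    """
    idx = COMPONENTS.index(component)
    # Configuration with every component enabled except the chosen one.
    config = tuple(i != idx for i in range(len(COMPONENTS)))
    col = 0 if dataset == "cub" else 2
    return round(ABLATION[config][col] - ABLATION[FULL][col], 2)


if __name__ == "__main__":
    for name in COMPONENTS:
        print(name, fid_gap_when_removed(name, "cub"), fid_gap_when_removed(name, "coco"))
```

With these numbers, removing RIIN raises FID by 1.28 on CUB-Bird and 0.83 on MS-COCO, removing DFAD raises it by 1.09 and 1.02, and removing AAB raises it by 0.31 and 2.22.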

Architecture | FID \(\downarrow \) | IS \(\uparrow \) |
---|---|---|
DFAD | 12.94 | 5.41±0.02 |
DFAD w/o Aligning | 15.92 | 5.41±0.01 |
DFAD w/o Image Channel | 13.57 | 5.30±0.03 |
DFAD w/o Text Channel | 16.91 | 5.47±0.05 |

Architecture | FID \(\downarrow \) | IS \(\uparrow \) |
---|---|---|
RIIN | 12.94 | 5.41±0.02 |
RIIN w/o \(8\times 8\) branch | 13.12 | 5.37±0.04 |
RIIN w/o \(32\times 32\) branch | 13.45 | 5.32±0.01 |
RIIN w/o \(128\times 128\) branch | 14.10 | 5.27±0.01 |

Architecture | Stages | FID \(\downarrow \) | IS \(\uparrow \) |
---|---|---|---|
AAB | 1 | 13.33 | 5.32±0.04 |
AAB | 2 | 13.43 | 5.36±0.03 |
AAB | 3 | 13.02 | 5.29±0.01 |
AAB | 4 | 13.15 | 5.39±0.04 |
AAB | 5 | 13.01 | 5.38±0.02 |
AAB | 6 | 12.94 | 5.41±0.02 |

Architecture | Numbers | FID \(\downarrow \) | IS \(\uparrow \) |
---|---|---|---|
AAB | 1 | 14.10 | 5.12±0.03 |
AAB | 2 | 12.94 | 5.41±0.02 |
AAB | 3 | 13.13 | 5.22±0.04 |
AAB | 4 | 13.72 | 5.19±0.01 |