Introduction
Related works
Model architecture
Wavelet transform based CNN model
Sl. No. | Level | Layer name | Kernel size/No. of filters | Output size |
---|---|---|---|---|
1 | L1 | Conv 1_1 | 3x3/64 | 256x256x64 |
2 | L1 | Conv 1_2 | 3x3/64 | 256x256x64 |
3 | L1 | Maxpool1 | 2x2/64/stride 2 | 128x128x64 |
4 | L2 | Conv 2_1 | 3x3/128 | 128x128x128 |
5 | L2 | Conv 2_2 | 3x3/128 | 128x128x128 |
6 | L2 | Maxpool2 | 2x2/128/stride 2 | 64x64x128 |
7 | L3 | Conv 3_1 | 5x5/256 | 64x64x256 |
8 | L3 | Conv 3_2 | 5x5/256 | 64x64x256 |
9 | L3 | Conv 3_3 | 5x5/256 | 64x64x256 |
10 | L3 | Maxpool3 | 2x2/256/stride 2 | 32x32x256 |
11 | L4 | Conv 4_1 | 7x7/512 | 32x32x512 |
12 | L4 | Conv 4_2 | 7x7/512 | 32x32x512 |
13 | L4 | Conv 4_3 | 7x7/512 | 32x32x512 |
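The output sizes in the layer table can be checked with a small shape trace. The sketch below is only an illustration under assumptions the table does not state: a 256x256x3 input, stride-1 convolutions with "same" padding, and unpadded 2x2 max pooling with stride 2. The names `LAYERS`, `conv_out`, `pool_out`, and `trace_shapes` are hypothetical helpers, not part of the authors' code.

```python
# Trace feature-map shapes through the layers listed in the table.
# Assumptions (not stated in the table): 256x256x3 input, stride-1
# convolutions with "same" padding, unpadded 2x2/stride-2 max pooling.

LAYERS = [
    # (name, kind, kernel size, number of filters/channels)
    ("Conv 1_1", "conv", 3, 64),
    ("Conv 1_2", "conv", 3, 64),
    ("Maxpool1", "pool", 2, 64),
    ("Conv 2_1", "conv", 3, 128),
    ("Conv 2_2", "conv", 3, 128),
    ("Maxpool2", "pool", 2, 128),
    ("Conv 3_1", "conv", 5, 256),
    ("Conv 3_2", "conv", 5, 256),
    ("Conv 3_3", "conv", 5, 256),
    ("Maxpool3", "pool", 2, 256),
    ("Conv 4_1", "conv", 7, 512),
    ("Conv 4_2", "conv", 7, 512),
    ("Conv 4_3", "conv", 7, 512),
]


def conv_out(size, kernel, stride=1, padding=None):
    """Spatial size after a convolution; padding defaults to 'same' for odd kernels."""
    if padding is None:
        padding = (kernel - 1) // 2
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after unpadded max pooling."""
    return (size - kernel) // stride + 1


def trace_shapes(height=256, width=256, channels=3):
    """Return [(layer name, (H, W, C))] after each layer in LAYERS."""
    shapes = []
    h, w, c = height, width, channels
    for name, kind, kernel, filters in LAYERS:
        if kind == "conv":
            h, w, c = conv_out(h, kernel), conv_out(w, kernel), filters
        else:
            # Pooling changes only the spatial size, not the channel count.
            h, w = pool_out(h, kernel), pool_out(w, kernel)
        shapes.append((name, (h, w, c)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes():
        print(f"{name:9s} -> {'x'.join(str(v) for v in shape)}")
```

Under these assumptions the trace reproduces every "Output size" entry in the table, which suggests the larger 5x5 and 7x7 kernels at L3 and L4 are padded so that only the pooling layers reduce resolution.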
Visual attention predictor network
Contextual spatial relation extractor
Experiments and results
Datasets and performance evaluation metrics used
Implementation details
Analysis for the selection of appropriate mother wavelet
Mother wavelet | Flickr8K B@4 | Flickr8K CD | Flickr30K B@4 | Flickr30K CD | MSCOCO B@4 | MSCOCO CD |
---|---|---|---|---|---|---|
BM | 24.43 | 58.31 | 23.68 | 57.89 | 35.78 | 118.02 |
db1 | 25.77 | 59.37 | 24.87 | 58.91 | 36.57 | 119.84 |
db4 | 25.86 | 59.56 | 25.01 | 59.14 | 36.82 | 119.95 |
bior1.5 | 26.34 | 60.58 | 25.30 | 60.13 | 37.14 | 120.41 |
bior2.4 | 26.18 | 60.52 | 25.32 | 60.02 | 37.01 | 120.16 |
bior3.5 | 26.04 | 60.19 | 25.03 | 59.84 | 36.89 | 120.03 |
bior5.5 | 25.85 | 59.92 | 24.84 | 59.77 | 36.77 | 119.98 |
Coif2 | 25.96 | 59.77 | 24.97 | 59.52 | 36.79 | 119.64 |
Coif5 | 26.08 | 59.63 | 24.82 | 59.03 | 36.62 | 119.58 |
Sym2 | 25.81 | 59.68 | 24.73 | 58.78 | 36.81 | 119.80 |
Sym4 | 24.97 | 59.72 | 24.61 | 58.65 | 36.73 | 119.63 |
Analysis for the choice of DWT decomposition levels
Decomposition levels | MSCOCO B@4 | MSCOCO MT | MSCOCO CD |
---|---|---|---|
1-level | 52.87 | 35.14 | 90.39 |
2-level | 53.64 | 36.53 | 91.71 |
3-level | 53.69 | 36.90 | 91.89 |
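To make the notion of decomposition level concrete, the following sketch applies a multi-level 2D DWT by repeatedly transforming the approximation subband, which is the standard multi-level scheme. It uses the Haar wavelet (db1 in the mother-wavelet table) purely because it is the simplest to write out by hand; it is not meant to represent the wavelet or the implementation used for the reported results. `haar_dwt2` and `haar_wavedec2` are hypothetical names, and sign conventions for the detail subbands vary between libraries.

```python
def haar_dwt2(image):
    """One level of the orthonormal 2D Haar (db1) DWT.

    `image` is a list of equal-length rows with even height and width.
    Returns (approx, row_diff, col_diff, diag), each half the size per side.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image sides must be even")
    approx, row_diff, col_diff, diag = [], [], [], []
    for r in range(0, rows, 2):
        a_row, r_row, c_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            # 2x2 block: a b on the top row, d e on the bottom row
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            a_row.append((a + b + d + e) / 2.0)  # low-pass in both directions
            r_row.append((a + b - d - e) / 2.0)  # top row minus bottom row
            c_row.append((a - b + d - e) / 2.0)  # left column minus right column
            d_row.append((a - b - d + e) / 2.0)  # diagonal difference
        approx.append(a_row)
        row_diff.append(r_row)
        col_diff.append(c_row)
        diag.append(d_row)
    return approx, row_diff, col_diff, diag


def haar_wavedec2(image, levels):
    """Multi-level decomposition: re-apply the DWT to the approximation subband.

    Returns (final approximation, [detail triples from finest to coarsest]).
    """
    approx = image
    details = []
    for _ in range(levels):
        approx, row_diff, col_diff, diag = haar_dwt2(approx)
        details.append((row_diff, col_diff, diag))
    return approx, details


if __name__ == "__main__":
    img = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
    for levels in (1, 2, 3):
        approx, _ = haar_wavedec2(img, levels)
        print(levels, "level(s): approximation is", len(approx), "x", len(approx[0]))
```

Each additional level halves the approximation's side length, so deeper decompositions expose coarser frequency bands at the cost of spatial resolution.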
Quantitative analysis
Method | B@1 | B@2 | B@3 | B@4 | MT | R | CD |
---|---|---|---|---|---|---|---|
Deep VS [25] | 62.5 | 45.0 | 32.1 | 23.0 | 19.5 | - | 66.0 |
emb-gLSTM [5] | 67.0 | 49.1 | 35.8 | 26.4 | 22.74 | - | 81.25 |
Soft attn [17] | 70.7 | 49.2 | 34.4 | 24.3 | 23.9 | - | - |
Hard attn [17] | 71.8 | 50.4 | 35.7 | 25.0 | 23.04 | - | - |
ATT [55] | 70.9 | 53.7 | 40.2 | 30.4 | 24.3 | - | - |
SCA-CNN [14] | 71.9 | 54.8 | 41.1 | 31.1 | 25.0 | - | - |
LSTM-A [56] | 75.4 | - | - | 35.2 | 26.9 | 55.8 | 108.8 |
Up-down [7] | 77.2 | - | - | 36.2 | 27.0 | 56.4 | 113.5 |
SCST [42] | - | - | - | 34.2 | 26.7 | 55.7 | 114.0 |
RFNet [20] | 76.4 | 60.4 | 46.6 | 35.8 | 27.4 | 56.5 | 112.5 |
GCN-LSTM [57] | 77.4 | - | - | 37.1 | 28.1 | 57.2 | 117.1 |
avtmNet [58] | - | - | - | 33.2 | 27.3 | 56.7 | 112.6 |
ERNN [59] | 73.2 | 56.9 | 42.9 | 32.2 | 25.2 | - | 101.4 |
Tri-LSTM [62] | - | - | - | 37.3 | 28.4 | 58.1 | 123.5 |
TDA+GLD [61] | 78.8 | 62.6 | 48.0 | 36.1 | 27.8 | 57.1 | 121.1 |
Ours | 78.5 | 62.0 | 49.1 | 38.2 | 28.9 | 58.3 | 124.2 |
Method | B@1 | B@2 | B@3 | B@4 | MT |
---|---|---|---|---|---|
Deep VS [25] | 57.9 | 38.3 | 24.5 | 16.0 | - |
emb-gLSTM [5] | 64.7 | 45.9 | 31.8 | 21.2 | 20.6 |
Soft attn [17] | 67.0 | 44.8 | 29.9 | 19.5 | 18.9 |
Hard attn [17] | 67.0 | 45.7 | 31.4 | 21.3 | 20.3 |
SCA-CNN [14] | 68.2 | 49.6 | 35.9 | 25.8 | 22.4 |
Ours | 70.5 | 50.2 | 37.3 | 28.6 | 24.5 |
Method | B@1 | B@2 | B@3 | B@4 | MT | CD |
---|---|---|---|---|---|---|
Deep VS [25] | 57.3 | 36.9 | 24.0 | 15.7 | 15.3 | - |
emb-gLSTM [5] | 64.6 | 44.6 | 30.5 | 20.6 | 17.9 | - |
Soft attn [17] | 66.7 | 43.4 | 28.8 | 19.1 | 18.5 | - |
Hard attn [17] | 66.9 | 43.9 | 29.6 | 19.9 | 18.5 | - |
ATT [55] | 64.7 | 46.0 | 32.4 | 23.0 | 18.9 | - |
SCA-CNN [14] | 66.2 | 46.8 | 32.5 | 22.3 | 19.5 | - |
avtmNet [58] | - | - | - | 24.8 | 20.8 | 59.8 |
Ours | 70.1 | 49.4 | 35.8 | 27.2 | 21.7 | 67.3 |
Qualitative results
Ablation study
Configuration | Cross-Entropy loss B@4 | Cross-Entropy loss CD | Self-Critical loss B@4 | Self-Critical loss CD |
---|---|---|---|---|
WCNN+atr+LSTM | 33.1 | 109.2 | 34.4 | 116.5 |
WCNN+atr+SA+LSTM | 33.9 | 110.8 | 35.7 | 117.9 |
WCNN+atr+SA+CA+LSTM | 35.2 | 112.7 | 36.3 | 119.0 |
WCNN+atr+CA+SA+LSTM | 35.9 | 113.4 | 37.1 | 120.4 |
WCNN+atr+CA+SA+CSE+LSTM | 37.5 | 116.9 | 38.2 | 124.2 |