1 Introduction
- A novel deep learning architecture that employs the logic of NARX models in QAR to forecast the popularity of new products that lack historical data. We compare several QAR variants: CNN, LSTM, ConvLSTM, Feedback-LSTM (F-LSTM), Transformer, and DA-RNN.
- Integration of image captioning into the multimodal module to capture contextual and positional relations among the products' attributes.
- A new large-scale fashion dataset that includes popularity scores per demographic group, enabling specialized forecasts for different market segments.
- An extensive ablation study on three datasets (two fashion, one home decoration) to assess the validity and generalization of the proposed methodology. A comparative study on a fourth fashion-related dataset shows that our model surpasses the domain's state of the art by 4.65% in WAPE and 4.8% in MAE.
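The WAPE and MAE figures reported throughout the paper follow the standard definitions; a minimal sketch (function names are ours):

```python
import numpy as np

def wape(y_true, y_pred):
    # Weighted Absolute Percentage Error: total absolute error
    # normalised by the total absolute actual value.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()

def mae(y_true, y_pred):
    # Mean Absolute Error.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).mean()
```

Both metrics are scale-dependent on the targets, which is why the paper reports them per dataset rather than pooling across datasets.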
2 Related work
3 Methodology
3.1 FusionMLP
3.2 QAR
3.3 Prediction component
4 Experimental setup
4.1 Evaluation protocol
4.2 Datasets
4.2.1 VISUELLE
4.2.2 SHIFT15m
4.2.3 Mallzee-P
4.2.4 Amazon reviews: home and kitchen
4.3 Implementation details
| Model | Layer | \(n_{Layer}\) | \(u_{Layer}\) |
|---|---|---|---|
| LSTM | lstm | 1, 2, 3 | (512, 256, 128) |
| CNN | cnn | 1, 2, 3 | (512, 256, 128) |
| DA-RNN | lstm | 2 | 64 or 128 |
| ConvLSTM | cnn | 1, 2, 3 | (512, 256, 128) |
| | lstm | 1, 2, 3 | (512, 256, 128) |
| F-LSTM | lstm | 1, 2, 3 | (512, 256, 128) |
| | mlp | 0, 1 | 256 or 512 |
| Transformer | block | 1, 2, 3 | 128, 256 or 512 |
| | head | 2 or 4 | |
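As an illustration of one grid point from the table above (an LSTM with three stacked layers of 512, 256, and 128 units), a PyTorch sketch follows. The class and prediction head here are our own illustrative names, not the authors' implementation:

```python
import torch
import torch.nn as nn

class StackedLSTM(nn.Module):
    """Illustrative 3-layer LSTM matching the (512, 256, 128)
    configuration from the hyper-parameter grid."""
    def __init__(self, n_features, units=(512, 256, 128)):
        super().__init__()
        layers, in_size = [], n_features
        for u in units:
            layers.append(nn.LSTM(in_size, u, batch_first=True))
            in_size = u
        self.lstms = nn.ModuleList(layers)
        self.head = nn.Linear(units[-1], 1)  # scalar popularity forecast

    def forward(self, x):                    # x: (batch, timesteps, n_features)
        for lstm in self.lstms:
            x, _ = lstm(x)                   # keep full sequence between layers
        return self.head(x[:, -1])           # predict from the last time step

model = StackedLSTM(n_features=10)
out = model(torch.randn(4, 6, 10))           # out: (4, 1)
```

The other QAR variants in the grid follow the same pattern, swapping the recurrent layers for convolutional, ConvLSTM, or Transformer blocks.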
| Input | Model | MAE(\(\downarrow\)) MLZ-P | MAE(\(\downarrow\)) SHIFT15m | PCC(\(\uparrow\)) MLZ-P | PCC(\(\uparrow\)) SHIFT15m | Accuracy(\(\uparrow\)) MLZ-P | Accuracy(\(\uparrow\)) SHIFT15m | Accuracy(\(\uparrow\)) Amazon | AUC(\(\uparrow\)) Amazon |
|---|---|---|---|---|---|---|---|---|---|
| [A] | LR | 0.1878 | 0.1162 | 0.2439 | 0.3177 | 63.10 | 59.58 | 48.52 | 65.41 |
| | CNN | \(\underline{0.1611}\) | 0.1148 | \(\underline{0.5379}\) | 0.3406 | \(\underline{70.99}\) | 61.51 | 47.18 | 69.34 |
| | LSTM | 0.1656 | 0.1150 | 0.5109 | 0.3371 | 69.67 | 61.42 | 45.58 | 67.54 |
| | F-LSTM | 0.1809 | 0.1149 | 0.3395 | 0.3376 | 64.62 | 61.43 | 44.95 | 67.89 |
| | Transformer | 0.1842 | 0.1149 | 0.3071 | 0.3398 | 63.67 | 61.28 | \(\underline{51.10}\) | \(\underline{71.29}\) |
| | ConvLSTM | 0.1641 | \(\underline{0.1147}\) | 0.5225 | \(\underline{0.3411}\) | 69.98 | \(\underline{61.58}\) | 46.58 | 68.59 |
| [A+X] | ConvLSTM+X | \(\underline{0.1686}\) | \(\underline{0.1185}\) | \(\underline{0.4913}\) | \(\underline{0.2191}\) | \(\underline{68.86}\) | 59.50 | 60.61 | \(\underline{78.95}\) |
| | DA-RNN | 0.1863 | 0.1187 | 0.2652 | 0.2050 | 64.39 | \(\underline{59.55}\) | 60.61 | 78.90 |
| [I] | LR | 0.1599 | 0.1186 | 0.5314 | 0.1940 | 71.86 | 57.93 | 41.46 | 68.18 |
| | FusionMLP | 0.1074 | 0.1148 | 0.7893 | 0.2811 | 81.52 | 60.89 | 46.96 | 71.69 |
| [I+C] | FusionMLP | 0.1073 | – | 0.7879 | – | 81.30 | – | – | – |
| [I+A] | MuQAR | 0.0949 | 0.1100 | 0.8362 | 0.3934 | 83.41 | 63.57 | 51.51 | 74.24 |
| [I+A+C] | MuQAR | 0.0928 | – | 0.8474 | – | 83.69 | – | – | – |
| [I+A+X+C] | MuQAR | 0.0911 | 0.1118* | 0.8484 | 0.3448* | 84.26 | 62.30* | 60.63* | 80.40* |
5 Results
5.1 Ablation analysis
5.1.1 QAR
IN: 52, OUT: 6

| Method | Input | WAPE(\(\downarrow\)) | MAE(\(\downarrow\)) |
|---|---|---|---|
| GTM-Transformer [25] | [T] | 62.6 | 34.2 |
| Attribute KNN [8] | | 59.8 | 32.7 |
| FusionMLP | | \(\underline{55.15}\) | \(\underline{30.12}\) |
| Image KNN [8] | [I] | 62.2 | 34.0 |
| GTM-Transformer [25] | | 56.4 | 30.8 |
| FusionMLP | | \(\underline{54.59}\) | \(\underline{29.82}\) |
| Transformer | [A] | 62.5 | 34.1 |
| LSTM | | 58.7 | 32.0 |
| ConvLSTM | | 58.6 | 32.0 |
| GTM-Transformer [25] | | 58.2 | 31.8 |
| F-LSTM | | 58.0 | 31.7 |
| CNN | | \(\underline{57.4}\) | \(\underline{31.4}\) |
| ConvLSTM+X | [A + X] | \(\underline{55.73}\) | \(\underline{30.44}\) |
| DA-RNN | | 58.05 | 31.71 |
| Attribute + Image KNN [8] | [T + I] | 61.3 | 33.5 |
| Cross-Attention RNN [8] | | 59.5 | 32.3 |
| GTM-Transformer [25] | | 56.7 | 30.9 |
| FusionMLP | | \(\underline{54.11}\) | \(\underline{29.56}\) |
| FusionMLP | [T + I + C] | 53.50 | 29.22 |
| GTM-Transformer AR [25] | [T + I + A] | 59.6 | 32.5 |
| Cross-Attention RNN+A [8] | | 59.0 | 32.1 |
| GTM-Transformer [25] | | 55.2 | 30.2 |
| MuQAR w/ Transformer | | 54.87 | 29.97 |
| MuQAR w/ F-LSTM | | 54.37 | 29.7 |
| MuQAR w/ LSTM | | 54.3 | 29.66 |
| MuQAR w/ CNN | | 53.9 | 29.44 |
| MuQAR w/ ConvLSTM | | \(\underline{53.61}\) | \(\underline{29.28}\) |
| MuQAR w/ DA-RNN | [T + I + A + X + C] | 54.43 | 29.73 |
| MuQAR w/ ConvLSTM+X | | \(\underline{{\textbf{52.63}}}\) | \(\underline{{\textbf{28.75}}}\) |