1 Introduction
- We release a highly modular open-source framework called LOBCAST,1 to pre-process data, train, and test stock market models. Our framework employs the latest DL libraries to provide all researchers with an easy-to-use, performant, and maintainable solution. Furthermore, to support future studies, we release two meta-learning models and a backtesting environment for profit analysis.
- We evaluate existing LOB-based stock market trend predictors, showing that most of them overfit the FI-2010 dataset, with remarkably lower performance on unseen stock data.
- To guide model selection in real-world applications, we evaluate the sensitivity of the models to the data-labelling parameters, compare the performance of DL and non-DL models, and evaluate and discuss the financial performance of existing models under different market scenarios.
- We discuss the strengths and limitations of existing methodology and identify areas for future research toward more reliable, robust, and reproducible approaches to stock market prediction.
2 Related work
3 The stock price trend prediction problem
3.1 Limit order book (LOB)
3.2 Trend definition
Each sample is assigned one of three classes: U (“upward”) if the price trend is increasing; D (“downward”) for decreasing prices; and S (“stable”) for prices with negligible variations. Among all the possible single values, mid-prices provide the most reliable indication of the actual stock price for equity markets. Nevertheless, because of the market’s inherent fluctuations and shocks, they can exhibit highly volatile trends. For this reason, using a direct comparison of consecutive mid-prices, i.e., \(m(t)\) and \(m(t+1)\), for stock price labelling would result in a noisily labelled dataset. As a result, labelling strategies typically employ smoother mid-price functions instead of raw mid-prices. Such functions consider mid-prices over arbitrarily long time intervals, called horizons. Our experiments adopt the labelling proposed in Ntakaris et al. (2018) and reused in several other SOTA solutions we selected for benchmarking. The adopted labelling strategy compares the current mid-price to the average mid-price \(a^+(k, t)\) over a future horizon of k time units, formally:

\(a^+(k, t) = \frac{1}{k}\sum_{i=1}^{k} m(t+i), \qquad l(k, t) = \frac{a^+(k, t) - m(t)}{m(t)},\)

where the sample at time t is labelled U if \(l(k, t) > \theta\), D if \(l(k, t) < -\theta\), and S otherwise, for a chosen threshold \(\theta\).

3.3 Models I/O
The models take as input a window of LOB snapshots and output a probability distribution over the three classes U, D, and S.
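The labelling strategy of Sect. 3.2 can be sketched in a few lines of NumPy. The horizon k and threshold \(\theta\) are the parameters discussed above; the price series and parameter values below are purely illustrative:

```python
import numpy as np

def label_trends(mid_prices, k, theta):
    """Label each time step as U (+1), S (0), or D (-1) by comparing the
    current mid-price to the average mid-price over the next k steps."""
    m = np.asarray(mid_prices, dtype=float)
    labels = []
    for t in range(len(m) - k):
        a_plus = m[t + 1 : t + k + 1].mean()   # smoothed future mid-price a+(k, t)
        change = (a_plus - m[t]) / m[t]        # relative change l(k, t)
        if change > theta:
            labels.append(1)    # U: upward
        elif change < -theta:
            labels.append(-1)   # D: downward
        else:
            labels.append(0)    # S: stable
    return np.array(labels)

# Example: a rising, a flat, and a falling stretch of mid-prices
prices = [100, 101, 102, 103, 103, 103, 103, 102, 101, 100]
print(label_trends(prices, k=2, theta=0.005))
```

Averaging over the horizon rather than comparing consecutive mid-prices is what keeps the labels from flipping on every micro-fluctuation.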
4 Models
4.1 Summary of models
Model | Reference | Temporal shape (h) | Features shape | Code available | Nr. trainable parameters | Inference time (ms)
---|---|---|---|---|---|---
MLP (2017) | Tsantekidis et al. (2017a) | 100 | 40 | ✗ | \(10^6\) | 0.08
LSTM (2017) | Tsantekidis et al. (2017a) | 100 | 40 | ✗ | \(1.6\cdot 10^4\) | 0.21
CNN1 (2017) | Tsantekidis et al. (2017b) | 100 | 40 | ✗ | \(3.5\cdot 10^4\) | 0.36
CTABL (2018) | Tran et al. (2018) | 10 | 40 | TensorFlow | \(1.1\cdot 10^4\) | 0.48
DEEPLOB (2019) | Zhang et al. (2019) | 100 | 40 | PyTorch | \(1.4\cdot 10^5\) | 1.31
DAIN (2019) | Passalis et al. (2019) | 15 | 144 | PyTorch | \(2.1\cdot 10^6\) | 0.15
CNNLSTM (2020) | Tsantekidis et al. (2020) | 300 | 42 | ✗ | \(5.3\cdot 10^4\) | 0.50
CNN2 (2020) | Tsantekidis et al. (2020) | 300 | 40 | ✗ | \(2.8\cdot 10^5\) | 0.49
TRANSLOB (2020) | Wallbridge (2020) | 100 | 40 | TensorFlow | \(1.1\cdot 10^5\) | 2.40
TLONBoF (2020) | Passalis et al. (2020) | 15 | 144 | PyTorch | \(6.5\cdot 10^5\) | 0.43
BINCTABL (2021) | Tran et al. (2021) | 10 | 40 | ✗ | \(1.1\cdot 10^4\) | 0.71
DEEPLOBATT (2021) | Zhang et al. (2021) | 50 | 40 | TensorFlow | \(1.8\cdot 10^5\) | 1.73
DLA (2022) | Guo and Chen (2022) | 5 | 144 | ✗ | \(1.2\cdot 10^5\) | 0.23
ATNBoF (2022) | Tran et al. (2022) | 100 | 40 | PyTorch | \(1.3\cdot 10^7\) | 3.90
AXIALLOB (2022) | Kisiel and Gorse (2022) | 40 | 40 | ✗ | \(2\cdot 10^4\) | 1.91
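The "Nr. trainable parameters" column can be sanity-checked for fully-connected architectures with a quick count. The hidden-layer sizes below are hypothetical (the original papers' exact architectures vary); they are chosen only to show how counts of the order of \(10^6\) arise from a flattened \(100 \times 40\) input window:

```python
def mlp_param_count(layer_sizes):
    """Trainable parameters of a fully-connected net: each layer
    contributes (n_in * n_out) weights plus n_out biases."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical MLP: flattened 100x40 LOB window -> hidden layers -> 3 classes
sizes = [100 * 40, 256, 64, 3]
print(mlp_param_count(sizes))  # on the order of 10^6
```

The same bookkeeping (weights plus biases per layer) underlies the parameter counts reported for the convolutional and recurrent models, with layer-specific formulas.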
4.2 Ensemble methods
5 Datasets
5.1 FI-2010 to test robustness
Horizon k | Train Set { U,S,D } (%) | Val Set { U,S,D } (%) | Test Set { U,S,D } (%)
---|---|---|---
1 | \(20-60-20\) | \(19-63-18\) | \(15-71-14\)
2 | \(26-49-25\) | \(24-52-24\) | \(20-62-18\)
3 | \(30-41-29\) | \(27-46-27\) | \(23-56-21\)
5 | \(35-30-35\) | \(32-37-31\) | \(28-47-25\)
10 | \(41-18-41\) | \(37-26-37\) | \(34-34-32\)
As the prediction horizon k grows, S is progressively less predominant in favour of the upward and downward classes. In our experimental campaign, the class imbalance is deliberately not addressed, to guarantee a fair robustness evaluation, since the considered works do not claim to have addressed it either.

Stock | Daily Return (%) | Hourly Return (%) | Market Cap. | P/E Ratio | Train Set { U,S,D } (%), \(k=5\) | Train Set Share (%), \(k=5\)
---|---|---|---|---|---|---
SOFI | \(-2.3 \pm 3.1\) | \(-0.3 \pm 1.2\) | \(4.26 \cdot 10^9\) | \(-27.84\) | \(41-19-40\) | 14.8
NFLX | \(0.6 \pm 1.7\) | \(0.05 \pm 0.6\) | \(1.58 \cdot 10^{11}\) | 38.28 | \(45-5-50\) | 21.7
CSCO | \(0.2 \pm 0.7\) | \(0.02 \pm 0.4\) | \(2 \cdot 10^{11}\) | 17.59 | \(18-65-17\) | 46.2
WING | \(-0.3 \pm 3.2\) | \(-0.04 \pm 0.9\) | \(6.06 \cdot 10^{9}\) | 96.87 | \(44-7-49\) | 6.1
SHLS | \(-2.4 \pm 4.9\) | \(-0.3 \pm 1.9\) | \(4.05 \cdot 10^{9}\) | 26.24 | \(42-14-44\) | 7.4
LSTR | \(0.1 \pm 2.8\) | \(-0.03 \pm 0.73\) | \(6.16 \cdot 10^{9}\) | 16.55 | \(48-5-47\) | 3.8
Horizon k | LOB-2021 Train Set { U,S,D } (%) | LOB-2021 Val Set { U,S,D } (%) | LOB-2021 Test Set { U,S,D } (%) | LOB-2022 Train Set { U,S,D } (%) | LOB-2022 Val Set { U,S,D } (%) | LOB-2022 Test Set { U,S,D } (%)
---|---|---|---|---|---|---
1 | \(18-63-19\) | \(19-62-19\) | \(21-59-20\) | \(20-60-20\) | \(18-64-18\) | \(18-63-19\)
2 | \(25-50-25\) | \(25-50-25\) | \(27-46-27\) | \(26-47-27\) | \(24-51-25\) | \(25-50-25\)
3 | \(28-43-29\) | \(28-43-29\) | \(30-40-30\) | \(30-40-30\) | \(28-43-29\) | \(29-42-29\)
5 | \(32-35-33\) | \(32-35-33\) | \(34-33-33\) | \(34-31-35\) | \(33-34-33\) | \(34-32-34\)
10 | \(37-25-38\) | \(37-25-38\) | \(38-24-38\) | \(41-28-41\) | \(40-20-40\) | \(41-18-41\)
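The { U, S, D } percentages in the tables above can be computed directly from the label vectors produced by the labelling strategy of Sect. 3.2. A minimal sketch (the +1/0/−1 class encoding is an assumption for illustration):

```python
import numpy as np

def class_distribution(labels):
    """Return the percentage of U, S, D labels (encoded as +1, 0, -1)."""
    labels = np.asarray(labels)
    total = len(labels)
    return {cls: round(100 * np.sum(labels == code) / total, 1)
            for cls, code in (("U", 1), ("S", 0), ("D", -1))}

print(class_distribution([1, 1, 0, 0, 0, 0, -1, -1, 0, 1]))
```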
5.2 LOB-2021/2022 to test generalizability
5.2.1 Stocks selection
To select representative stocks, we performed a clustering analysis over a low-dimensional embedding of stock features (t-SNE) to capture stock differences in the years 2021-2023. We used the following features: daily return, hourly return, volatility, outstanding shares, P/E ratio, and market capitalization. The P/E ratio is the ratio between the price of a stock (P) and the company’s annual earnings per share (E). The analysis led to the identification of the 6 stocks nearest to the cluster centroids of the generated 3-dimensional latent space. The stocks make up the set \({\mathcal {S}}=\{\)SOFI, NFLX, CSCO, WING, SHLS, LSTR\(\}\). Table 3 reports the main features of these stocks for the period of July 2021. The selected stocks have widely varying average daily returns, with SHLS the minimum and NFLX the maximum. Daily and hourly returns highlight that some stocks are more volatile than others. The market capitalization represents the total value of the outstanding common shares held by stockholders. The stocks also show different class balancing in the training set: CSCO is the stock with the largest imbalance toward the stable class, whereas NFLX and LSTR are the most unbalanced towards the directional (up and down) classes. In Sect. 6, we analyze the reasons behind the class imbalance specific to individual stocks and discuss its impact. The mid-price movements for these two periods and the selected stocks are depicted in Fig. 3.

5.2.2 Stock processing
5.3 Data distribution shift
6 Experiments
6.1 LOBCAST framework for SPTP
6.1.1 Applications and features
6.2 Hyperparameters search
Model | FI-2010 (Robustness) | | | | | LOB-2021/2022 (Generalizability) | | | |
 | Learning Rate | Optimizer | Batch Size | Epochs | Dropout | Learning Rate | Optimizer | Batch Size | Epochs | Dropout
---|---|---|---|---|---|---|---|---|---|---
LSTM | 0.001 | Adam | 32 | 100 | – | 0.0001 | Adam | 64 | 100 | – |
MLP | 0.001 | Adam | 64 | 100 | – | 0.00001 | Adam | 64 | 100 | – |
CNN1 | 0.0001 | Adam | 64 | 100 | – | 0.0001 | Adam | 32 | 100 | – |
CTABL | 0.01 | Adam | 256 | 200 | – | 0.001 | Adam | 64 | 200 | – |
DAIN | 0.0001 | RMSprop | 32 | 100 | 0.5 | 0.0001 | RMSprop | 64 | 100 | 0.5 |
DEEPLOB | 0.01 | Adam | 32 | 100 | – | 0.01 | Adam | 32 | 100 | – |
CNNLSTM | 0.001 | RMSprop | 32 | 20 | 0.1 | 0.001 | RMSprop | 128 | 100 | 0.1 |
CNN2 | 0.001 | RMSprop | 32 | 100 | – | 0.001 | RMSprop | 128 | 100 | – |
TRANSLOB | 0.0001 | Adam | 32 | 150 | – | 0.001 | Adam | 128 | 100 | – |
TLONBoF | 0.0001 | Adam | 128 | 100 | – | 0.00001 | Adam | 32 | 100 | – |
BINCTABL | 0.001 | Adam | 128 | 200 | – | 0.001 | Adam | 32 | 200 | – |
DEEPLOBATT | 0.001 | Adam | 32 | 100 | – | 0.0001 | Adam | 128 | 100 | – |
AXIALLOB | 0.01 | SGD | 64 | 50 | – | 0.01 | SGD | 64 | 50 | – |
ATNBoF | 0.001 | Adam | 128 | 80 | 0.2 | 0.00001 | Adam | 32 | 80 | 0.2 |
DLA | 0.01 | Adam | 256 | 100 | – | 0.001 | Adam | 64 | 100 | – |
METALOB | 0.0001 | SGD | 64 | 100 | – | 0.0001 | SGD | 64 | 100 | – |
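The configurations above were selected via a search over candidate hyperparameter values. A minimal grid-search skeleton illustrates the procedure; the candidate lists mirror the table's ranges, while `grid_search` and the `fake_score` stub are illustrative placeholders, not LOBCAST's actual API:

```python
from itertools import product

def grid_search(train_and_validate, grid):
    """Try every combination in `grid` (a dict of candidate lists) and
    return the configuration with the best validation score."""
    best_score, best_cfg = float("-inf"), None
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_and_validate(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score

# Candidate values mirroring the table's ranges (illustrative)
grid = {
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2],
    "optimizer": ["Adam", "RMSprop", "SGD"],
    "batch_size": [32, 64, 128, 256],
}

# Stub scorer standing in for a real training + validation run
def fake_score(cfg):
    return -abs(cfg["learning_rate"] - 1e-3) - (cfg["batch_size"] != 64)

best_cfg, _ = grid_search(fake_score, grid)
print(best_cfg)
```

In practice each call to the scorer is a full training run, so the grid is kept small and the search is parallelized across configurations.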
6.3 Performance, robustness and generalizability
6.3.1 Robustness on FI-2010
The classes { U, S, D } are distributed as \(\{37\%, 25\%, 38\%\}\). Since the models perform a ternary classification, each curve represents the micro-averaged precision-recall values and is generated by varying the classification threshold. Thresholds affect the number of false negatives and false positives, and hence the resulting Precision and Recall. The best models are the ones with the largest area under the curve, as they make the most accurate predictions (high Precision) while minimizing the false negative rate (high Recall). The figure also shows the iso-F1 curves on the PR plane. The best-performing model is BINCTABL, with an area under the curve of 86.80.

6.3.2 Generalizability on LOB-2021/2022
Model | \(k = 1\) | | | | \(k = 2\) | | | | \(k = 3\) | | | | \(k = 5\) | | | | \(k = 10\) | | |
 | FI-2010 | FI\(^r\)-2010 | LOB-2021 | LOB-2022 | FI-2010 | FI\(^r\)-2010 | LOB-2021 | LOB-2022 | FI-2010 | FI\(^r\)-2010 | LOB-2021 | LOB-2022 | FI-2010 | FI\(^r\)-2010 | LOB-2021 | LOB-2022 | FI-2010 | FI\(^r\)-2010 | LOB-2021 | LOB-2022
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
MLP | 48.3 | 48.2 | 48.3 | 51.1 | 51.1 | 44.0 | 56.2 | 54.1 | – | 47.2 | 58.2 | 55.9 | 56.0 | 49.0 | 59.2 | 55.0 | – | 51.6 | 55.4 | 49.3
LSTM | 66.3 | 66.5 | 49.6 | 53.7 | 62.4 | 58.8 | 58.0 | 57.4 | – | 65.3 | 60.3 | 60.6 | 61.4 | 66.9 | 60.6 | 56.2 | – | 59.4 | 56.0 | 52.6
CNN1 | 55.2 | 49.3 | 52.5 | 55.3 | 59.2 | 46.1 | 57.7 | 59.8 | – | 62.3 | 60.2 | 59.3 | 59.4 | 65.8 | 60.1 | 58.5 | – | 67.2 | 56.7 | 52.6
CTABL | 77.6 | 69.5 | 55.3 | 57.8 | 66.9 | 62.4 | 60.7 | 60.9 | – | 70.4 | 62.2 | 60.8 | 78.4 | 71.6 | 62.2 | 58.8 | – | 73.9 | 57.8 | 52.0
DEEPLOB | 83.4 | 71.1 | 55.0 | 57.0 | 72.8 | 62.4 | 60.4 | \(\mathbf{62.0}\) | – | 70.8 | 62.7 | \(\mathbf{62.4}\) | 80.4 | 75.4 | 62.2 | 60.8 | – | 77.6 | 57.4 | 55.2
DAIN | 68.3 | 53.9 | 47.7 | 52.2 | 65.3 | 46.7 | 56.6 | 54.9 | – | 53.5 | 59.1 | 55.8 | – | 61.2 | 60.0 | 56.5 | – | 62.8 | 56.1 | 51.2
CNNLSTM | 47.0 | 63.5 | 51.8 | 55.0 | – | 49.1 | 58.1 | 59.8 | – | 63.3 | 59.9 | 59.2 | 47.0 | 69.2 | 60.1 | 57.1 | 47.0 | 71.0 | 55.3 | 53.1
CNN2 | 46.0 | 27.6 | 49.9 | 51.9 | – | 35.4 | 55.9 | 59.0 | – | 53.2 | 58.9 | 58.7 | 45.0 | 67.9 | 58.8 | 57.3 | 44.0 | 68.5 | 54.0 | 52.0
TRANSLOB | \(\mathbf{88.7}\) | 61.4 | 53.8 | 43.7 | \(\mathbf{80.6}\) | 54.7 | 57.8 | 43.0 | – | 59.8 | 60.7 | 57.5 | \(\mathbf{88.2}\) | 60.6 | 60.3 | 56.6 | \(\mathbf{91.6}\) | 60.5 | 55.8 | 51.0
TLONBoF | 53.0 | 36.5 | 52.5 | 53.1 | – | 51.7 | 58.0 | 56.5 | – | 41.6 | 60.1 | 57.1 | – | 52.4 | 59.9 | 55.7 | – | 66.2 | 56.0 | 48.5
BINCTABL | 81.0 | \(\mathbf{81.1}\) | \(\mathbf{57.0}\) | \(\mathbf{58.4}\) | 71.2 | \(\mathbf{71.5}\) | \(\mathbf{62.4}\) | \(\mathbf{62.0}\) | – | \(\mathbf{80.8}\) | \(\mathbf{63.9}\) | 62.2 | 88.1 | \(\mathbf{87.7}\) | \(\mathbf{63.5}\) | 60.4 | – | \(\mathbf{92.1}\) | \(\mathbf{59.1}\) | 53.2
DEEPLOBATT | 82.4 | 70.6 | 54.8 | 55.8 | 73.7 | 54.8 | 61.1 | 60.5 | 76.9 | 66.0 | 62.6 | 62.1 | 79.4 | 73.6 | 62.8 | \(\mathbf{60.9}\) | 81.5 | 71.6 | 59.0 | \(\mathbf{55.3}\)
DLA | 77.8 | 79.4 | 51.2 | 54.4 | – | 69.3 | 58.6 | 58.0 | 79.4 | 78.9 | 61.3 | 60.0 | 79.0 | 87.1 | 60.3 | 57.3 | – | 52.2 | 57.1 | 53.4
ATNBoF | 67.9 | 32.9 | 49.8 | 47.8 | 60.0 | 34.2 | 53.1 | 50.3 | – | 38.2 | 54.6 | 41.3 | 73.4 | 48.1 | 57.2 | 59.8 | – | 51.0 | 50.9 | 40.9
AXIALLOB | 85.1 | 73.2 | 54.0 | 56.9 | 75.8 | 63.4 | 60.7 | 60.1 | \(\mathbf{80.1}\) | 72.8 | 62.6 | 62.0 | 83.3 | 78.3 | 62.4 | 59.6 | 85.9 | 79.2 | 57.8 | 54.6
METALOB | – | 81.1 | 51.1 | 52.3 | – | 70.5 | 56.1 | 53.3 | – | 80.3 | 57.8 | 55.3 | – | 87.5 | 58.4 | 54.5 | – | 91.8 | 56.0 | 50.9
MAJORITY | – | 47.1 | 51.8 | 50.6 | – | 44.9 | 56.2 | 49.2 | – | 59.7 | 57.5 | 48.1 | – | 71.8 | 56.9 | 46.7 | – | 76.3 | 55.2 | 44.7
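The MAJORITY baseline in the table aggregates the individual predictors by per-sample majority voting. A minimal sketch (the integer class encoding is an assumption for illustration; ties are broken in favour of the lowest class id):

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (n_models, n_samples) array of class ids, e.g.
    {0: D, 1: S, 2: U}. Returns the most frequent class per sample."""
    preds = np.asarray(predictions)
    n_classes = preds.max() + 1
    # Count votes per class for each sample (column), then take the argmax
    counts = np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_classes), 0, preds)
    return counts.argmax(axis=0)

votes = [[2, 1, 0],
         [2, 1, 1],
         [0, 1, 0]]  # three models, three samples
print(majority_vote(votes))
```

METALOB goes one step further: rather than counting votes, it learns how to weight the constituent models' outputs.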
6.3.3 Ensemble method discussion
6.4 Additional experiments: labeling, non-DL models & profit
6.4.1 Labelling
6.4.2 Random forest & XGBoost for SPTP
Random Forest:

Hyper Parameter | Values
---|---|
n_estimators | [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000] |
max_depth | [10, 25, 50, 75, 100] |
min_samples_leaf | [1, 2, 4] |
min_samples_split | [2, 5, 10] |
temporal shape (h) | [5, 10, 15, 50, 100, 300] |
XGBoost:

Hyper Parameter | Values
---|---|
n_estimators | [100, 250, 500, 750, 1000, 1250] |
max_depth | [3, 4, 5, 6, 7, 8, 9, 10] |
booster | gbtree |
eta | [0.01, 0.1, 0.2, 0.3, 0.4] |
min_child_weight | [0, 2, 4, 6, 8] |
colsample_bytree | [0.5, 0.75] |
colsample_bylevel | [0.5, 0.75] |
temporal shape (h) | [5, 10, 15, 50, 100, 300] |
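The Random Forest grid above can be explored with scikit-learn's `RandomizedSearchCV`; note that `temporal shape (h)` is a data-windowing choice handled in preprocessing, not an estimator parameter. A sketch on synthetic stand-in data (the real search runs on the labelled LOB windows):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))        # stand-in for flattened LOB windows
y = rng.integers(0, 3, size=200)      # ternary labels {D, S, U}

# Candidate values from the Random Forest grid above (subset for speed)
param_distributions = {
    "n_estimators": [200, 400, 600, 800, 1000],
    "max_depth": [10, 25, 50, 75, 100],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=3,        # small budget for illustration
    cv=2,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

The XGBoost grid is handled analogously, substituting the booster-specific parameters (`eta`, `min_child_weight`, `colsample_bytree`, `colsample_bylevel`) into the same search loop.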