2021 | Original Paper | Chapter | Open Access
Opening the Black Box: Machine Learning Interpretability and Inference Tools with an Application to Economic Forecasting
Authors: Marcus Buckmann, Andreas Joseph, Helena Robertson
Publisher: Springer International Publishing
1 Introduction
2 Data and Experimental Setup
2.1 Data
Table 1  Predictor variables, the transformation applied to each, and the corresponding series name in the FRED-MD database.

Variable                   Transformation             Name in the FRED-MD database
Unemployment               Changes                    UNRATE
3-month treasury bill      Changes                    TB3MS
Slope of the yield curve   Changes                    –
Real personal income       Log changes                RPI
Industrial production      Log changes                INDPRO
Consumption                Log changes                DPCERA3M086SBEA
S&P 500                    Log changes                S&P 500
Business loans             Second-order log changes   BUSLOANS
CPI                        Second-order log changes   CPIAUCSL
Oil price                  Second-order log changes   OILPRICEx
M2 Money                   Second-order log changes   M2SL

Table 2  Forecasting performance of all models. MAE and RMSE are normalized by the first row (random forest); asterisks denote significance levels.

Model                                  Corr.   MAE               RMSE (normalized by the first row)
                                               01/1990–11/2019   01/1990–11/2019   01/1990–12/1999   01/2000–08/2008   09/2008–11/2019
Random forest                          0.609   1.000             1.000             1.000             1.000             1.000
Neural network                         0.555   1.009             1.049             0.969             0.941             1.114**
Linear regression                      0.521   1.094***          1.082**           1.011             0.959             1.149***
Lasso regression                       0.519   1.094***          1.083***          1.007             0.949             1.156***
Ridge regression                       0.514   1.099***          1.087***          1.019             0.952             1.157***
SVR                                    0.475   1.052             1.105**           1.000             1.033             1.169**
AR                                     0.383   1.082(*)          1.160(***)        1.003             1.010             1.265(***)
Linear regression (lagged response)    0.242   1.163***          1.226***          1.027             1.057             1.352***

2.2 Models

The simple linear lag model uses only the 1-year lag of the outcome variable as a predictor: \(\hat {y}_i = \alpha + \theta _0 y_{i-12}\).

The autoregressive model (AR) uses several lags of the response as predictors: \({\hat {y}_i = \alpha + \sum _{l = 1}^{h} \theta _l y_{i-l}}\). We test AR models with lag order 1 ≤ h ≤ 12, chosen by the Akaike Information Criterion [1].
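The AIC-based lag selection can be sketched in a few lines; this is a minimal illustration using plain least squares, with our own function name and criterion bookkeeping rather than the chapter's code:

```python
import numpy as np

def select_ar_order(y, max_h=12):
    """Choose the AR lag order h (1 <= h <= max_h) by the Akaike Information
    Criterion, AIC = n*log(RSS/n) + 2*(h + 1). Every candidate model is fit
    on the same sample (starting at max_h) so the criteria are comparable."""
    best_h, best_aic = 1, np.inf
    target = y[max_h:]
    n = len(target)
    for h in range(1, max_h + 1):
        # Design matrix: intercept plus the first h lags of the response.
        lags = np.column_stack([y[max_h - l:-l] for l in range(1, h + 1)])
        Z = np.column_stack([np.ones(n), lags])
        beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
        rss = float(np.sum((target - Z @ beta) ** 2))
        aic = n * np.log(rss / n) + 2 * (h + 1)
        if aic < best_aic:
            best_h, best_aic = h, aic
    return best_h
```
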

The full information models use the 1-year lag of the outcome and 1-year lags of the other features as independent variables: \(\hat {y}_i = f(y_{i-12}, x_{i-12})\), where f can be any prediction model. For example, if f is a linear regression, \(f(y_i,x_i) = \alpha + \theta _0y_{i-12} + \sum _{k= 1}^{n} \theta _kx_{i-12,k}\). To simplify notation, we assume in the following that the lagged outcome is included in the feature matrix x. We test six full information models: ordinary least squares regression; Lasso and Ridge regularized regressions [46]; and three machine learning regressors, namely random forest [7], support vector regression [16], and artificial neural networks [22]. ^{5}
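As a concrete instance of the linear case above, a least-squares fit of the full information model can be sketched as follows (the function name and the interface are hypothetical, not the chapter's code):

```python
import numpy as np

def fit_full_information(y, X):
    """Fit y_i = alpha + theta_0 * y_{i-12} + sum_k theta_k * x_{i-12,k}
    by least squares, where X holds one feature row per month.
    Returns a function forecasting y from 12-month-old observations."""
    Z = np.column_stack([np.ones(len(y) - 12), y[:-12], X[:-12]])
    beta, *_ = np.linalg.lstsq(Z, y[12:], rcond=None)

    def predict(y_lag12, x_lag12):
        # Forecast using the outcome and features observed 12 months ago.
        return float(beta[0] + beta[1] * y_lag12 + beta[2:] @ np.asarray(x_lag12))

    return predict
```
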
2.3 Experimental Procedure
3 Forecasting Performance
3.1 Baseline Setting
3.2 Robustness Checks

Window size. In the baseline setup, the training set grows over time (expanding window). This can improve performance over time, as more observations may allow a better approximation of the true data-generating process. On the other hand, it may make the model sluggish and prevent quick adaptation to structural changes. We test sliding windows of 60, 120, and 240 months. Only the simplest model, the linear regression with only a lagged response, benefits from a shorter window; the remaining models perform best with the largest possible training set. This is not surprising for machine learning models, as they can “memorize” different sets of information by incorporating multiple specifications in the same model. For instance, different paths down a tree model, or different trees in a forest, are all different submodels, e.g., characterizing different time periods in our setting. By contrast, a simple linear model cannot adjust in this way and needs to fit the best hyperplane to the current situation, which explains its improved performance for some fixed window sizes.
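The difference between the two schemes comes down to how the training indices are generated for each forecast date; a minimal sketch (helper name and index convention are ours):

```python
def window_splits(n_obs, first_test, window=None):
    """Generate (training indices, test index) pairs for one-month-ahead
    evaluation. window=None reproduces the expanding-window baseline
    (train on everything before the forecast date); an integer window
    (e.g. 60, 120, or 240 months) gives a sliding window of that length."""
    for t in range(first_test, n_obs):
        start = 0 if window is None else max(0, t - window)
        yield list(range(start, t)), t
```
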

Change horizon. In the baseline setup, we use a horizon of 3 months when calculating changes, log changes, and second-order log changes of the predictors (see Table 1). Testing horizons of 1, 6, 9, and 12 months, we find that 3 months generally leads to the best performance across all full information models. This is convenient from a practical point of view, as quarterly changes are one of the main horizons considered for short-term economic projections.
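The three transformations from Table 1 can be written down directly; a minimal sketch at a generic horizon h (function names are ours):

```python
import math

def changes(x, h=3):
    """h-month differences: x_t - x_{t-h}."""
    return [x[t] - x[t - h] for t in range(h, len(x))]

def log_changes(x, h=3):
    """h-month log differences, i.e. approximate h-month growth rates."""
    return [math.log(x[t]) - math.log(x[t - h]) for t in range(h, len(x))]

def second_order_log_changes(x, h=3):
    """h-month differences of the h-month log differences, i.e. changes in
    growth rates (applied e.g. to CPI and the oil price in Table 1)."""
    return changes(log_changes(x, h), h)
```

For a series growing at a constant rate, the log changes are constant and the second-order log changes are zero, which is why the latter are used for series whose growth rate itself trends, such as prices.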

Bootstrap aggregation (bagging). We apply bagging, averaging the predictions of 100 bootstrapped models, to the linear regression, the neural network, and the SVR. The intuition is that our relatively small dataset likely leads to models with high variance, i.e., overfitting; bootstrap aggregation reduces the models’ variance and hence the degree of overfitting. Note that we do not expect much improvement for bagged linear models, as different draws from the training set are likely to lead to similar slope parameters and thus almost identical models. This is confirmed by the almost identical performance of the single and the bagged linear model.
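The bagging step is model-agnostic: any fitting routine can be wrapped. A minimal sketch, with a generic `fit` callback standing in for any of the learners above (names are ours):

```python
import random

def bagged_prediction(fit, train_x, train_y, x_new, n_boot=100, seed=0):
    """Average the predictions of n_boot models, each fit on a bootstrap
    resample (drawn with replacement) of the training data. `fit` maps
    (xs, ys) to a predictor; averaging reduces the variance of
    high-variance, overfit models."""
    rng = random.Random(seed)
    n = len(train_x)
    total = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        model = fit([train_x[i] for i in idx], [train_y[i] for i in idx])
        total += model(x_new)
    return total / n_boot
```
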
4 Model Interpretability
4.1 Methodology
4.1.1 Permutation Importance
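Permutation importance scores a feature by how much the model's prediction error rises when that feature's values are randomly shuffled, which severs the feature's link with the outcome while preserving its marginal distribution. A minimal generic sketch of the standard procedure, not the chapter's own implementation:

```python
import random

def permutation_importance(predict, X, y, error_fn, n_repeats=10, seed=0):
    """For each feature column j, shuffle that column n_repeats times and
    record the average increase in error over the unpermuted baseline.
    X is a list of feature rows; predict maps one row to a forecast."""
    rng = random.Random(seed)
    base = error_fn([predict(row) for row in X], y)
    importances = []
    for j in range(len(X[0])):
        rises = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            rises.append(error_fn([predict(row) for row in X_perm], y) - base)
        importances.append(sum(rises) / n_repeats)
    return importances
```
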
4.1.2 Shapley Values and Regressions
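For a linear model, Shapley values have a closed form that makes the attribution idea concrete: each feature contributes its coefficient times the feature's deviation from its mean over a reference sample, so the attributions sum to the prediction minus the average prediction. A sketch under that linear-model assumption (function name is ours):

```python
def linear_shapley_values(alpha, theta, x_row, X_ref):
    """Shapley attributions for a linear model f(x) = alpha + sum_k theta_k x_k.
    The Shapley value of feature k at point x is theta_k * (x_k - mean_k),
    where mean_k is the feature's average over the reference sample X_ref."""
    means = [sum(col) / len(col) for col in zip(*X_ref)]
    return [t * (x - m) for t, x, m in zip(theta, x_row, means)]
```

A Shapley regression then regresses realized outcomes on such per-feature attributions, which yields the coefficients and p-values reported in Sect. 4.2.2.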
4.2 Results
4.2.1 Feature Importance
4.2.2 Shapley Regressions
Table  Shapley regression results for the random forest and the linear regression: coefficients β^S, their p-values, and share coefficients Γ^S.

                            Random forest                   Linear regression
Variable                    β^S      p-value   Γ^S          β^S      p-value   Γ^S
Industrial production       0.626    0.000     −0.228***    0.782    0.000     −0.163***
S&P 500                     0.671    0.000     −0.177***    0.622    0.000     −0.251***
Consumption                 1.314    0.000     −0.177***    2.004    0.000     −0.115***
Unemployment                1.394    0.000     +0.112***    2.600    0.010     +0.033***
Business loans              2.195    0.000     −0.068***    2.371    0.024     −0.031**
3-month treasury bill       1.451    0.008     −0.066***   −1.579    1.000     −0.102
Personal income            −0.320    0.749     +0.044      −0.244    0.730     +0.089
Oil price                   1.589    0.018     −0.040**    −0.246    0.624     −0.052
M2 Money                    0.168    0.363     −0.034      −4.961    0.951     −0.011
Yield curve slope           1.952    0.055     +0.029*      0.255    0.171     +0.132
CPI                         0.245    0.419     −0.024      −0.790    0.673     −0.022