2021 | Chapter | Open Access
Do the Hype of the Benefits from Using New Data Science Tools Extend to Forecasting Extremely Volatile Assets?
Published in: Data Science for Economics and Finance
1 Introduction
Over the past few years, the hype surrounding terms ranging from big data to data science to machine learning has increased from already high levels. This hype arises in part from three sets of discoveries. First, machine learning tools have repeatedly been shown in the academic literature to outperform statistical and econometric techniques for forecasting. Further, tools developed in the natural language processing literature to extract population sentiment measures have also been found to help forecast the value of financial indices. This set of findings is consistent with arguments in the behavioral finance literature (see [23], among others) that the sentiment of investors can influence stock market activity. Last, issues surrounding data security and privacy have grown among the population as a whole, leading governments to consider blockchain technology for uses beyond what it was initially developed for.
Blockchain technology was originally developed for the cryptocurrency Bitcoin, an asset that can be continuously traded and whose value has been quite volatile. This volatility may present further challenges for forecasts by either machine learning algorithms or econometric strategies. Adding to these challenges is that, unlike almost every other financial asset, Bitcoin is traded on both weekends and holidays. As such, modeling the estimated daily realized variance of Bitcoin in US dollars presents an additional challenge: many measures of conventional economic and financial data commonly used as predictors are not collected at the same points in time. However, since the behavioral finance literature has linked population sentiment measures to the prices of different financial assets, we propose measuring and incorporating social media sentiment as an explanatory variable in the forecasting model. As an explanatory predictor, social media sentiment can be measured continuously, providing an opportunity to capture and forecast the variation in the prices at which trades for Bitcoin are made.
In this chapter, we consider forecasts of Bitcoin realized volatility, first to provide an illustration of the benefits, in terms of forecast accuracy, of using machine learning relative to traditional econometric strategies. While prior work contrasting forecasting approaches found that machine learning provides gains primarily by relaxing the functional form assumptions made explicit when writing down an econometric model, those studies did not consider predicting an outcome that exhibits volatility of the magnitude of Bitcoin's.
Determining strategies that can improve volatility forecasts is of significant value since such forecasts have come to play a large role in decisions ranging from asset allocation to derivative pricing and risk management. That is, volatility forecasts are used by traders as a component of the valuation procedure for any risky asset (e.g., stocks and bonds), since the procedure requires assessing the level and riskiness of future payoffs. Further, their value to many investors arises when using a strategy that adjusts portfolio holdings to equate the risk stemming from the different investments included in a portfolio. As such, more accurate volatility forecasts can provide valuable actionable insights for market participants. Finally, additional motivation for determining how to obtain more accurate forecasts comes from the financial media, who frequently report on market volatility since it is hypothesized to have an impact on public confidence and can thereby have a significant effect on the broader global economy.
There are many approaches that could potentially be used to construct volatility forecasts, but each requires an estimate of volatility. At present, the most popular method used in practice to estimate volatility was introduced by Andersen and Bollerslev [1], who proposed using the realized variance, which is calculated as the cumulative sum of squared intraday returns over short time intervals during the trading day. Realized volatility possesses a slowly decaying autocorrelation function, sometimes known as long memory. Various econometric models have been proposed to capture the stylized facts of these high-frequency time series, including the autoregressive fractionally integrated moving average (ARFIMA) models of Andersen et al. [3] and the heterogeneous autoregressive (HAR) model proposed by Corsi [11]. Compared with the ARFIMA model, the HAR model rapidly gained popularity, in part due to its computational simplicity and excellent out-of-sample forecasting performance.
In our empirical exercise, we first use well-established machine learning techniques within the HAR framework to explore the benefits of allowing for general nonlinearities with recursive partitioning methods, as well as sparsity using the least absolute shrinkage and selection operator (LASSO) of Tibshirani [39]. We consider alternative ensemble recursive partitioning methods, including bagging and random forest, which each place equal weight on all observations when making a forecast, as well as boosting, which weights observations based on the degree of fit. In total, we evaluate nine conventional econometric methods and five easy-to-implement machine learning methods to model and forecast the realized variance of Bitcoin measured in US dollars.
Studies in the financial econometric literature have reported that a number of different variables are potentially relevant for forecasting future volatility. A secondary goal of our empirical exercise is to determine if there are gains in forecast accuracy of realized volatility from incorporating a measure of social media sentiment. We contrast forecasts using models that both include and exclude social media sentiment. This additional exercise allows us to determine if this measure provides information that is not captured by either the asset-specific realized volatility histories or other explanatory variables that are often included in the information set.
Specifically, in our application social media sentiment is measured by adopting a deep learning algorithm introduced in [17]. We use a random sample of 10% of all tweets posted from users based in the United States from the Twitterverse, collected at the minute level. This allows us to calculate a sentiment score that is an equal tweet weight average of the sentiment values of the words within each tweet in our sample at the minute level. It is well known that there are substantial intraday fluctuations in social media sentiment, but its weekly and monthly aggregates are much less volatile. This intraday volatility may capture important information, and it presents an additional challenge when using this measure for forecasting, since the Bitcoin realized variance is measured at the daily level, a much lower time frequency than the minute-level sentiment index that we refer to as the US Sentiment Index (USSI). Rather than make ad hoc assumptions on how to aggregate the USSI to the daily level, we follow Lehrer et al. [28] and adopt the heterogeneous mixed data sampling (H-MIDAS) method, which constructs empirical weights to aggregate the high-frequency social media data to a lower frequency.
Our analysis illustrates that sentiment measures extracted from Twitter can significantly improve forecasting efficiency. The gains in forecast accuracy, as measured by pseudo R-squared, increased by over 50% when social media sentiment was included in the information set for all of the machine learning and econometric strategies considered. Moreover, using four different criteria for forecast accuracy, we find that the machine learning techniques considered tend to outperform the econometric strategies and that these gains arise from incorporating nonlinearities. Among the 16 methods considered in our empirical exercise, both bagging and random forest yield the highest forecast accuracy. Results from the test of [18] indicate that the improvements each of these two algorithms offers are statistically significant at the 5% level, yet the difference between the two algorithms is indistinguishable.
For practitioners, our empirical exercise also examines the sensitivity of our findings to the hyperparameter choices made when implementing any machine learning algorithm. This provides value since setting the hyperparameters of a machine learning algorithm is analogous to model selection in econometrics. For example, with the random forest algorithm, numerous hyperparameters can be adjusted by the researcher, including the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain, and the number of trees. Further, Probst and Boulesteix provide evidence that the benefits from tuning hyperparameters differ across machine learning algorithms and are higher for support vector regression than for the random forest algorithm we employ. In our analysis, the default hyperparameter values specified in software packages work reasonably well, but we stress the caveat that our investigation was not exhaustive, so specific combinations of hyperparameters for each algorithm may exist that would change the ordering of forecast accuracy in the empirical horse race presented. That is, there may be a set of hyperparameters under which the winning algorithms perform distinguishably differently from the others they are compared to.
This chapter is organized as follows. In the next section, we briefly describe Bitcoin. Sections 3 and 4 provide a more detailed overview of existing HAR strategies as well as conventional machine learning algorithms. Section 5 describes the data we utilize and explains how we measure and incorporate social media data into our empirical exercise. Section 6 presents our main empirical results that compare the forecasting performance of each method introduced in Sects. 3 and 4 in a rolling window exercise. To focus on whether social media sentiment data adds value, we contrast the results of incorporating the USSI variable in each strategy with those of excluding this variable from the model. For every estimator considered, we find that incorporating the USSI variable as a covariate leads to significant improvements in forecast accuracy. We examine the robustness of our results in Sect. 7 by considering (1) different experimental settings, (2) different hyperparameters, and (3) incorporating covariates on the value of mainstream assets. We find that our main conclusions are robust to changes in both the hyperparameters and the various settings, and that there is little benefit from incorporating mainstream asset markets when forecasting the realized volatility of Bitcoin. Section 8 concludes by providing additional guidance to practitioners to ensure that they can gain the full value of the hype for machine learning and social media data in their applications.
2 What Is Bitcoin?
Bitcoin, the first and still by far one of the most popular applications of blockchain technology, was introduced in 2008 by a person or group of people known by the pseudonym Satoshi Nakamoto. Blockchain technology allows digital information to be distributed but not copied. Basically, a timestamped series of immutable records of data is managed by a cluster of computers not owned by any single entity. Each of these blocks of data (i.e., a block) is secured and bound to the others using cryptographic principles (i.e., a chain). The blockchain network has no central authority, and all information on the immutable ledger is shared. The information on the blockchain is transparent, and each individual involved is accountable for their actions.
The group of participants who uphold the blockchain network ensure that it can be neither hacked nor tampered with. Additional units of currency are created by the nodes of a peer-to-peer network using a generation algorithm that ensures decreasing supply, designed to mimic the rate at which gold was mined. Specifically, when a user/miner discovers a new block, they are currently awarded 12.5 Bitcoins. However, the number of new Bitcoins generated per block is set to decrease geometrically, with a 50% reduction every 210,000 blocks. The amount of time it takes to find a new block can vary based on mining power and the network difficulty. This process is why Bitcoin can be treated by investors as an asset, and it ensures that causes of inflation, such as a central authority printing more currency or imposing capital controls, cannot take place. The latter monetary policy actions motivated the use of Bitcoin, the first cryptocurrency, as a replacement for fiat currencies.
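The geometric supply schedule described above can be sketched in a few lines. This is an illustrative calculation, not Bitcoin Core's integer-based implementation, and the function name is ours:

```python
# Illustrative sketch of Bitcoin's block-subsidy schedule: the reward
# starts at 50 BTC and halves every 210,000 blocks (names are ours).
def block_subsidy(height: int) -> float:
    """Mining reward in BTC for a block at the given height."""
    halvings = height // 210_000
    return 50.0 / (2 ** halvings)

# Summing rewards over the halving epochs shows why the total supply
# approaches (and never exceeds) 21 million BTC.
total_supply = sum(210_000 * block_subsidy(epoch * 210_000)
                   for epoch in range(64))
```

The geometric decay means each 210,000-block epoch contributes half the coins of the previous one, so the cumulative supply converges.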
Bitcoin is distinguished from other major asset classes by its basis of value, governance, and applications. Bitcoin can be converted to a fiat currency using a cryptocurrency exchange, such as Coinbase or Kraken, among other online options. These online marketplaces are similar to the platforms that traders use to buy stock. In September 2015, the Commodity Futures Trading Commission (CFTC) in the United States officially designated Bitcoin as a commodity. Furthermore, the Chicago Mercantile Exchange in December 2017 launched a Bitcoin future (XBT) option, using Bitcoin as the underlying asset. Although there are emerging crypto-focused funds and other institutional investors, this market remains dominated by retail investors.
There is substantial volatility in BTC/USD, and the sharp price fluctuations in this digital currency greatly exceed that of most other fiat currencies. Much research has explored why Bitcoin is so volatile; our interest is strictly to examine different empirical strategies to forecast this volatility, which greatly exceeds that of other assets including most stocks and bonds.
3 Bitcoin Data and HARType Strategies to Forecast Volatility
The price of Bitcoin is often reported to experience wild fluctuations. We follow Xie [42], who evaluates model averaging estimators with data on the Bitcoin price in US dollars (henceforth BTC/USD) at a 5-minute frequency between May 20, 2015, and August 20, 2017. This data was obtained from Poloniex, one of the largest US-based digital asset exchanges. Following Andersen and Bollerslev [1], we estimate the daily realized volatility at day t (RV_{t}) by summing the corresponding M equally spaced intradaily squared returns r_{t,j}. Here, the subscript t indexes the day, and j indexes the time interval within day t:
$$\displaystyle \begin{aligned} \text{RV}_{t}\equiv \sum_{j=1}^{M}r_{t,j}^{2}, \end{aligned} $$
(1)
where t = 1, 2, …, n, j = 1, 2, …, M, and r_{t,j} is the difference between log-prices p_{t,j} (r_{t,j} = p_{t,j} − p_{t,j−1}). Poloniex is an active exchange that is always in operation, every minute of each day in the year. We define a trading day using Eastern Standard Time and, with this data, calculate the realized volatility of BTC/USD for 775 days. The evolution of the RV data over this full sample period is presented in Fig. 1.
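Eq. (1) is straightforward to compute from a day's intraday log-prices; the sketch below uses simulated 5-minute data (all variable names and parameters are illustrative, not from the chapter's dataset):

```python
import numpy as np

def realized_variance(log_prices: np.ndarray) -> float:
    """RV_t: cumulative sum of squared intraday log returns, as in Eq. (1)."""
    r = np.diff(log_prices)        # r_{t,j} = p_{t,j} - p_{t,j-1}
    return float(np.sum(r ** 2))

rng = np.random.default_rng(0)
# 288 five-minute intervals in a 24-hour Bitcoin trading day -> 289 log-prices
log_p = np.cumsum(rng.normal(0.0, 0.001, size=289))
rv = realized_variance(log_p)
```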
In this section, we introduce some HAR-type strategies that are popular in modeling volatility. The standard HAR model of Corsi [11] postulates that the h-step-ahead daily RV_{t+h} can be modeled by
$$\displaystyle \begin{aligned} \log \text{RV}_{t+h}=\beta _{0}+\beta _{d}\log \text{RV}_{t}^{(1)}+\beta _{w}\log \text{RV}_{t}^{(5)}+\beta _{m}\log \text{RV}_{t}^{(22)}+e_{t+h}, {} \end{aligned} $$
(2)
where the βs are the coefficients and {e_{t}}_{t} is a zero mean innovation process. The explanatory variables take the general form of \(\log \text{RV}_{t}^{(l)}\), defined as the l-period average of daily log RV:
$$\displaystyle \begin{aligned} \log \text{RV}_{t}^{(l)}\equiv l^{-1}\sum_{s=1}^{l}\log \text{RV}_{t-s}. \end{aligned} $$
Another popular formulation of the HAR model in Eq. (2) ignores the logarithmic form and considers
$$\displaystyle \begin{aligned} \text{RV}_{t+h}=\beta _{0}+\beta _{d}\text{RV}_{t}^{(1)}+\beta _{w}\text{RV} _{t}^{(5)}+\beta _{m}\text{RV}_{t}^{(22)}+e_{t+h}, {} \end{aligned} $$
(3)
where \(\text{RV}_{t}^{(l)}\equiv l^{-1}\sum _{s=1}^{l}\text{RV}_{t-s}\).
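The HAR regressors \(\text{RV}_{t}^{(l)}\) are simple lagged averages, so the model can be estimated by OLS once the three averages are aligned. A sketch on simulated data (variable names are ours; the lag alignment is simplified for illustration):

```python
import numpy as np

def har_average(rv: np.ndarray, l: int) -> np.ndarray:
    """RV_t^{(l)} = l^{-1} * sum_{s=1}^{l} RV_{t-s}; NaN until l lags exist."""
    out = np.full(rv.shape, np.nan)
    for t in range(l, len(rv)):
        out[t] = rv[t - l:t].mean()
    return out

rng = np.random.default_rng(1)
rv = np.abs(rng.normal(1.0, 0.2, size=300))      # stand-in for daily RV
X = np.column_stack([np.ones_like(rv), har_average(rv, 1),
                     har_average(rv, 5), har_average(rv, 22)])
keep = ~np.isnan(X).any(axis=1)
# OLS fit of the daily/weekly/monthly components, as in Eq. (3)
beta, *_ = np.linalg.lstsq(X[keep], rv[keep], rcond=None)
```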
In an important paper, Andersen et al. [4] extend the standard HAR model from two perspectives. First, they added a daily jump component (J_{t}) to Eq. (3). The extended model is denoted as the HAR-J model:
$$\displaystyle \begin{aligned} \text{RV}_{t+h}=\beta _{0}+\beta _{d}\text{RV}_{t}^{(1)}+\beta _{w}\text{RV} _{t}^{(5)}+\beta _{m}\text{RV}_{t}^{(22)}+\beta ^{j}\text{J}_{t}+e_{t+h}, {} \end{aligned} $$
(4)
where the empirical measurement of the squared jumps is \(\text{J}_{t}=\max (\text{RV}_{t}-\text{BPV}_{t},0)\) and the standardized realized bipower variation (BPV) is defined as
$$\displaystyle \begin{aligned} \text{BPV}_{t}\equiv (2/\pi )^{-1}\sum_{j=2}^{M}\vert r_{t,j-1}\vert \vert r_{t,j}\vert . \end{aligned}$$
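The jump measure of the HAR-J model can be computed from the same intraday returns used for RV; a minimal sketch (function names are ours):

```python
import numpy as np

def bipower_variation(r: np.ndarray) -> float:
    """Standardized BPV_t = (2/pi)^{-1} * sum_{j>=2} |r_{t,j-1}| |r_{t,j}|."""
    return float((np.pi / 2.0) * np.sum(np.abs(r[:-1]) * np.abs(r[1:])))

def jump_component(r: np.ndarray) -> float:
    """J_t = max(RV_t - BPV_t, 0): BPV is robust to jumps, so the gap isolates them."""
    rv = float(np.sum(r ** 2))
    return max(rv - bipower_variation(r), 0.0)

# A single large return (a "jump") inflates RV but barely moves BPV.
r_jump = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
```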
Second, through a decomposition of RV into the continuous sample path and the jump components based on the Z_{t} statistic [22], Andersen et al. [4] extend the HAR-J model by explicitly incorporating the two types of volatility components mentioned above. The Z_{t} statistic respectively identifies the "significant" jumps CJ_{t} and continuous sample path components CSP_{t} by
$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{CSP}_{t} & \equiv &\displaystyle \mathbb{I}(Z_{t}\leq \varPhi _{\alpha })\cdot \text{RV} _{t}+\mathbb{I}(Z_{t}> \varPhi _{\alpha })\cdot \text{BPV}_{t}, \\ \text{CJ}_{t} & =&\displaystyle \mathbb{I}(Z_{t}>\varPhi _{\alpha })\cdot (\text{RV}_{t}-\text{BPV}_{t}). \end{array} \end{aligned} $$
where Z_{t} is the ratio statistic defined in [22] and Φ_{α} is the α-quantile of the cumulative distribution function (CDF) of a standard Gaussian distribution, with α the level of significance. The daily, weekly, and monthly average components of CSP_{t} and CJ_{t} are then constructed in the same manner as RV^{(l)}. The model specification for the continuous HAR-J, namely HAR-CJ, is given by
$$\displaystyle \begin{aligned} \text{RV}_{t+h}=\beta _{0}+\beta _{d}^{c}\text{CSP}_{t}^{(1)}+\beta _{w}^{c} \text{CSP}_{t}^{(5)}+\beta _{m}^{c}\text{CSP}_{t}^{(22)}+\beta _{d}^{j}\text{ CJ}_{t}^{(1)}+\beta _{w}^{j}\text{CJ}_{t}^{(5)}+\beta _{m}^{j}\text{CJ} _{t}^{(22)}+e_{t+h}. {} \end{aligned} $$
(5)
Note that compared with the HAR-J model, the HAR-CJ model explicitly controls for the weekly and monthly components of continuous jumps. Thus, the HAR-J model can be treated as a special and restrictive case of the HAR-CJ model with
$$\displaystyle \begin{aligned}\beta _{d}=\beta _{d}^{c}+\beta _{d}^{j}, \beta ^{j}=\beta _{d}^{j}, \beta _{w}=\beta _{w}^{c}+\beta _{w}^{j},\ \text{and}\ \beta _{m}=\beta _{m}^{c}+\beta _{m}^{j}.\end{aligned} $$
To capture the role of the "leverage effect" in predicting volatility dynamics, Patton and Sheppard [34] develop a series of models using signed realized measures. The first model, denoted as HAR-RS-I, decomposes the daily RV in the standard HAR model (3) into two asymmetric semivariances \(\text{RS}_{t}^{+}\) and \(\text{RS}_{t}^{-}\):
$$\displaystyle \begin{aligned} \text{RV}_{t+h}=\beta _{0}+\beta _{d}^{+}\text{RS}_{t}^{+}+\beta _{d}^{-} \text{RS}_{t}^{-}+\beta _{w}\text{RV}_{t}^{(5)}+\beta _{m}\text{RV} _{t}^{(22)}+e_{t+h}, {}\end{aligned} $$
(6)
where \(\text{RS}_{t}^{-}=\sum _{j=1}^{M}r_{t,j}^{2}\cdot \mathbb {I} (r_{t,j}<0) \) and \(\text{RS}_{t}^{+}=\sum _{j=1}^{M}r_{t,j}^{2}\cdot \mathbb {I}(r_{t,j}>0) \). To verify whether the realized semivariances add something beyond the classical leverage effect, Patton and Sheppard [34] augment the HAR-RS-I model with a term interacting the lagged RV with an indicator for negative lagged daily returns, \(\text{RV}_{t}^{(1)}\cdot \mathbb {I}(r_{t}<0)\). The second model, in Eq. (7), is denoted as HAR-RS-II:
$$\displaystyle \begin{aligned} \text{RV}_{t+h}=\beta _{0}+\beta _{1}\text{RV}_{t}^{(1)}\cdot \mathbb{I} (r_{t}<0)+\beta _{d}^{+}\text{RS}_{t}^{+}+\beta _{d}^{-}\text{RS} _{t}^{-}+\beta _{w}\text{RV}_{t}^{(5)}+\beta _{m}\text{RV} _{t}^{(22)}+e_{t+h}, {} \end{aligned} $$
(7)
where \(\text{RV}_{t}^{(1)}\cdot \mathbb {I}(r_{t}<0)\) is designed to capture the effect of negative daily returns. As in the HAR-CJ model, the third and fourth models in [34], denoted as HAR-SJ-I and HAR-SJ-II, respectively, disentangle the signed jump variations and the BPV from the volatility process:
$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{RV}_{t+h} & =&\displaystyle \beta _{0}+\beta _{d}^{j}\text{SJ}_{t}+\beta _{d}^{bpv} \text{BPV}_{t}+\beta _{w}\text{RV}_{t}^{(5)}+\beta _{m}\text{RV} _{t}^{(22)}+e_{t+h}, {} \end{array} \end{aligned} $$
(8)
$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{RV}_{t+h} & =&\displaystyle \beta _{0}+\beta _{d}^{j-}\text{SJ}_{t}^{-}+\beta _{d}^{j+}\text{SJ}_{t}^{+}+\beta _{d}^{bpv}\text{BPV}_{t}+\beta _{w}\text{RV} _{t}^{(5)}+\beta _{m}\text{RV}_{t}^{(22)}+e_{t+h}, {} \end{array} \end{aligned} $$
(9)
where \(\text{SJ}_{t}=\text{RS}_{t}^{+}-\text{RS}_{t}^{-}\), \(\text{SJ}_{t}^{+}=\text{SJ}_{t}\cdot \mathbb {I}(\text{SJ}_{t}>0)\), and \(\text{SJ}_{t}^{-}=\text{SJ}_{t}\cdot \mathbb {I}(\text{SJ}_{t}<0)\). The HAR-SJ-II model extends the HAR-SJ-I model by allowing the effect of a positive jump variation to differ in unsystematic ways from the effect of a negative jump variation.
The models discussed above can be generalized using the following formulation in practice:
$$\displaystyle \begin{aligned} y_{t+h}=\boldsymbol{x}_{t}\boldsymbol{\beta }+e_{t+h} \end{aligned}$$
for t = 1, …, n, where y_{t+h} stands for RV_{t+h} and the variable \(\boldsymbol{x}_{t}\) collects all the explanatory variables such that
$$\displaystyle \begin{aligned} \boldsymbol{x}_{t}\equiv \left\{ \begin{array}{ll} \big[1,\text{RV}_{t}^{(1)},\text{RV}_{t}^{(5)},\text{RV}_{t}^{(22)}\big] & \text{for model HAR in (3)}, \\ \big[1,\text{RV}_{t}^{(1)},\text{RV}_{t}^{(5)},\text{RV}_{t}^{(22)},\text{J} _{t}\big] & \text{for model HAR-J in (4)}, \\ \big[1,\text{CSP}_{t}^{(1)},\text{CSP}_{t}^{(5)},\text{CSP}_{t}^{(22)},\text{CJ}_{t}^{(1)},\text{CJ}_{t}^{(5)},\text{CJ}_{t}^{(22)}\big] & \text{for model HAR-CJ in (5)}, \\ \big[1,\text{RS}_{t}^{-},\text{RS}_{t}^{+},\text{RV}_{t}^{(5)},\text{RV} _{t}^{(22)}\big] & \text{for model HAR-RS-I in (6)}, \\ \big[1,\text{RV}_{t}^{(1)}\mathbb{I}_{\{r_{t}<0\}},\text{RS}_{t}^{-},\text{RS} _{t}^{+},\text{RV}_{t}^{(5)},\text{RV}_{t}^{(22)}\big] & \text{for model HAR-RS-II in (7)}, \\ \big[1,\text{SJ}_{t},\text{BPV}_{t},\text{RV}_{t}^{(5)},\text{RV}_{t}^{(22)} \big] & \text{for model HAR-SJ-I in (8)}, \\ \big[1,\text{SJ}_{t}^{-},\text{SJ}_{t}^{+},\text{BPV}_{t},\text{RV} _{t}^{(5)},\text{RV}_{t}^{(22)}\big] & \text{for model HAR-SJ-II in (9)}. \end{array} \right.\end{aligned} $$
Since y_{t+h} is infeasible in period t, in practice we usually obtain the estimated coefficients \(\hat {\boldsymbol {\beta }}\) from the following model:
$$\displaystyle \begin{aligned} y_{t}=\boldsymbol{x}_{th}\boldsymbol{\beta }+e_{t}, {}\end{aligned} $$
(10)
in which both the independent and dependent variables are feasible in periods t = 1, …, n. Once the estimated coefficients \(\hat {\boldsymbol {\beta }}\) are obtained, the h-step-ahead forecast can be computed as
$$\displaystyle \begin{aligned} \hat y_{t+h}= \boldsymbol{x}_{t}\hat{\boldsymbol{\beta }}\ \text{for}\ t=1,\ldots,n.\end{aligned}$$
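The feasible estimation in Eq. (10) amounts to regressing y_t on x_{t−h} and then applying the fitted coefficients to the current x_t. A numpy sketch with h = 1 and simulated data (all names and the data-generating process are illustrative):

```python
import numpy as np

def fit_and_forecast(y: np.ndarray, X: np.ndarray, h: int) -> np.ndarray:
    """OLS of y_t on x_{t-h} (Eq. (10)), then h-step-ahead forecasts x_t * beta-hat."""
    beta, *_ = np.linalg.lstsq(X[:-h], y[h:], rcond=None)
    return X @ beta

rng = np.random.default_rng(7)
h = 1
x = rng.normal(size=200)
y = np.empty(200)
y[h:] = 0.5 + 0.8 * x[:-h] + rng.normal(scale=0.1, size=200 - h)
y[:h] = 0.5                              # arbitrary initial value
X = np.column_stack([np.ones(200), x])
forecasts = fit_and_forecast(y, X, h)    # forecasts[t] predicts y[t + h]
```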
4 Machine Learning Strategy to Forecast Volatility
Machine learning tools are increasingly being used in the forecasting literature. In this section, we briefly describe five of the most popular machine learning algorithms that have been shown to outperform econometric strategies when conducting forecasts. That said, as Lehrer and Xie [26] stress, the "No Free Lunch" theorem of Wolpert and Macready [41] indicates that, in practice, multiple algorithms should be considered in any application.
The first strategy we consider was developed to assist in the selection of predictors in the main model. Consider the regression model in Eq. (10), which contains many explanatory variables. To reduce the dimensionality of the set of explanatory variables, Tibshirani [39] proposed the LASSO estimator \(\hat {\boldsymbol {\beta }}^{\text{LASSO}}\) that solves
$$\displaystyle \begin{aligned} \hat{\boldsymbol{\beta }}^{\text{LASSO}}=\arg \min_{\boldsymbol{\beta }}\sum_{t=1}^{n}\left( y_{t}-\boldsymbol{x}_{t-h}\boldsymbol{\beta }\right) ^{2}+\lambda \sum_{k}\vert \beta _{k}\vert , {} \end{aligned} $$
(11)
where λ is a tuning parameter that controls the penalty term. Using the estimates of Eq. (11), the h-step-ahead forecast is constructed in an identical manner as OLS:
$$\displaystyle \begin{aligned} \hat{y}_{t+h}^{\text{LASSO}}=\boldsymbol{x}_{t}\hat{\boldsymbol{\beta }}^{ \text{LASSO}}. \end{aligned}$$
The LASSO has been used in many applications and a general finding is that it is more likely to offer benefits relative to the OLS estimator when either (1) the number of regressors exceeds the number of observations, since it involves shrinkage, or (2) the number of parameters is large relative to the sample size, necessitating some form of regularization.
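A minimal sketch of Eq. (11) using scikit-learn's `Lasso`, whose `alpha` plays the role of λ. The data-generating process below is ours, chosen so that only two of ten candidate predictors matter:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
# Only the first two of ten candidate predictors are relevant.
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)       # alpha ~ lambda in Eq. (11)
n_selected = int(np.sum(lasso.coef_ != 0))
```

The penalty shrinks the irrelevant coefficients toward (and typically exactly to) zero, which is the sparsity property exploited in the chapter.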
Recursive partitioning methods do not model the relationship between the explanatory variables and the outcome being forecasted with a regression model such as Eq. (10). Breiman et al. [10] propose a strategy known as classification and regression trees (CART), in which classification trees are used to forecast qualitative outcomes, including categorical responses of nonnumeric symbols and texts, while regression trees focus on quantitative response variables. Given that the extreme volatility in Bitcoin gives rise to a continuous variable, we use regression trees (RT).
Consider a sample \(\{y_{t},\boldsymbol {x}_{t-h}\}_{t=1}^{n}\). Intuitively, RT operates in a similar manner to forward stepwise regression. A fast divide-and-conquer greedy algorithm considers all possible splits in each explanatory variable to recursively partition the data. Formally, a node τ containing n_{τ} observations with mean outcome \(\overline {y}(\tau )\) can only be split by one selected explanatory variable into two leaves, denoted as τ_{L} and τ_{R}. The split is made at the explanatory variable that leads to the largest reduction of a predetermined loss function between the two regions. This splitting process continues at each new node until the gain to any forecast adds little value relative to a predetermined boundary. Forecasts at each final leaf are the fitted values from a local constant regression model.
Among machine learning strategies, the popularity of RT is high since the results of the analysis are easy to interpret. The algorithm that determines the split allows partitions among the entire covariate set to be described by a single tree. This contrasts with econometric approaches that begin by assuming a linear parametric form to explain the same process and, as with the LASSO, build a statistical model to make forecasts by selecting which explanatory variables to include. The tree structure considers the full set of explanatory variables and further allows for nonlinear predictor interactions that could be missed by conventional econometric approaches. The tree is simply a top-down, flowchart-like model that represents how the dataset was partitioned into numerous final leaf nodes. The predictions of a RT can be represented by a series of discontinuous flat surfaces forming an overall rough shape, whereas, as we describe below, visualizations of forecasts from other machine learning methods are not intuitive.
If the data are stationary and ergodic, the RT method often demonstrates gains in forecasting accuracy relative to OLS. Intuitively, we expect the RT method to perform well since it looks to partition the sample into subgroups with heterogeneous features. With time series data, it is likely that these splits will coincide with jumps and structural breaks. However, with primarily cross-sectional data, the statistical learning literature has discovered that individual regression trees are not powerful predictors relative to ensemble methods since they exhibit large variance [21].
Ensemble methods combine estimates from multiple outputs. Bootstrap aggregating decision trees (aka bagging), proposed in [8], and random forest (RF), developed in [9], are randomization-based ensemble methods. In bagging trees (BAG), trees are built on random bootstrap copies of the original data. The BAG algorithm is summarized below:
(i) Take a random sample with replacement from the data.
(ii) Construct a regression tree.
(iii) Use the regression tree to make a forecast, \(\hat f\).
(iv) Repeat steps (i) to (iii) b = 1, …, B times and obtain \(\hat f^b\) for each b.
(v) Take a simple average of the B forecasts, \(\hat f_{\text{BAG}} = \frac {1}{B}\sum ^B_{b=1}\hat f^b \), and treat the averaged value \(\hat f_{\text{BAG}}\) as the final forecast.
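Steps (i)–(v) can be reproduced with scikit-learn's `BaggingRegressor`, whose default base learner is a regression tree (the simulated data and settings are illustrative):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * np.sign(X[:, 1]) + rng.normal(scale=0.1, size=400)

# B = 50 bootstrap samples, one regression tree per sample;
# the final forecast is the simple average of the 50 tree forecasts.
bag = BaggingRegressor(n_estimators=50, random_state=0).fit(X, y)
in_sample_r2 = bag.score(X, y)
```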
Forecast accuracy generally increases with the number of bootstrap samples in the training process. However, more bootstrap samples increase computational time. RF can be regarded as a less computationally intensive modification of BAG. Similar to BAG, RF also constructs B new trees with (conventional or moving block) bootstrap samples from the original dataset. With RF, at each node of every tree, only a random sample (without replacement) of q predictors out of the total K (q < K) predictors is considered to make a split. This process is repeated, and the remaining steps (iii)–(v) of the BAG algorithm are followed. Only if q = K is RF roughly equivalent to BAG. RF forecasts involve B trees like BAG, but these trees are less correlated with each other since fewer variables are considered for a split at each node. The final RF forecast is calculated as the simple average of the forecasts from each of these B trees.
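The only change relative to BAG is the per-split subsampling of predictors; with scikit-learn's `RandomForestRegressor`, `max_features` plays the role of q (simulated data and settings are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 6))                      # K = 6 predictors
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=400)

# q = 2 of the K = 6 predictors are candidates at each split; using all
# K features at every split would make RF behave like BAG.
rf = RandomForestRegressor(n_estimators=200, max_features=2,
                           random_state=0).fit(X, y)
```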
The RT method can respond to highly local features in the data and is quite flexible at capturing nonlinear relationships. The final machine learning strategy we consider refines how highly local features of the data are captured. This strategy is known as boosting trees and was introduced in [21, Chapter 10]. Observations responsible for the local variation are given more weight in the fitting process. If the algorithm continues to fit those observations poorly, we reapply the algorithm with increased weight placed on those observations.
We consider a simple least squares boosting that fits RT ensembles (BOOST). Regression trees partition the space of all joint predictor variable values into disjoint regions R_{j}, j = 1, 2, …, J, represented by the terminal nodes of the tree. A constant γ_{j} is assigned to each such region, and the predictive rule is X ∈ R_{j} ⇒ f(X) = γ_{j}, where X is the matrix with t-th component x_{t−h}. Thus, a tree can be formally expressed as \(T( \boldsymbol {X},\varTheta )=\sum _{j=1}^{J}\gamma _{j}\mathbb {I}(\boldsymbol {X} \in R_{j}),\) with parameters \(\varTheta =\{R_{j},\gamma _{j}\}_{j=1}^{J}\). The parameters are found by minimizing the risk
$$\displaystyle \begin{aligned} \hat{\varTheta}=\arg \min_{\varTheta }\sum_{j=1}^{J}\sum_{\boldsymbol{x}_{t-h}\in R_{j}}\mathcal{L}(y_{t},\gamma _{j}), \end{aligned}$$
where \(\mathcal L(\cdot )\) is the loss function, for example, the sum of squared residuals (SSR).
The BOOST method is a sum of all trees:
$$\displaystyle \begin{aligned} f_M(\boldsymbol{X}) = \sum^M_{m=1}T(\boldsymbol{X};\varTheta_m) \end{aligned}$$
induced in a forward stagewise manner. At each step in the forward stagewise procedure, one must solve
$$\displaystyle \begin{aligned} \hat{\varTheta}_{m}=\arg \min_{\varTheta _{m}}\sum_{t=1}^{n}\mathcal{L}\left( y_{t},f_{m-1}(\boldsymbol{x}_{t-h})+T(\boldsymbol{x}_{t-h};\varTheta _{m})\right) {} \end{aligned} $$
(12)
for the region set and constants \(\varTheta _m = \{R_{jm},\gamma _{jm}\}^{J_m}_1\) of the next tree, given the current model f_{m−1}(X). For squared-error loss, the solution is quite straightforward: it is simply the regression tree that best predicts the current residuals y_{t} − f_{m−1}(x_{t−h}), and \(\hat \gamma _{jm}\) is the mean of these residuals in each corresponding region.
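The forward stagewise logic, where each new tree fits the current residuals, can be written in a few lines with shallow regression trees. This is a least-squares sketch with a unit learning rate; the data and names are ours:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=300)

f_m = np.zeros_like(y)                    # f_0 = 0
for m in range(50):
    # Fit the next tree to the current residuals y_t - f_{m-1}(x_t),
    # then update f_m = f_{m-1} + T(X; Theta_m), as in Eq. (12).
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - f_m)
    f_m = f_m + tree.predict(X)

boost_mse = float(np.mean((y - f_m) ** 2))
```

In practice a learning rate below one and a held-out sample are used to avoid overfitting the residuals.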
A popular alternative to a tree-based procedure for solving regression problems developed in the machine learning literature is support vector regression (SVR). SVR has been found in numerous applications, including Lehrer and Xie [26], to perform well in settings with a small number of observations (< 500). Support vector regression is an extension of the support vector machine classification method of Vapnik [40]. The key feature of this algorithm is that it solves for a best-fitting hyperplane using a learning algorithm that infers the functional relationships in the underlying dataset by following the structural risk minimization induction principle of Vapnik [40]. Since it looks for a functional relationship, it can find nonlinearities that many econometric procedures may miss, using an a priori chosen mapping that transforms the original data into a higher-dimensional space.
Support vector regression was introduced in [16]. Suppose the data one wishes to forecast are generated as y_{t} = f(x_{t}) + e_{t}, where f is unknown to the researcher and e_{t} is the error term. The SVR framework approximates f(x_{t}) in terms of a set of basis functions \(\{h_s(\cdot )\}^S_{s=1}\):
$$\displaystyle \begin{aligned} y_{t}=f({ {x}}_{t})+e_t =\sum_{s=1}^{S}\beta _{s}h_{s}({ {x}}_{t})+e_{t}, \end{aligned}$$
where h_{s}(⋅) is implicit and can be infinite-dimensional. The coefficients β = [β_{1}, ⋯ , β_{S}]^{⊤} are estimated through the minimization of
$$\displaystyle \begin{aligned} H( {\beta })=\sum_{t=1}^{T}V_{\epsilon}\left( y_{t}-f( {x} _{t})\right) +{\lambda }\sum_{s=1}^{S}\beta _{s}^{2}, {} \end{aligned} $$
(13)
where the loss function
$$\displaystyle \begin{aligned} V_{\epsilon}(r)=\left\{ \begin{array}{cl} 0 & \text{if }|r|<\epsilon \\ |r|-\epsilon & \text{otherwise} \end{array} \right. \end{aligned}$$
is called an 𝜖-insensitive error measure that ignores errors of size less than 𝜖. The parameter 𝜖 is usually decided beforehand, and λ can be estimated by cross-validation.
Suykens and Vandewalle [38] proposed a modification to the classic SVR that eliminates the hyperparameter 𝜖 and replaces the original 𝜖-insensitive loss function with a least squares loss function. This is known as the least squares SVR (LS-SVR). The LS-SVR considers minimizing
$$\displaystyle \begin{aligned} H(\boldsymbol{\beta })=\sum_{t=1}^{T}\left( y_{t}-f( {x} _{t})\right) ^{2}+{\lambda }\sum_{s=1}^{S}\beta _{s}^{2}, {} \end{aligned} $$
(14)
where a squared loss function replaces V_{𝜖}(⋅).
Estimating the nonlinear algorithms (13) and (14) requires a kernel-based procedure that can be interpreted as mapping the data from the original input space into a potentially higher-dimensional "feature space," where linear methods may then be used for estimation. The use of kernels enables us to avoid paying the computational penalty implicit in the number of dimensions, since it is possible to evaluate the training data in the feature space through indirect evaluation of the inner products. As such, the kernel function is essential to the performance of SVR and LS-SVR: apart from the measures of the outcome variable, it contains all the information available in the model and training data to perform supervised learning. Formally, we define the kernel function K(x, x_{t}) = h(x)h(x_{t})^{⊤} as the linear dot product of the nonlinear mapping for any input variable x. In our analysis, we consider the Gaussian kernel (sometimes referred to as the "radial basis function" or "Gaussian radial basis function" in the support vector literature):
$$\displaystyle \begin{aligned} K(\boldsymbol x, { {x}}_{t})=\exp \left( -\frac{\Vert {\boldsymbol x - {x}}_{t}\Vert ^{2}}{2\sigma _{x}^{2}}\right), \end{aligned}$$
where the hyperparameters, such as the kernel width \(\sigma _{x}^{2}\) and the regularization parameter λ, are selected by cross-validation.
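A minimal numpy sketch of the LS-SVR objective (14) with the Gaussian kernel, solved in its kernelized (dual) form; the function names and the particular values of λ and σ² are illustrative assumptions, and the bias term is dropped for brevity.

```python
import numpy as np

def gaussian_kernel(A, B, sigma2=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for every pair of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma2))

def lssvr_fit(X, y, lam=0.1, sigma2=1.0):
    """Solve the least squares objective in the dual: (K + lam I) alpha = y."""
    K = gaussian_kernel(X, X, sigma2)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def lssvr_predict(alpha, X_train, X_new, sigma2=1.0):
    """f(x) = sum_t alpha_t K(x, x_t): a kernel-weighted sum over training points."""
    return gaussian_kernel(X_new, X_train, sigma2) @ alpha
```

The kernel trick is visible here: the fit and the forecast only ever touch inner products through K, never the (possibly infinite-dimensional) basis functions h_s themselves.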
In our main analysis, we use tenfold cross-validation to pick the tuning parameters for LASSO, SVR, and LS-SVR. For tree-type machine learning methods, we set the basic hyperparameters of a regression tree at their default values. These include, but are not limited to: (1) the split criterion is SSR; (2) the maximum number of splits is 10 for BOOST and n − 1 for the others; (3) the minimum leaf size is 1; (4) the number of predictors considered for a split is K∕3 for RF and K for the others; and (5) the number of learning cycles is B = 100 for the ensemble learning methods. We examine the robustness of our results to different hyperparameter values in Sect. 7.3.
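The tenfold selection step amounts to a grid search over penalty values. In this sketch, ridge regression stands in for the penalized estimators of the text because it has a closed-form fit; the candidate grid and function name are our illustrative assumptions.

```python
import numpy as np

def cv_pick_lambda(X, y, lambdas, k=10, seed=0):
    """k-fold cross-validation over a penalty grid, returning the value
    with the smallest out-of-fold squared forecast error."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for lam in lambdas:
        sse = 0.0
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            # closed-form ridge fit on the training folds
            A = X[train].T @ X[train] + lam * np.eye(X.shape[1])
            beta = np.linalg.solve(A, X[train].T @ y[train])
            sse += ((y[fold] - X[fold] @ beta) ** 2).sum()
        errors.append(sse)
    return lambdas[int(np.argmin(errors))]
```

Note that for time series data, random fold assignment ignores serial dependence; blocked or rolling-origin folds are a common alternative.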
5 Social Media Data
Substantial progress has been made in the machine learning literature on quickly converting text to data, generating real-time information on social media content. To measure social media sentiment, we selected an algorithm introduced in [17] that pretrained a five-hidden-layer neural model on 124.6 million tweets containing emojis in order to learn better representations of the emotional context embedded in the tweet. This algorithm was developed to provide a means to learn representations of emotional content in texts and is available, with preprocessing code, examples of usage, and benchmark datasets, among other features, at http://www.github.com/bfelbo/deepmoji. The pretraining data are split into a training, validation, and test set, where the validation and test sets are randomly sampled in such a way that each emoji is equally represented. These data include all English Twitter messages without URLs within the period considered that contained an emoji. The fifth layer of the algorithm focuses on attention and takes inputs from the prior layers, using a multiclass learner to decode the text and emojis. See [17] for further details. Thus, an emoji is viewed as a labeling system for emotional content.
The construction of the algorithm began by acquiring a dataset of 55 billion tweets, of which all tweets with emojis were used to train a deep learning model. That is, the text in each tweet was used to predict which emoji was included with that tweet. The premise of this algorithm is that if it can understand which emoji was included with a given sentence in the tweet, then it has a good understanding of the emotional content of that sentence. The goal of the algorithm is to understand the emotions underlying the words that an individual tweets. The key feature of this algorithm, compared to one that simply scores words themselves, is that it is better able to detect irony and sarcasm. As such, the algorithm does not score individual emotion words in a Twitter message, but rather calculates a score based on the probability of each of 64 different emojis capturing the sentiment in the full Twitter message, taking the structure of the sentence into consideration. Thus, each emoji has a fixed score, and the sentiment of a message is a weighted average of the type of mood being conveyed, since messages containing multiple words are translated to a set of emojis to capture the emotion of the words within.
In brief, for a random sample of 10% of all tweets every minute, the score is calculated as an equal-weighted average across tweets of the sentiment values of the words within them.^{13} That is, we apply the pretrained classifier of Felbo et al. [17] to score each of these tweets, and we note that there are computational challenges related to data storage when using very large datasets to undertake sentiment analysis. In our application, the number of tweets generally varies between 120,000 and 200,000 per hour in our 10% random sample. We denote the minute-level sentiment index as the U.S. Sentiment Index (USSI).
In other words, if there are 10,000 tweets in an hour, we first convert each tweet to a set of emojis. Then we convert the emojis to numerical values based on a fixed mapping related to their emotional content. For each of the 10,000 tweets posted in that hour, we next calculate the average of these scores as the emotional content, or sentiment, of that individual tweet. We then calculate the equal-weighted average of these tweet-specific scores to obtain an hourly measure. Thus, each tweet is treated equally, irrespective of whether one tweet contains more emojis than another. This is repeated for each hour of each day in our sample, providing us with a large time series.
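The two-stage averaging just described can be sketched as follows; the timestamps and per-tweet emoji scores are made-up illustrative data, not values from the chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each tweet has a timestamp and the sentiment scores
# implied by the emojis assigned to it by the classifier.
tweets = pd.DataFrame({
    "time": pd.to_datetime(["2017-01-01 09:05", "2017-01-01 09:40",
                            "2017-01-01 10:10", "2017-01-01 10:50"]),
    "emoji_scores": [[0.8, 0.6], [0.2], [0.9, 0.7, 0.5], [0.4]],
})

# Stage 1: tweet-level sentiment = average of that tweet's emoji scores.
tweets["sentiment"] = tweets["emoji_scores"].map(np.mean)

# Stage 2: hourly index = equal-weighted average over tweets in the hour,
# so a tweet with many emojis counts no more than a tweet with one.
hourly = tweets.set_index("time")["sentiment"].resample("1h").mean()
```

Here the 09:00 hour averages the tweet scores 0.7 and 0.2 to 0.45, regardless of how many emojis each tweet carried.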
Similar to many other text mining tasks, this sentiment analysis was initially designed to deal with English text. It would be simple to apply an off-the-shelf machine translation tool in the spirit of Google Translate to generate pseudo-parallel corpora and then learn bilingual representations for the downstream sentiment classification of tweets that were initially posted in other languages. That said, due to the ubiquitous usage of emojis across languages and their functionality of expressing sentiment, alternative emoji-powered algorithms have been developed for other languages. These have smaller training datasets, since most tweets are in English, and it is an open question whether they perform better than applying the [17] algorithm to translated pseudo-tweets.
Note that the way we construct the USSI does not focus on sentiment related to cryptocurrency only, as in [29]. Sentiment, both in and off the market, has been a major factor affecting the prices of financial assets [23]. Empirical work has documented that large national sentiment swings can cause large fluctuations in asset prices; see, for example, [5, 37]. It is therefore natural to assume that national sentiment can affect financial market volatility.
Data timing presents a serious challenge in using minute-level measures of the USSI to forecast the daily Bitcoin RV. Since the USSI is constructed at the minute level, we convert it to match the daily sampling frequency of Bitcoin RV using the heterogeneous mixed data sampling (H-MIDAS) method of Lehrer et al. [28].^{14} This allows us to transform 1,172,747 minute-level observations of the USSI variable, via a step function that allows for heterogeneous effects of different high-frequency observations, into 775 daily observations for the USSI at different forecast horizons. This step function places different weights on the hourly levels of the time series and can capture the relative importance of users' emotional content across the day, since the type of user varies in a manner that may be related to BTC volatility. The estimated weights used in the H-MIDAS transformation for our application are presented in Fig. 2.
Last, Table 1 presents summary statistics for the RV data, along with p values from the Jarque–Bera test for normality and the augmented Dickey–Fuller (ADF) test for a unit root. We consider the first half sample, the second half sample, and the full sample. Each series exhibits tremendous variability and a large range across the sample period. Further, none of the series is normally distributed, and all are stationary at the 5% level.
Table 1
Descriptive statistics

Statistics    | Realized variance                        | USSI
              | First half | Second half | Full sample  |
Mean          | 43.4667    | 12.1959     | 27.8313      | 117.4024
Median        | 31.2213    | 7.0108      | 17.4019      | 125.8772
Maximum       | 197.6081   | 115.6538    | 197.6081     | 657.4327
Minimum       | 5.0327     | 0.5241      | 0.5241       | −866.6793
Std. dev.     | 38.0177    | 15.6177     | 32.9815      | 179.1662
Skewness      | 2.1470     | 3.3633      | 2.6013       | −0.8223
Kurtosis      | 7.8369     | 18.2259     | 11.2147      | 5.8747
Jarque–Bera   | 0.0000     | 0.0000      | 0.0000       | 0.0000
ADF test      | 0.0000     | 0.0000      | 0.0000       | 0.0000

Table 2
List of estimators

Panel A: conventional regression
(1) AR(1)       A simple autoregressive model
(2) HAR-Full    The HAR model proposed in [11] with l = [1, 2, …, 30], which is equivalent to AR(30)
(3) HAR         The conventional HAR model proposed in [11] with l = [1, 7, 30]
(4) HAR-J       The HAR model with jump component proposed in [4]
(5) HAR-CJ      The HAR model with continuous jump component proposed in [4]
(6) HAR-RS-I    The HAR model with semivariance components (Type I) proposed in [34]
(7) HAR-RS-II   The HAR model with semivariance components (Type II) proposed in [34]
(8) HAR-SJ-I    The HAR model with semivariance and jump components (Type I) proposed in [34]
(9) HAR-SJ-II   The HAR model with semivariance and jump components (Type II) proposed in [34]

Panel B: machine learning strategy
(10) LASSO      The least absolute shrinkage and selection operator by Tibshirani [39]
(11) RT         The regression tree method proposed by Breiman et al. [10]
(12) BOOST      The boosting tree method described in [21]
(13) BAG        The bagging tree method proposed by Breiman [8]
(14) RF         The random forest method proposed by Breiman [9]
(15) SVR        The support vector machine for regression by Drucker et al. [16]
(16) LS-SVR     The least squares support vector regression by Suykens and Vandewalle [38]

6 Empirical Exercise
To examine the relative prediction efficiency of the different estimators, we conduct an h-step-ahead rolling window exercise of forecasting the BTC/USD RV for different forecasting horizons.^{15} Table 2 lists each estimator analyzed in the exercise. For all the HAR-type estimators in Panel A (except the HAR-Full model, which uses all lagged covariates from 1 to 30), we set l = [1, 7, 30]. For the machine learning methods in Panel B, the input data include the same covariates as the HAR-Full model. Throughout the experiment, the window length is fixed at WL = 400 observations. Our conclusions are robust to other window lengths, as discussed in Sect. 7.1.
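The design of the rolling exercise can be sketched with the simplest Panel A estimator, an AR(1) refit by OLS in each window; the function name and the illustrative window length are our assumptions, not the chapter's code.

```python
import numpy as np

def rolling_ar1_forecasts(y, wl=400, h=1):
    """h-step-ahead rolling-window AR(1) forecasts with a fixed window
    length wl: re-estimate on each window, then iterate the fit forward."""
    preds, actuals = [], []
    for end in range(wl, len(y) - h + 1):
        win = y[end - wl:end]
        # OLS fit of y_t on [1, y_{t-1}] inside the window
        Xw = np.column_stack([np.ones(wl - 1), win[:-1]])
        beta, *_ = np.linalg.lstsq(Xw, win[1:], rcond=None)
        f = win[-1]
        for _ in range(h):
            f = beta[0] + beta[1] * f   # iterate the AR(1) h steps ahead
        preds.append(f)
        actuals.append(y[end + h - 1])
    return np.array(preds), np.array(actuals)
```

Each competing estimator in Table 2 slots into the same loop in place of the OLS step, so all methods are compared on identical windows and horizons.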
Table 3
Forecasting performance of strategies in the main exercise
To examine if the sentiment data extracted from social media improve forecasts, we contrast forecasts from models that exclude the USSI with forecasts from models that include the USSI as a predictor. Methods incorporating the USSI variable are denoted with the ∗ symbol in each table. The results of the prediction experiment are presented in Table 3. The estimation strategy is listed in the first column, and the remaining columns present alternative criteria to evaluate forecasting performance. The criteria include the mean squared forecast error (MSFE), quasi-likelihood (QLIKE), mean absolute forecast error (MAFE), and standard deviation of the forecast error (SDFE), calculated as
$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{MSFE}(h)& =&\displaystyle \frac{1}{V}\sum_{j=1}^{V}{e}^2_{T_{j},h}, {} \end{array} \end{aligned} $$
(15)
$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{QLIKE}(h)& =&\displaystyle \frac{1}{V}\sum_{j=1}^{V}\left(\log\hat{y}_{T_{j},h}+\frac{y_{T_{j},h}}{\hat{y}_{T_{j},h} }\right), \end{array} \end{aligned} $$
(16)
$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{MAFE}(h)& =&\displaystyle \frac{1}{V}\sum_{j=1}^{V}\left|e_{T_{j},h}\right| , {} \end{array} \end{aligned} $$
(17)
$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{SDFE}(h)& =&\displaystyle \sqrt{\frac{1}{V-1}\sum_{j=1}^{V}\left( e_{T_{j},h}-\frac{1}{V} \sum_{j=1}^{V}e_{T_{j},h}\right) ^{2}}, {} \end{array} \end{aligned} $$
(18)
where \(e_{T_{j},h}=y_{T_{j},h}-\hat {y}_{T_{j},h}\) is the forecast error and \(\hat {y}_{T_{j},h}\) is the h-day-ahead forecast with information up to T_{j}, the last observation in each of the V rolling windows. We also report the Pseudo-R^{2} of the Mincer–Zarnowitz regression [32] given by:
$$\displaystyle \begin{aligned} y_{T_{j},h}=a+b\hat{y}_{T_{j},h}+u_{T_{j}},\quad \text{for }j=1,2,\ldots ,V. {} \end{aligned} $$
(19)
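The four loss criteria translate directly into code; this helper is a sketch (the function name is ours), and the QLIKE term is defined only for strictly positive forecasts.

```python
import numpy as np

def forecast_metrics(y, yhat):
    """Compute MSFE, QLIKE, MAFE, and SDFE of Eqs. (15)-(18) over the V
    rolling-window forecast errors."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    e = y - yhat
    return {
        "MSFE": np.mean(e ** 2),
        "QLIKE": np.mean(np.log(yhat) + y / yhat),
        "MAFE": np.mean(np.abs(e)),
        # sqrt of (1/(V-1)) times the sum of squared demeaned errors
        "SDFE": np.std(e, ddof=1),
    }
```

Unlike MSFE, QLIKE penalizes under-prediction of variance asymmetrically, which is why both are commonly reported for volatility forecasts.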
Each panel in Table 3 presents the results for a specific forecasting horizon; we consider h = 1, 2, 4, and 7. To ease interpretation, we focus on the following representative methods: HAR, HAR-CJ, HAR-RS-II, LASSO, RF, BAG, and LS-SVR, with and without the USSI variable. Comparison results between all methods listed in Table 2 are available upon request. We find a consistent ranking of methods across all forecast horizons. The tree-based machine learning methods (BAG and RF) outperform all others in each panel. Moreover, methods with the USSI (indicated by ∗) always dominate those without it, which indicates the importance of incorporating social media sentiment data. We also find that the conventional econometric methods have unstable performance; for example, the HAR-RS-II model without the USSI has the worst performance when h = 1, but its performance improves when h = 2. The mixed performance of the linear models implies that this restrictive formulation may not be well suited to modeling the highly volatile BTC/USD RV process.
To examine if the improvement from the BAG and RF methods is statistically significant, we perform the modified Giacomini–White test [18] of the null hypothesis that the column method performs as well as the row method in terms of MAFE. The corresponding p values are presented in Table 4 for h = 1, 2, 4, 7. We see that the gains in forecast accuracy from BAG^{∗} and RF^{∗} relative to all other strategies are statistically significant, although the results for BAG^{∗} and RF^{∗} are statistically indistinguishable from each other.
Table 4
Giacomini–White test results
Table 5
Forecasting performance by different window lengths (h = 1)
7 Robustness Check
In this section, we perform four robustness checks of our main results. We first vary the window length for the rolling window exercise in Sect.
7.1. We next consider different sample periods in Sect.
7.2. We explore the use of different hyperparameters for the machine learning methods in Sect.
7.3. Our final robustness check examines whether BTC/USD RV is correlated with other financial markets by including the RV of mainstream assets as additional covariates. Each of the robustness checks reported in the main text considers h = 1.^{16}
7.1 Different Window Lengths
In the main exercise, we set the window length WL = 400. In this section, we also try other window lengths, WL = 300 and 500. Table 5 shows the forecasting performance of all the estimators for the various window lengths. In all cases, BAG^{∗} and RF^{∗} yield the smallest MSFE, MAFE, and SDFE and the largest Pseudo-R^{2}. We examine the statistical significance of the improvement in forecasting accuracy in Table 6. The small p-values from testing BAG^{∗} and RF^{∗} against the other strategies indicate that the improvement in forecasting accuracy is statistically significant at the 5% level.
Table 6
Giacomini–White test results by different window lengths (h = 1)
7.2 Different Sample Periods
In this section, we partition the entire sample period in half: the first subsample runs from May 20, 2015, to July 29, 2016, and the second from July 30, 2016, to August 20, 2017. We carry out a similar out-of-sample analysis with WL = 200 for the two subsamples in Table 7, Panels A and B, respectively. We also examine the statistical significance in Table 8. The previous conclusions remain essentially unchanged in the subsamples.
Table 7
Forecasting performance by different sample periods (h = 1)
Table 8
Giacomini–White test results by different sample periods (h = 1)
7.3 Different Tuning Parameters
In this section, we examine the effect of different tuning parameters on the machine learning methods. We consider a different set of tuning parameters: B = 20 for RF and BAG, and λ = 0.5 for LASSO, SVR, and LS-SVR. The machine learning methods with this second set of tuning parameters are labeled RF2, BAG2, and LASSO2. We replicate the main empirical exercise in Sect. 6 and compare the performance of the machine learning methods under the different tuning parameters. The results are presented in Tables 9 and 10. Changes in the considered tuning parameters generally have marginal effects on forecasting performance, although the results for the second set of tuning parameters are slightly worse than those under the default settings. Last, social media sentiment data play a crucial role in improving the out-of-sample performance in each of these exercises.
Table 9
Forecasting performance by different tuning parameters (h = 1)
Table 10
Giacomini–White test results by different tuning parameters (h = 1)
7.4 Incorporating Mainstream Assets as Extra Covariates
In this section, we examine whether mainstream asset classes have spillover effects on BTC/USD RV. We include the RVs of the S&P 500 and NASDAQ index ETFs (ticker names SPY and QQQ, respectively) and the CBOE Volatility Index (VIX) as extra covariates. For SPY and QQQ, we proxy daily spot variances by daily realized variance estimates. For the VIX, we collect the daily data from the CBOE. The extra covariates are described in Table 11.
Table 11
Descriptive statistics

Statistics    | SPY      | QQQ      | VIX
Mean          | 0.3839   | 0.7043   | 15.0144
Median        | 0.2034   | 0.3515   | 13.7300
Maximum       | 12.1637  | 70.6806  | 40.7400
Minimum       | 0.0143   | 0.0468   | 9.3600
Std. dev.     | 0.6946   | 3.1108   | 4.5005
Skewness      | 10.1587  | 21.3288  | 1.6188
Kurtosis      | 158.5806 | 479.5436 | 6.3394
Jarque–Bera   | 0.0010   | 0.0010   | 0.0010
ADF test      | 0.0010   | 0.0010   | 0.0010

The data range is from May 20, 2015, to August 18, 2017, with 536 total observations. Fewer observations are available because mainstream asset exchanges are closed on weekends and holidays; we truncate the BTC/USD data accordingly. We compare forecasts from models with two groups of covariate data: one with only the USSI variable and the other including both the USSI variable and the mainstream RV data (SPY, QQQ, and VIX). Estimates that include the larger covariate set are denoted by the symbol ∗∗.
The rolling window forecasting results with WL = 300 are presented in Table 12. Comparing results for any strategy between Panels A and B, we do not observe obvious improvements in forecasting accuracy. This implies that mainstream asset market RV does not affect BTC/USD volatility, consistent with the view that crypto-assets can serve as a hedging device for many investment companies.^{17}
Table 12
Forecasting performance
Last, we use the GW test to formally explore whether there are differences in forecast accuracy between the panels in Table 13. For each estimator, we present the p-values comparing the different covariate groups in bold. Each of these p-values exceeds 5%, which supports our finding that, unlike the inclusion of social media data, mainstream asset RV data does not sharply improve forecasts.
Table 13
Giacomini–White test results
8 Conclusion
In this chapter, we compare the performance of numerous econometric and machine learning forecasting strategies in explaining the short-term realized volatility of the Bitcoin market. Our results first complement a rapidly growing body of research that finds benefits from using machine learning techniques in the context of financial forecasting. Our application involves forecasting an asset that exhibits significantly more variation than much of the earlier literature, which could present challenges in settings such as ours with fewer than 800 observations. Yet our results further highlight that what drives the benefits of machine learning is the accounting for nonlinearities; the gains from regularization or cross-validation are much smaller. Second, we find substantial benefits from using social media data in our forecasting exercise that hold irrespective of the estimator. These benefits are larger when we use new econometric tools to more flexibly handle the difference in the timing of the sampling of social media and financial data.
Taken together, there are benefits from using both new data sources from the social web and predictive techniques developed in the machine learning literature for forecasting financial data. We suggest that the benefits from these tools will likely increase as researchers begin to understand why they work and what they measure. While our analysis suggests that nonlinearities are important to account for, more work is needed to incorporate heterogeneity from heteroskedastic data in machine learning algorithms.^{18} We observe significant differences between SVR and LS-SVR, so the change in loss function can explain a portion of the gains within machine learning relative to econometric strategies, but not to the same extent as nonlinearities, which the tree-based strategies also account for while using a similar SSR-based loss function.
Our investigation focused on the performance of what are currently the most popular algorithms considered by social scientists. There have been many advances developing powerful algorithms in the machine learning literature, including deep learning procedures that consider more hidden layers than the neural network procedures studied in the econometrics literature between 1995 and 2015. Similarly, among tree-based procedures, we did not consider eXtreme gradient boosting, which applies more penalties in the boosting equation when updating trees and residuals than the classic boosting method we employed. Both eXtreme gradient boosting and deep learning methods present significant challenges regarding interpretability relative to the algorithms we examined in the empirical exercise.
Further, machine learning algorithms were not developed for time series data, and more work is needed to develop methods that can account for serial dependence, long memory, and the consequences of having heterogeneous investors.^{19} That is, while time series forecasting is an important area of machine learning (see [19, 30] for recent overviews covering both one-step-ahead and multi-horizon forecasting), concepts such as autocorrelation and stationarity, which pervade developments in financial econometrics, have received less attention. We believe there is potential for hybrid approaches in the spirit of Lehrer and Xie [25] with group LASSO estimators. Further, developing machine learning approaches that consider interpretability appears crucial for many forecasting exercises whose results need to be conveyed to business leaders who want to make data-driven decisions. Last, given the random sample of Twitter users from which we measure sentiment, there is likely measurement error in our sentiment measure, and our estimates should be interpreted as a lower bound.
Given the empirical importance of incorporating social media data in our forecasting models, there is substantial scope for further work that generates new insights with finer measures of this data. For example, future work could consider extracting Twitter messages that capture only the views of market participants rather than the entire universe of Twitter users. Work is also needed to clearly identify bots and consider how best to handle fake Twitter accounts. Similarly, research could strive to understand shifting sentiment for different groups on social media in response to news events. This can help improve our understanding of how responses to unexpected news lead investors to reallocate across asset classes.^{20}
In summary, we remain at the early stages of extracting the full set of benefits from machine learning tools used to measure sentiment and conduct predictive analytics. For example, the Bitcoin market is international, but the tweets used to estimate sentiment in our analysis were initially written in English. Whether the findings are robust to the inclusion of tweets posted in other languages represents an open question for future research. As our understanding of how to account for real-world features of data increases with these data science tools, the full hype of machine learning and data science may be realized.
Acknowledgements
We wish to thank Yue Qiu, Jun Yu, and Tao Zeng, as well as seminar participants at Singapore Management University, for helpful comments and suggestions. Xie's research is supported by the Natural Science Foundation of China (71701175), the Chinese Ministry of Education Project of Humanities and Social Sciences (17YJC790174), and the Fundamental Research Funds for the Central Universities. Contact Tian Xie (e-mail: xietian@shufe.edu.cn) for any questions concerning the data and/or codes. The usual caveat applies.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (
http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Appendix: Data Resampling Techniques
Substantial progress has been made in the machine learning literature on quickly converting text to data, generating real-time information on social media content. In this study, we also explore the benefits of incorporating an aggregate measure of social media sentiment, the Wall Street Journal-IHS Markit US Sentiment Index (USSI), in forecasting the Bitcoin RV. However, data timing presents a serious challenge in using minute-level measures of the USSI to forecast the daily Bitcoin RV. To convert the minute-level USSI measure to match the sampling frequency of Bitcoin RV, we introduce a few popular data resampling techniques.
Let y_{t+h} be the target h-step-ahead low-frequency variable (e.g., the daily realized variance) that is sampled at periods denoted by a time index t for t = 1, …, n. Consider a higher-frequency predictor (e.g., the USSI) \(\boldsymbol {X}^{hi}_t\) that is sampled m times within period t:
$$\displaystyle \begin{aligned} \boldsymbol{X}_{t}^{hi}\equiv\left[X^{hi}_{t}, X^{hi}_{t-\frac{1}{m}}, \ldots, X^{hi}_{t-\frac{m-1}{m}}\right]^{\top }. {} \end{aligned} $$
(20)
A specific element among the high-frequency observations in \(\boldsymbol {X}^{hi}_t\) is denoted by \(X^{hi}_{t-\frac {i}{m}}\) for i = 0, …, m − 1. Denoting L^{i∕m} as the lag operator, \(X^{hi}_{t-\frac {i}{m}}\) can be re-expressed as \(X^{hi}_{t-\frac {i}{m}} = L^{i/m} X^{hi}_t\) for i = 0, …, m − 1.
Since \(\boldsymbol {X}_{t}^{hi}\) and y_{t+h} are measured at different frequencies, we need to convert the higher-frequency data to match the lower-frequency data. A simple average of the high-frequency observations in \(\boldsymbol {X}_{t}^{hi}\),
$$\displaystyle \begin{aligned} \bar{X}_{t}=\frac{1}{m}\sum_{i=0}^{m-1}L^{i/m}X_{t}^{hi}, \end{aligned}$$
is likely the easiest way to obtain a low-frequency X_{t} that matches the frequency of y_{t+h}. With y_{t+h} and \(\bar {X}_{t}\) measured in the same time domain, a regression approach is simply
$$\displaystyle \begin{aligned} y_{t+h}={\alpha }+\gamma \bar{X}_{t}+\epsilon _{t}={\alpha }+\frac{\gamma }{m}\sum_{i=0}^{m-1}L^{i/m}X_{t}^{hi}+\epsilon _{t}, {} \end{aligned} $$
(21)
where α is the intercept and γ is the slope coefficient on the time-averaged \(\bar {X}_{t}\). This approach assumes that each element in \(\boldsymbol {X}_{t}^{hi}\) has an identical effect on explaining y_{t+h}.
This homogeneity assumption may be quite strong in practice. One could instead allow the slope coefficient for each element in \(\boldsymbol {X }^{hi}_t\) to be unique. Following Lehrer et al. [28], extending Model (21) to allow for heterogeneous effects of the high-frequency observations generates
$$\displaystyle \begin{aligned} y_{t+h}= {\alpha }+\sum_{i=0}^{m-1}\gamma _{i}L^{i/m}X_{t}^{hi}+\epsilon _{t}, {} \end{aligned} $$
(22)
where γ_{i} represents the set of slope coefficients on the high-frequency observations \(X^{hi}_{t-\frac {i}{m}}\).
Since the γ_{i} are unknown, estimating these parameters can be problematic when m is relatively large. The heterogeneous mixed data sampling (H-MIDAS) method of Lehrer et al. [28] uses a step function to allow for heterogeneous effects of different high-frequency observations on the low-frequency dependent variable. A low-frequency \(\bar {X}_{t}^{\left ( l\right ) }\) can be constructed following
$$\displaystyle \begin{aligned} \bar{X}_{t}^{\left( l\right) } \equiv \frac{1}{l} \sum_{i=0}^{l-1}L^{i/m}X^{hi}_{t}=\frac{1}{l}\sum_{i=0}^{l-1}X^{hi}_{t-\frac{i}{m}}, {} \end{aligned} $$
(23)
where l is a predetermined number with l ≤ m. Equation (23) implies that we compute \(\bar {X}_{t}^{\left ( l\right ) }\) as a simple average of the first l observations in \(\boldsymbol {X}^{hi}_t\), ignoring the remaining observations. We consider different values of l and group all \(\bar {X}_{t}^{\left ( l\right ) }\) into \(\boldsymbol {\tilde {X}}_{t}\) such that
$$\displaystyle \begin{aligned} \boldsymbol{\tilde{X}}_{t}=\left[ \bar{X}_{t}^{\left( l_{1}\right) },\bar{X} _{t}^{\left( l_{2}\right) },\ldots ,\bar{X}_{t}^{\left( l_{p}\right) }\right] , \end{aligned}$$
where we set l_{1} < l_{2} < ⋯ < l_{p}. Consider a weight vector \(\boldsymbol {w}=\left [ w_{1},w_{2},\ldots ,w_{p}\right ] ^{\top }\) with \(\sum _{j=1}^{p}w_{j}=1\); we can then construct the regressor \({X}_{t}^{new}=\boldsymbol {\tilde {X}}_{t} \boldsymbol {w}\). The regression based on the H-MIDAS estimator can be expressed as
$$\displaystyle \begin{aligned} y_{t+h} =\beta {X}_{t}^{new}+\epsilon _{t} = \beta \sum_{s=1}^{p}\sum_{j=s}^{p}\frac{w_{j}}{l_{j}} \sum_{i=l_{s-1}}^{l_{s}-1}L^{i/m}X_{t}^{h}+\epsilon _{t} = \beta \sum_{s=1}^{p}\sum_{i=l_{s-1}}^{l_{s}-1}w_{s}^{\ast }L^{i/m}X_{t}^{h}+\epsilon _{t}, {} \end{aligned} $$
(24)
where l_{0} = 0 and \(w_{s}^{\ast }=\sum _{j=s}^{p}\frac {w_{j}}{l_{j}}\).
The weights w play a crucial role in this procedure. We first estimate the composite coefficient vector \(\widehat {\beta \boldsymbol {w}}\) from the regression of y_{t+h} on \(\boldsymbol {\tilde {X}}_{t}\) by any appropriate econometric method, where the weight vector is restricted to some predetermined weight set \(\mathcal {W}\). Once \(\widehat {\beta \boldsymbol {w}}\) is obtained, we estimate the weight vector \(\hat {\boldsymbol {w}}\) by rescaling,
$$\displaystyle \begin{aligned} \hat{\boldsymbol{w}} = \frac{\widehat{\beta\boldsymbol{w}}}{\text{Sum}( \widehat{\beta\boldsymbol{w}})}, \end{aligned}$$
since the coefficient β is a scalar.
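The two-step procedure can be sketched end to end on simulated data. Everything specific in this sketch is a hypothetical assumption rather than the chapter's empirical setup: the sample size T, the lag index l = [1, 3, 6], and the true values β = 2 and w = (0.5, 0.3, 0.2).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: T low-frequency periods, m = 6 high-frequency
# observations per period, lag index l = [1, 3, 6] (illustrative values).
T, m, lags = 300, 6, [1, 3, 6]
X_h = rng.normal(size=(T, m))  # row t holds [X_t, X_{t-1/m}, ..., X_{t-(m-1)/m}]

# Eq. (23): average the first l observations for each l in the lag index,
# ignore the rest, and stack the columns into tilde{X}_t.
X_tilde = np.column_stack([X_h[:, :l].mean(axis=1) for l in lags])

# Simulate y from the step-function model with beta = 2, w = (0.5, 0.3, 0.2).
w_true = np.array([0.5, 0.3, 0.2])
y = 2.0 * X_tilde @ w_true + 0.1 * rng.normal(size=T)

# Step 1: OLS recovers the composite coefficient vector beta*w in one shot.
beta_w_hat = np.linalg.lstsq(X_tilde, y, rcond=None)[0]

# Step 2: rescale so the estimated weights sum to one; because beta is a
# scalar and the true weights sum to one, Sum(beta*w) = beta.
beta_hat = beta_w_hat.sum()
w_hat = beta_w_hat / beta_hat
print(np.round(w_hat, 2), round(float(beta_hat), 2))
```

The rescaling step works precisely because β multiplies every element of w, so dividing by the sum of the composite estimates removes the common scale.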
Footnotes
1
See [
25,
26], for example, which conduct horse races between various strategies using data from the film industry. Medeiros et al. [
31] use the random forest estimator to examine the benefits of machine learning for forecasting inflation. Last, Coulombe et al. [
13] conclude that the benefits of machine learning over econometric approaches for macroeconomic forecasting stem from its ability to capture important nonlinearities that arise in the context of uncertainty and financial frictions.
2
Traditional econometric approaches to model and forecast volatility, such as the parametric GARCH or stochastic volatility models, include measures built on daily, weekly, and monthly frequency data. While popular, empirical studies indicate that they fail to capture all of the information in high-frequency data; see [
1,
7,
20], among others.
4
Corsi et al. [
12] provide a comprehensive review of the development of HAR-type models and their various extensions. The HAR model provides an intuitive economic interpretation: agents with three frequencies of trading (daily, weekly, and monthly) perceive and respond to changes in the corresponding components of volatility. Müller et al. [33] refer to this idea as the Heterogeneous Market Hypothesis. Nevertheless, the suitability of such a specification has not been sufficiently verified. Craioveanu and Hillebrand [14] employ a parallel computing method to investigate all of the possible combinations of lags (chosen within a maximum lag of 250) for the last two terms in the additive model, and they compare the in-sample and out-of-sample fitting performance.
5
We note that the assumption of equal weight is strong. Mai et al. [
29] find that social media sentiment is an important predictor in determining Bitcoin’s valuation, but not all social media messages have equal impact. Yet, our measure of social media sentiment is collected from all Twitter users, a more diverse group than the users of the cryptocurrency forums in [
29]. Thus, if we find any effect, it is likely a lower bound since our measure of social media sentiment likely has classical measurement error.
6
Mining is challenging, and miners are paid any transaction fees as well as a “subsidy” of newly created coins. For a new block to be considered valid, it must contain a proof of work that is verified by other Bitcoin nodes each time they receive a block. By downloading and verifying the blockchain, Bitcoin nodes are able to reach consensus about the ordering of events in Bitcoin. Any currency generated by a malicious user who does not follow the rules will be rejected by the network and is thus worthless. To keep new blocks challenging to mine, the difficulty is recalculated every 2016 blocks based on the rate at which recent blocks were found.
7
For example, the fund of legendary former Legg Mason Chief Investment Officer Bill Miller has been reported to have 50% exposure to cryptoassets. There is also a growing set of decentralized exchanges, including IDEX, 0x, etc., but their market shares remain low today. Furthermore, given the SEC’s recent charge against EtherDelta, a well-known Ethereum-based decentralized exchange, the future of decentralized exchanges faces significant uncertainty.
8
Apart from Bitcoin, there are more than 1600 other altcoins or cryptocurrencies listed on over 200 different exchanges. However, Bitcoin still maintains roughly 50% market dominance. At the end of December 2018, the market capitalization of Bitcoin was roughly 65 billion USD, at 3800 USD per token. On December 17, 2017, it reached a peak market capitalization of 330 billion USD, with almost 19,000 USD per Bitcoin, according to Coinmarketcap.com.
9
Using the log to transform the realized variance is standard in the literature; it avoids imposing positivity constraints and accounts for the fact that the residuals of the regression below exhibit heteroskedasticity related to the level of the process, as mentioned by Patton and Sheppard [34]. An alternative is to implement weighted least squares (WLS) on RV, but this does not suit our purpose of using the least squares model averaging method.
10
For example, Gu et al. [
19] perform a comparative analysis of machine learning methods for measuring asset risk premia. Ban et al. [
6] adopt machine learning methods for portfolio optimization. Beyond academic research, the popularity of algorithm-based quantitative exchange-traded funds (ETFs) has increased among investors, in part since, as LaFon [24] points out, they offer both lower management fees and lower volatility than traditional stock-picking funds.
11
This is an impossibility theorem that rules out the existence of a general-purpose universal optimization strategy. As such, researchers should examine the sensitivity of their findings to alternative strategies.
12
A best split is determined by a given loss function, for example, the reduction of the sum of squared residuals (SSR). A simple regression will yield a sum of squared residuals, SSR_{0}. Suppose we can split the original sample into two subsamples such that n = n_{1} + n_{2}. The RT method finds the best split of a sample to minimize the SSR from the two subsamples. That is, the SSR values computed from each subsample should satisfy SSR_{1} + SSR_{2} ≤ SSR_{0}.
13
This is a 10% random sample of all tweets, since the USSI was designed to measure the real-time mood of the nation, and the algorithm does not restrict the calculations to Twitter accounts that either mention any specific stock or are classified as being a market participant.
14
We provide full details on this strategy in the appendix. In practice, we need to select the lag index l = [l_{1}, …, l_{p}] and determine the weight set \(\mathcal {W}\) before the estimation. In this study, we set \(\mathcal {W}\equiv \{\boldsymbol {w}\in \mathbb {R} ^{p}:\sum _{j=1}^{p}w_{j}=1\}\) and use OLS to estimate \(\widehat {\beta \boldsymbol {w}}\). We consider h = 1, 2, 4, and 7 as in the main exercise. For the lag index, we consider l = [1 : 5 : 1440], given that there are 1440 minutes per day.
15
Additional results using both the GARCH(1, 1) and the ARFIMA(p, d, q) models are available upon request. These estimators performed poorly relative to the HAR model and as such are not included for space considerations.
16
Although not reported due to space considerations, we investigated other forecasting horizons and our main findings are robust.
17
PwC-Elwood [36] suggests that the capitalization of cryptocurrency hedge funds has increased at a steady pace since 2016.
18
Lehrer and Xie [
26] point out that all of the machine learning algorithms considered in this paper assume homoskedastic data. In their study, they discuss the consequences of heteroskedasticity for these algorithms and the resulting predictions, as well as propose alternatives for such data.
19
Lehrer et al. [
27] consider the use of model averaging with HAR models to account for heterogeneous investors.
20
As an example, following the removal of Ivanka Trump’s fashion line from their stores, President Trump issued a statement via Twitter:
My daughter Ivanka has been treated so unfairly by @Nordstrom. She is a great person – always pushing me to do the right thing! Terrible!
The general public disagreed with President Trump’s stance on Nordstrom, so aggregate Twitter sentiment measures rose. The immediate negative effect of the Tweet on Nordstrom stock, a decline of 1% in the minute following the tweet, was fleeting: the stock closed the session with a gain of 4.1%. See
http://www.marketwatch.com/story/nordstrom-recovers-from-trumps-terrible-tweet-in-just-4-minutes-2017-02-08 for more details on this episode.