
Open Access 2021 | OriginalPaper | Chapter

Do the Hype of the Benefits from Using New Data Science Tools Extend to Forecasting Extremely Volatile Assets?

Authors : Steven F. Lehrer, Tian Xie, Guanxi Yi

Published in: Data Science for Economics and Finance

Publisher: Springer International Publishing


Abstract

This chapter first provides an illustration of the benefits of using machine learning for forecasting relative to traditional econometric strategies. We consider the short-term volatility of the Bitcoin market using realized volatility observations. Our analysis highlights the importance of accounting for nonlinearities to explain the gains of machine learning algorithms and examines the robustness of our findings to the selection of hyperparameters. This provides an illustration of how different machine learning estimators improve the development of forecast models by relaxing the functional form assumptions that are made explicit when writing up an econometric model. Our second contribution is to illustrate how deep learning can be used to measure market-level sentiment from a 10% random sample of Twitter users. This sentiment variable significantly improves forecast accuracy for every econometric estimator and machine learning algorithm considered in our forecasting application. This provides an illustration of the benefits of new tools from the natural language processing literature at creating variables that can improve the accuracy of forecasting models.

1 Introduction

Over the past few years, the hype surrounding words ranging from big data to data science to machine learning has increased from already high levels. This hype arises in part from three sets of discoveries. First, machine learning tools have repeatedly been shown in the academic literature to outperform statistical and econometric techniques for forecasting.¹ Further, tools developed in the natural language processing literature that are used to extract population sentiment measures have also been found to help forecast the value of financial indices. This set of findings is consistent with arguments in the behavioral finance literature (see [23], among others) that the sentiment of investors can influence stock market activity. Last, concerns surrounding data security and privacy have grown among the population as a whole, leading governments to consider blockchain technology for uses beyond what it was initially developed for.

Blockchain technology was originally developed for the cryptocurrency Bitcoin, an asset that can be continuously traded and whose value has been quite volatile. This volatility may present further challenges for forecasts by either machine learning algorithms or econometric strategies. Adding to these challenges is that unlike almost every other financial asset, Bitcoin is traded on both the weekend and holidays. As such, modeling the estimated daily realized variance of Bitcoin in US dollars presents an additional challenge. Many measures of conventional economic and financial data commonly used as predictors are not collected at the same points in time. However, since the behavioral finance literature has linked population sentiment measures to the price of different financial assets, we propose measuring and incorporating social media sentiment as an explanatory variable in the forecasting model. As an explanatory predictor, social media sentiment can be measured continuously providing a chance to capture and forecast the variation in the prices at which trades for Bitcoin are made.

In this chapter, we consider forecasts of Bitcoin realized volatility to first provide an illustration of the benefits in terms of forecast accuracy of using machine learning relative to traditional econometric strategies. While prior work contrasting approaches to conduct a forecast found that machine learning does provide gains primarily from relaxing the functional form assumptions that are made explicit when writing up an econometric model, those studies did not consider predicting an outcome that exhibits a degree of volatility of the magnitude of Bitcoin.

Determining strategies that can improve volatility forecasts is of significant value since they have come to play a large role in decisions ranging from asset allocation to derivative pricing and risk management. That is, volatility forecasts are used by traders as a component of their valuation procedure of any risky asset’s value (e.g., stock and bond prices), since the procedure requires assessing the level and riskiness of future payoffs. Further, their value to many investors arises when using a strategy that adjusts their holdings to equate the risk stemming from the different investments included in a portfolio. As such, more accurate volatility forecasts can provide valuable actionable insights for market participants. Finally, additional motivation for determining how to obtain more accurate forecasts comes from the financial media, who frequently report on market volatility since it is hypothesized to have an impact on public confidence and thereby can have a significant effect on the broader global economy.

There are many approaches that could be potentially used to undertake volatility forecasts, but each requires an estimate of volatility. At present, the most popular method used in practice to estimate volatility was introduced by Andersen and Bollerslev [1] who proposed using the realized variance, which is calculated as the cumulative sum of squared intraday returns over short time intervals during the trading day.² Realized volatility possesses a slowly decaying autocorrelation function, sometimes known as long memory.³ Various econometric models have been proposed to capture the stylized facts of these high-frequency time series models including the autoregressive fractionally integrated moving average (ARFIMA) models of Andersen et al. [3] and the heterogeneous autoregressive (HAR) model proposed by Corsi [11]. Compared with the ARFIMA model, the HAR model rapidly gained popularity, in part due to its computational simplicity and excellent out-of-sample forecasting performance. ⁴

In our empirical exercise, we first use well-established machine learning techniques within the HAR framework to explore the benefits of allowing for general nonlinearities with recursive partitioning methods as well as sparsity using the least absolute shrinkage and selection operator (LASSO) of Tibshirani [39]. We consider alternative ensemble recursive partitioning methods including bagging and random forest that each place equal weight on all observations when making a forecast, as well as boosting that places alternative weight based on the degree of fit. In total, we evaluate nine conventional econometric methods and five easy-to-implement machine learning methods to model and forecast the realized variance of Bitcoin measured in US dollars.

Studies in the financial econometric literature have reported that a number of different variables are potentially relevant for the forecasting of future volatility. A secondary goal of our empirical exercise is to determine if there are gains in forecast accuracy of realized volatility by incorporating a measure of social media sentiment. We contrast forecasts using models that both include and exclude social media sentiment. This additional exercise allows us to determine if this measure provides information that is not captured by either the asset-specific realized volatility histories or other explanatory variables that are often included in the information set.

Specifically, in our application social media sentiment is measured by adopting a deep learning algorithm introduced in [17]. We use a random sample of 10% of all tweets posted from users based in the United States from the Twitterverse collected at the minute level. This allows us to calculate a sentiment score that is an equal tweet weight average of the sentiment values of the words within each Tweet in our sample at the minute level.⁵ It is well known that there are substantial intraday fluctuations in social media sentiment but its weekly and monthly aggregates are much less volatile. This intraday volatility may capture important information and presents an additional challenge when using this measure for forecasting since the Bitcoin realized variance is measured at the daily level, a much lower time frequency than the minute-level sentiment index that we refer to as the US Sentiment Index (USSI). Rather than make ad hoc assumptions on how to aggregate the USSI to the daily level, we follow Lehrer et al. [28] and adopt the heterogeneous mixed data sampling (H-MIDAS) method that constructs empirical weights to aggregate the high-frequency social media data to a lower frequency.

Our analysis illustrates that sentiment measures extracted from Twitter can significantly improve forecasting efficiency. Measured by the pseudo R-squared, forecast accuracy increased by over 50% when social media sentiment was included in the information set, for all of the machine learning and econometric strategies considered. Moreover, using four different criteria for forecast accuracy, we find that the machine learning techniques considered tend to outperform the econometric strategies and that these gains arise by incorporating nonlinearities. Among the 16 methods considered in our empirical exercise, both bagging and random forest yield the highest forecast accuracy. Results from the [18] test indicate that the improvements that each of these two algorithms offers are statistically significant at the 5% level, yet the two algorithms are statistically indistinguishable from each other.

For practitioners, our empirical exercise also examines the sensitivity of our findings to the choices of hyperparameters made when implementing any machine learning algorithm. This provides value since the settings of the hyperparameters with any machine learning algorithm can be thought of in an analogous manner to model selection in econometrics. For example, with the random forest algorithm, numerous hyperparameters can be adjusted by the researcher including the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain, and the number of trees. Further, Probst and Boulesteix provide evidence that the benefits from changing hyperparameters differ across machine learning algorithms and are higher with the support vector regression than the random forest algorithm we employ. In our analysis, the default values of the hyperparameters specified in software packages work reasonably well, but we stress a caveat that our investigation was not exhaustive, so there remains a possibility that particular combinations of hyperparameters with each algorithm may lead to changes in the ordering of forecast accuracy in the empirical horse race presented. Thus, there may be a set of hyperparameters under which the winning algorithms perform distinguishably differently from the others they are compared to.

This chapter is organized as follows. In the next section, we briefly describe Bitcoin. Sections 3 and 4 provide a more detailed overview of existing HAR strategies as well as conventional machine learning algorithms. Section 5 describes the data we utilize and explains how we measure and incorporate social media data into our empirical exercise. Section 6 presents our main empirical results that compare the forecasting performance of each method introduced in Sects. 3 and 4 in a rolling window exercise. To focus on whether social media sentiment data adds value, we contrast the results of incorporating the USSI variable in each strategy to excluding this variable from the model. For every estimator considered, we find that incorporating the USSI variable as a covariate leads to significant improvements in forecast accuracy. We examine the robustness of our results by considering (1) different experimental settings, (2) different hyperparameters, and (3) incorporating covariates on the value of mainstream assets, in Sect. 7. We find that our main conclusions are robust to both changes in the hyperparameters and various settings, and that there is little benefit from incorporating mainstream asset markets when forecasting the realized volatility in the value of Bitcoin. Section 8 concludes by providing additional guidance to practitioners to ensure that they can gain the full value of the hype for machine learning and social media data in their applications.

2 What Is Bitcoin?

Bitcoin, the first and still one of the most popular applications of the blockchain technology by far, was introduced in 2008 by a person or group of people known by the pseudonym, Satoshi Nakamoto. Blockchain technology allows digital information to be distributed but not copied. Basically, a time-stamped series of immutable records of data are managed by a cluster of computers that are not owned by any single entity. Each of these blocks of data (i.e., block) is secured and bound to each other using cryptographic principles (i.e., chain). The blockchain network has no central authority and all information on the immutable ledger is shared. The information on the blockchain is transparent and each individual involved is accountable for their actions.

The group of participants who uphold the blockchain network ensure that it can neither be hacked nor tampered with. Additional units of currency are created by the nodes of a peer-to-peer network using a generation algorithm that ensures decreasing supply and that was designed to mimic the rate at which gold was mined. Specifically, when a user/miner discovers a new block, they are currently awarded 12.5 Bitcoins. However, the number of new Bitcoins generated per block is set to decrease geometrically, with a 50% reduction every 210,000 blocks. The amount of time it takes to find a new block can vary based on mining power and the network difficulty.⁶ This process is why it can be treated by investors as an asset and ensures that causes of inflation such as printing more currency or imposing capital controls by a central authority cannot take place. The latter monetary policy actions motivated the use of Bitcoin, the first cryptocurrency, as a replacement for fiat currencies.

Bitcoin is distinguished from other major asset classes by its basis of value, governance, and applications. Bitcoin can be converted to a fiat currency using a cryptocurrency exchange, such as Coinbase or Kraken, among other online options. These online marketplaces are similar to the platforms that traders use to buy stock. In September 2015, the Commodity Futures Trading Commission (CFTC) in the United States officially designated Bitcoin as a commodity. Furthermore, the Chicago Mercantile Exchange in December 2017 launched a Bitcoin future (XBT) option, using Bitcoin as the underlying asset. Although there are emerging crypto-focused funds and other institutional investors,⁷ this market remains retail investor dominated.⁸

There is substantial volatility in BTC/USD, and the sharp price fluctuations in this digital currency greatly exceed those of most fiat currencies. Much research has explored why Bitcoin is so volatile; our interest is strictly to examine different empirical strategies to forecast this volatility, which greatly exceeds that of other assets including most stocks and bonds.

3 Bitcoin Data and HAR-Type Strategies to Forecast Volatility

The price of Bitcoin is often reported to experience wild fluctuations. We follow Xie [42] who evaluates model averaging estimators with data on the Bitcoin price in US dollars (henceforth BTC/USD) at a 5-min frequency between May 20, 2015, and Aug 20, 2017. These data were obtained from Poloniex, one of the largest US-based digital asset exchanges. Following Andersen and Bollerslev [1], we estimate the daily realized volatility at day t ($\text{RV}_{t}$) by summing the corresponding M equally spaced intra-daily squared returns $r_{t,j}$. Here, the subscript t indexes the day, and j indexes the time interval within day t:

$$\displaystyle \begin{aligned} \text{RV}_{t}\equiv \sum_{j=1}^{M}r_{t,j}^{2} \end{aligned} $$

(1)

where t = 1, 2, …, n, j = 1, 2, …, M, and $r_{t,j}$ is the difference between log-prices $p_{t,j}$ (that is, $r_{t,j}=p_{t,j}-p_{t,j-1}$). Poloniex is an active exchange that is always in operation, every minute of each day in the year. We define a trading day using Eastern Standard Time and, with these data, calculate the realized volatility of BTC/USD for 775 days. The evolution of the RV data over this full sample period is presented in Fig. 1.
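As a minimal sketch of Eq. (1), and not the code used in the chapter, the daily realized variance can be computed from one day's intraday prices as follows. The function name and the toy prices are hypothetical; to obtain all M returns for a day, the list would also need the last price of the previous interval as its first element.

```python
import math

def realized_variance(prices):
    """Sum of squared log returns for one day's intraday prices, as in Eq. (1)."""
    log_p = [math.log(p) for p in prices]
    # r_{t,j} = p_{t,j} - p_{t,j-1}, where p denotes the log-price
    returns = [b - a for a, b in zip(log_p, log_p[1:])]
    return sum(r * r for r in returns)

# Hypothetical 5-minute BTC/USD prices for part of one trading day.
day_prices = [250.0, 251.0, 249.5, 250.5, 252.0]
rv = realized_variance(day_prices)
```

Repeating the call for each of the n trading days yields the daily series of realized variances used in the models that follow.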

In this section, we introduce some HAR-type strategies that are popular in modeling volatility. The standard HAR model of Corsi [11] postulates that the h-step-ahead daily $\text{RV}_{t+h}$ can be modeled by⁹

$$\displaystyle \begin{aligned} \log\text{RV}_{t+h}=\beta _{0}+\beta _{d}\log\text{RV}_{t}^{(1)}+\beta _{w}\log \text{RV}_{t}^{(5)}+\beta _{m}\log\text{RV}_{t}^{(22)}+e_{t+h}, \end{aligned} $$

(2)

where the βs are the coefficients and $\{e_{t}\}_{t}$ is a zero mean innovation process. The explanatory variables take the general form of $\log \text{RV}_{t}^{(l)}$, defined as the l-period average of daily log RV:

$$\displaystyle \begin{aligned} \log\text{RV}_{t}^{(l)}\equiv {l}^{-1}\sum_{s=1}^{l}\log\text{RV}_{t-s}. \end{aligned} $$

Another popular formulation of the HAR model in Eq. (2) ignores the logarithmic form and considers

$$\displaystyle \begin{aligned} \text{RV}_{t+h}=\beta _{0}+\beta _{d}\text{RV}_{t}^{(1)}+\beta _{w}\text{RV} _{t}^{(5)}+\beta _{m}\text{RV}_{t}^{(22)}+e_{t+h}, {} \end{aligned} $$

(3)

where $\text{RV}_{t}^{(l)}\equiv {l}^{-1}\sum _{s=1}^{l}\text{RV}_{t-s}$.
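The averaged regressors defined above can be sketched as follows. This is an illustrative construction with hypothetical function names; it follows the definition as written, so the average for day t uses the l observations dated t−1 through t−l.

```python
def har_average(rv, t, l):
    """RV_t^{(l)}: mean of RV_{t-1}, ..., RV_{t-l} (t is a 0-based index)."""
    if t < l:
        raise ValueError("need at least l past observations")
    return sum(rv[t - s] for s in range(1, l + 1)) / l

def har_row(rv, t):
    """Regressor vector [1, RV^(1), RV^(5), RV^(22)] for the HAR model in Eq. (3)."""
    return [1.0, har_average(rv, t, 1), har_average(rv, t, 5), har_average(rv, t, 22)]

# Toy series for illustration: RV_0 = 1, RV_1 = 2, ..., RV_29 = 30.
rv = [float(i) for i in range(1, 31)]
row = har_row(rv, 22)
```

The first 22 days of the sample cannot form a complete row, which is why the estimation sample is shorter than the raw RV series.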

In an important paper, Andersen et al. [4] extend the standard HAR model from two perspectives. First, they added a daily jump component (J_t) to Eq. (3). The extended model is denoted as the HAR-J model:

$$\displaystyle \begin{aligned} \text{RV}_{t+h}=\beta _{0}+\beta _{d}\text{RV}_{t}^{(1)}+\beta _{w}\text{RV} _{t}^{(5)}+\beta _{m}\text{RV}_{t}^{(22)}+\beta ^{j}\text{J}_{t}+e_{t+h}, {} \end{aligned} $$

(4)

where the empirical measurement of the squared jumps is $\text{J}_{t}=\max (\text{RV}_{t}-\text{BPV}_{t},0)$ and the standardized realized bipower variation (BPV) is defined as

$$\displaystyle \begin{aligned} \text{BPV}_{t}\equiv (2/\pi )^{-1}\sum_{j=2}^{M}|r_{t,j-1}||r_{t,j}|. \end{aligned}$$
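A short sketch of the jump measure, under the assumption that one day's intraday returns are available as a plain list (the function names and the two toy return series are hypothetical):

```python
import math

def bipower_variation(returns):
    """BPV_t = (2/pi)^{-1} * sum_{j=2}^{M} |r_{t,j-1}| * |r_{t,j}|."""
    cross = sum(abs(a) * abs(b) for a, b in zip(returns, returns[1:]))
    return (math.pi / 2.0) * cross

def jump_component(returns):
    """J_t = max(RV_t - BPV_t, 0)."""
    rv = sum(r * r for r in returns)
    return max(rv - bipower_variation(returns), 0.0)

smooth = [0.01, -0.01, 0.01, -0.01]   # returns of similar size: no jump measured
spiky = [0.001, 0.05, -0.001, 0.001]  # one large return dominates RV
```

Because BPV multiplies adjacent absolute returns, a single large return raises RV far more than BPV, so the difference isolates the jump.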

Second, through a decomposition of RV into the continuous sample path and the jump components based on the $Z_{t}$ statistic [22], Andersen et al. [4] extend the HAR-J model by explicitly incorporating the two types of volatility components mentioned above. The $Z_{t}$ statistic respectively identifies the “significant” jumps $\text{CJ}_{t}$ and continuous sample path components $\text{CSP}_{t}$ by

$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{CSP}_{t} & \equiv &\displaystyle \mathbb{I}(Z_{t}\leq \varPhi _{\alpha })\cdot \text{RV} _{t}+\mathbb{I}(Z_{t}> \varPhi _{\alpha })\cdot \text{BPV}_{t}, \\ \text{CJ}_{t} & =&\displaystyle \mathbb{I}(Z_{t}>\varPhi _{\alpha })\cdot (\text{RV}_{t}- \text{BPV}_{t}). \end{array} \end{aligned} $$

where $Z_{t}$ is the ratio-statistic defined in [22] and $\varPhi _{\alpha }$ is the critical value implied by the cumulative distribution function (CDF) of a standard Gaussian distribution at the α level of significance. The daily, weekly, and monthly average components of $\text{CSP}_{t}$ and $\text{CJ}_{t}$ are then constructed in the same manner as $\text{RV}_{t}^{(l)}$. The model specification for the continuous HAR-J, namely, HAR-CJ, is given by

$$\displaystyle \begin{aligned} \text{RV}_{t+h}=\beta _{0}+\beta _{d}^{c}\text{CSP}_{t}^{(1)}+\beta _{w}^{c} \text{CSP}_{t}^{(5)}+\beta _{m}^{c}\text{CSP}_{t}^{(22)}+\beta _{d}^{j}\text{ CJ}_{t}^{(1)}+\beta _{w}^{j}\text{CJ}_{t}^{(5)}+\beta _{m}^{j}\text{CJ} _{t}^{(22)}+e_{t+h}. {} \end{aligned} $$

(5)

Note that compared with the HAR-J model, the HAR-CJ model explicitly controls for the weekly and monthly components of continuous jumps. Thus, the HAR-J model can be treated as a special and restrictive case of the HAR-CJ model for

$$\displaystyle \begin{aligned}\beta _{d}=\beta _{d}^{c}+\beta _{d}^{j}, \beta ^{j}=\beta _{d}^{j}, \beta _{w}=\beta _{w}^{c}+\beta _{w}^{j},\ \text{and}\ \beta _{m}=\beta _{m}^{c}+\beta _{m}^{j}.\end{aligned} $$
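The decomposition into $\text{CSP}_{t}$ and $\text{CJ}_{t}$ given above can be sketched as a single function. The ratio-statistic $Z_{t}$ is defined in [22] and is not reproduced in this chapter, so the sketch takes it as an input; the function name and the numbers in the example are hypothetical.

```python
def split_continuous_jump(rv, bpv, z, critical_value):
    """Return (CSP_t, CJ_t) given RV_t, BPV_t, the statistic Z_t and Phi_alpha."""
    if z > critical_value:
        # Significant jump: BPV measures the continuous part, the rest is the jump.
        return bpv, rv - bpv
    # No significant jump: all of RV is attributed to the continuous sample path.
    return rv, 0.0

# Hypothetical values for one day with a significant jump, then one without.
csp, cj = split_continuous_jump(rv=0.004, bpv=0.003, z=3.1, critical_value=2.326)
csp0, cj0 = split_continuous_jump(rv=0.004, bpv=0.003, z=1.0, critical_value=2.326)
```

In both branches the two components add up to $\text{RV}_{t}$, which is what links the HAR-CJ regressors back to those of the HAR and HAR-J models.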

To capture the role of the “leverage effect” in predicting volatility dynamics, Patton and Sheppard [34] develop a series of models using signed realized measures. The first model, denoted as HAR-RS-I, decomposes the daily RV in the standard HAR model (3) into two asymmetric semi-variances $\text{ RS}_{t}^{+}$ and $\text{RS}_{t}^{-}$:

$$\displaystyle \begin{aligned} \text{RV}_{t+h}=\beta _{0}+\beta _{d}^{+}\text{RS}_{t}^{+}+\beta _{d}^{-} \text{RS}_{t}^{-}+\beta _{w}\text{RV}_{t}^{(5)}+\beta _{m}\text{RV} _{t}^{(22)}+e_{t+h}, {}\end{aligned} $$

(6)

where $\text{RS}_{t}^{-}=\sum _{j=1}^{M}r_{t,j}^{2}\cdot \mathbb {I} (r_{t,j}<0) $ and $\text{RS}_{t}^{+}=\sum _{j=1}^{M}r_{t,j}^{2}\cdot \mathbb {I }(r_{t,j}>0) $. To verify whether the realized semi-variances add something beyond the classical leverage effect, Patton and Sheppard [34] augment the HAR-RS-I model with a term interacting the lagged RV with an indicator for negative lagged daily returns $\text{RV}_{t}^{(1)}\cdot \mathbb {I}(r_{t}<0)$ . The second model in Eq. (7) is denoted as HAR-RS-II:

$$\displaystyle \begin{aligned} \text{RV}_{t+h}=\beta _{0}+\beta _{1}\text{RV}_{t}^{(1)}\cdot \mathbb{I} (r_{t}<0)+\beta _{d}^{+}\text{RS}_{t}^{+}+\beta _{d}^{-}\text{RS} _{t}^{-}+\beta _{w}\text{RV}_{t}^{(5)}+\beta _{m}\text{RV} _{t}^{(22)}+e_{t+h}, {} \end{aligned} $$

(7)

where $\text{RV}_{t}^{(1)}\cdot \mathbb {I}(r_{t}<0)$ is designed to capture the effect of negative daily returns. As in the HAR-CJ model, the third and fourth models in [34], denoted as HAR-SJ-I and HAR-SJ-II, respectively, disentangle the signed jump variations and the BPV from the volatility process:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{RV}_{t+h} & =&\displaystyle \beta _{0}+\beta _{d}^{j}\text{SJ}_{t}+\beta _{d}^{bpv} \text{BPV}_{t}+\beta _{w}\text{RV}_{t}^{(5)}+\beta _{m}\text{RV} _{t}^{(22)}+e_{t+h}, {} \end{array} \end{aligned} $$

(8)

$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{RV}_{t+h} & =&\displaystyle \beta _{0}+\beta _{d}^{j-}\text{SJ}_{t}^{-}+\beta _{d}^{j+}\text{SJ}_{t}^{+}+\beta _{d}^{bpv}\text{BPV}_{t}+\beta _{w}\text{RV} _{t}^{(5)}+\beta _{m}\text{RV}_{t}^{(22)}+e_{t+h}, {} \end{array} \end{aligned} $$

(9)

where $\text{SJ}_{t}=\text{RS}_{t}^{+}-\text{RS}_{t}^{-}$, $\text{SJ} _{t}^{+}=\text{SJ}_{t}\cdot \mathbb {I}(\text{SJ}_{t}>0)$, and $\text{SJ} _{t}^{-}=\text{SJ}_{t}\cdot \mathbb {I}(\text{SJ}_{t}<0)$. The HAR-SJ-II model extends the HAR-SJ-I model by being more flexible to allow the effect of a positive jump variation to differ in unsystematic ways from the effect of a negative jump variation.
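The signed measures used in Eqs. (6) to (9) can be sketched directly from their definitions. The function names and the toy returns are hypothetical.

```python
def semivariances(returns):
    """Return (RS_t^+, RS_t^-): squared intraday returns split by sign."""
    rs_pos = sum(r * r for r in returns if r > 0)
    rs_neg = sum(r * r for r in returns if r < 0)
    return rs_pos, rs_neg

def signed_jumps(returns):
    """Return (SJ_t, SJ_t^+, SJ_t^-) as used in HAR-SJ-I and HAR-SJ-II."""
    rs_pos, rs_neg = semivariances(returns)
    sj = rs_pos - rs_neg
    sj_pos = sj if sj > 0 else 0.0   # SJ_t * I(SJ_t > 0)
    sj_neg = sj if sj < 0 else 0.0   # SJ_t * I(SJ_t < 0)
    return sj, sj_pos, sj_neg

# Toy day in which negative returns carry most of the variation.
returns = [0.02, -0.01, 0.01, -0.03]
rs_pos, rs_neg = semivariances(returns)
sj, sj_pos, sj_neg = signed_jumps(returns)
```

When no intraday return is exactly zero, the two semi-variances sum to $\text{RV}_{t}$, so HAR-RS-I nests the standard HAR model when $\beta _{d}^{+}=\beta _{d}^{-}$.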

The models discussed above can be generalized using the following formulation in practice:

$$\displaystyle \begin{aligned} y_{t+h}=\boldsymbol{x}_{t}\boldsymbol{\beta }+e_{t+h} \end{aligned}$$

for t = 1, …, n, where $y_{t+h}$ stands for $\text{RV}_{t+h}$ and the vector $\boldsymbol{x}_{t}$ collects all the explanatory variables such that

$$\displaystyle \begin{aligned} \boldsymbol{x}_{t}\equiv \left\{ \begin{array}{ll} \big[1,\text{RV}_{t}^{(1)},\text{RV}_{t}^{(5)},\text{RV}_{t}^{(22)}\big] & \text{for model HAR in (3)}, \\ \big[1,\text{RV}_{t}^{(1)},\text{RV}_{t}^{(5)},\text{RV}_{t}^{(22)},\text{J} _{t}\big] & \text{for model HAR-J in (4)}, \\ \big[1,\text{CSP}_{t}^{(1)},\text{CSP}_{t}^{(5)},\text{CSP}_{t}^{(22)},\text{ CJ}_{t}^{(1)},\text{CJ}_{t}^{(5)},\text{CJ}_{t}^{(22)}\big] & \text{for model HAR-CJ in (5)}, \\ \big[1,\text{RS}_{t}^{-},\text{RS}_{t}^{+},\text{RV}_{t}^{(5)},\text{RV} _{t}^{(22)}\big] & \text{for model HAR-RS-I in (6)}, \\ \big[1,\text{RV}_{t}^{(1)}\mathbb{I}_{{r_{t}<0}},\text{RS}_{t}^{-},\text{RS} _{t}^{+},\text{RV}_{t}^{(5)},\text{RV}_{t}^{(22)}\big] & \text{for model HAR-RS-II in (7)}, \\ \big[1,\text{SJ}_{t},\text{BPV}_{t},\text{RV}_{t}^{(5)},\text{RV}_{t}^{(22)} \big] & \text{for model HAR-SJ-I in (8)}, \\ \big[1,\text{SJ}_{t}^{-},\text{SJ}_{t}^{+},\text{BPV}_{t},\text{RV} _{t}^{(5)},\text{RV}_{t}^{(22)}\big] & \text{for model HAR-SJ-II in (9)}. \end{array} \right.\end{aligned} $$

Since $y_{t+h}$ is not yet observed in period t, in practice, we usually obtain the estimated coefficient $\hat {\boldsymbol {\beta }}$ from the following model:

$$\displaystyle \begin{aligned} y_{t}=\boldsymbol{x}_{t-h}\boldsymbol{\beta }+e_{t}, {}\end{aligned} $$

(10)

in which both the independent and dependent variables are feasible in period t = 1, …, n. Once the estimated coefficients $\hat {\boldsymbol {\beta }}$ are obtained, the h-step-ahead forecast can be estimated by

$$\displaystyle \begin{aligned} \hat y_{t+h}= \boldsymbol{x}_{t}\hat{\boldsymbol{\beta }}\ \text{for}\ t=1,\ldots,n.\end{aligned}$$
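The two steps above (estimate Eq. (10) on lagged regressors, then apply the coefficients to the latest regressors) can be sketched with ordinary least squares. This is an illustrative standard-library implementation via the normal equations, not the authors' code; all function names are hypothetical, and a numerical library would normally replace the hand-written solver.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(x_rows, y):
    """OLS coefficients from the normal equations (X'X) beta = X'y."""
    k = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x_rows, y)) for i in range(k)]
    return solve(xtx, xty)

def fit_and_forecast(x_rows, y, h):
    """Pair y_t with x_{t-h} as in Eq. (10), then forecast h steps past the last row."""
    beta = ols(x_rows[:-h], y[h:])
    forecast = sum(b * v for b, v in zip(beta, x_rows[-1]))
    return beta, forecast
```

Here `x_rows[t]` and `y[t]` are assumed to refer to the same period t, so dropping the last h rows of the regressors and the first h outcomes produces the pairs $(y_{t},\boldsymbol{x}_{t-h})$.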

4 Machine Learning Strategy to Forecast Volatility

Machine learning tools are increasingly being used in the forecasting literature.¹⁰ In this section, we briefly describe five of the most popular machine learning algorithms that have been shown to outperform econometric strategies when conducting forecasts. That said, as Lehrer and Xie [26] stress, the “No Free Lunch” theorem of Wolpert and Macready [41] indicates that in practice, multiple algorithms should be considered in any application.¹¹

The first strategy we consider was developed to assist in the selection of predictors in the main model. Consider the regression model in Eq. (10), which contains many explanatory variables. To reduce the dimensionality of the set of the explanatory variables, Tibshirani [39] proposed the LASSO estimator of $\hat {\boldsymbol { \beta }}$ that solves

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Equ12_HTML.png

(11)

where λ is a tuning parameter that controls the penalty term. Using the estimates of Eq. (11), the h-step-ahead forecast is constructed in an identical manner as OLS:

$$\displaystyle \begin{aligned} \hat{y}_{t+h}^{\text{LASSO}}=\boldsymbol{x}_{t}\hat{\boldsymbol{\beta }}^{ \text{LASSO}}. \end{aligned}$$

The LASSO has been used in many applications and a general finding is that it is more likely to offer benefits relative to the OLS estimator when either (1) the number of regressors exceeds the number of observations, since it involves shrinkage, or (2) the number of parameters is large relative to the sample size, necessitating some form of regularization.
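To make the role of λ concrete, the sketch below fits a LASSO by cyclic coordinate descent. It assumes the common objective $\frac{1}{2n}\sum_{t}(y_{t}-\boldsymbol{x}_{t-h}\boldsymbol{\beta})^{2}+\lambda \sum_{j}|\beta _{j}|$ with an unpenalized constant; the exact scaling in the chapter's Eq. (11) may differ, and the function names and toy data are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam, returning 0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso(x_rows, y, lam, sweeps=200):
    """Coordinate descent for (1/2n) * SSR + lam * sum_j |beta_j|.

    Column 0 is assumed to be the constant and is left unpenalized.
    """
    n, k = len(y), len(x_rows[0])
    beta = [0.0] * k
    for _ in range(sweeps):
        for j in range(k):
            # Correlate predictor j with residuals that exclude its own contribution.
            rho = 0.0
            for row, yi in zip(x_rows, y):
                fitted_without_j = sum(b * v for b, v in zip(beta, row)) - beta[j] * row[j]
                rho += row[j] * (yi - fitted_without_j)
            rho /= n
            scale = sum(row[j] ** 2 for row in x_rows) / n
            beta[j] = rho / scale if j == 0 else soft_threshold(rho, lam) / scale
    return beta

# Toy data: a strong predictor x1 and a weak predictor x2, orthogonal to each other.
x1 = [-2.0, -1.0, 1.0, 2.0, -2.0, -1.0, 1.0, 2.0]
x2 = [1.0, -1.0, -1.0, 1.0, -1.0, 1.0, 1.0, -1.0]
x_rows = [[1.0, a, b] for a, b in zip(x1, x2)]
y = [1.0 + 2.0 * a + 0.1 * b for a, b in zip(x1, x2)]
beta = lasso(x_rows, y, lam=0.5)
```

With this penalty the weak predictor's coefficient is set exactly to zero while the strong one is shrunk from 2 toward zero, which is the selection-plus-shrinkage behavior described above.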

Recursive partitioning methods do not model the relationship between the explanatory variables and the outcome being forecasted with a regression model such as Eq. (10). Breiman et al. [10] propose a strategy known as classification and regression trees (CART), in which classification is used to forecast qualitative outcomes including categorical responses of non-numeric symbols and texts, and regression trees focus on quantitative response variables. Since the realized volatility of Bitcoin is a continuous variable, we use regression trees (RT).

Consider a sample of $\{y_{t},\boldsymbol {x}_{t-h}\}_{t=1}^{n}$. Intuitively, RT operates in a similar manner to forward stepwise regression. A fast divide-and-conquer greedy algorithm considers all possible splits in each explanatory variable to recursively partition the data. Formally, a node τ of the tree, containing $n_{\tau }$ observations with mean outcome $\overline {y}(\tau )$, can only be split by one selected explanatory variable into two leaves, denoted as $\tau _{L}$ and $\tau _{R}$. The split is made at the explanatory variable which will lead to the largest reduction of a predetermined loss function between the two regions.¹² This splitting process continues at each new node until the gain from any further split adds little value relative to a predetermined boundary. Forecasts at each final leaf are the fitted value from a local constant regression model.
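The greedy search for a single split can be sketched as follows. The sum of squared residuals is assumed as the loss function here, and the function names and toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x_rows, y):
    """Return (variable index, threshold, loss reduction) of the best single split."""
    parent = sse(y)
    best = (None, None, 0.0)
    for k in range(len(x_rows[0])):
        # Every observed value except the largest is a candidate threshold.
        for threshold in sorted({row[k] for row in x_rows})[:-1]:
            left = [yi for row, yi in zip(x_rows, y) if row[k] <= threshold]
            right = [yi for row, yi in zip(x_rows, y) if row[k] > threshold]
            gain = parent - sse(left) - sse(right)
            if gain > best[2]:
                best = (k, threshold, gain)
    return best

# Toy node: the outcome shifts when the first variable exceeds 2.
x_rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
y = [1.0, 1.0, 5.0, 5.0]
k, threshold, gain = best_split(x_rows, y)
```

Applying the same search recursively to the observations in each resulting leaf, until the gain falls below the chosen boundary, grows the full tree.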

Among machine learning strategies, the popularity of RT is high since the results of the analysis are easy to interpret. The algorithm that determines the split allows partitions among the entire covariate set to be described by a single tree. This contrasts with econometric approaches that begin by assuming a linear parametric form to explain the same process and as with the LASSO build a statistical model to make forecasts by selecting which explanatory variables to include. The tree structure considers the full set of explanatory variables and further allows for nonlinear predictor interactions that could be missed by conventional econometric approaches. The tree is simply a top-down, flowchart-like model which represents how the dataset was partitioned into numerous final leaf nodes. The predictions of a RT can be represented by a series of discontinuous flat surfaces forming an overall rough shape, whereas as we describe below visualizations of forecasts from other machine learning methods are not intuitive.

If the data are stationary and ergodic, the RT method often demonstrates gains in forecasting accuracy relative to OLS. Intuitively, we expect the RT method to perform well since it looks to partition the sample into subgroups with heterogeneous features. With time series data, it is likely that these splits will coincide with jumps and structural breaks. However, with primarily cross-sectional data, the statistical learning literature has discovered that individual regression trees are not powerful predictors relative to ensemble methods since they exhibit large variance [21].

Ensemble methods combine estimates from multiple outputs. Bootstrap aggregating decision trees (aka bagging) proposed in [8] and random forest (RF) developed in [9] are randomization-based ensemble methods. In bagging trees (BAG), trees are built on random bootstrap copies of the original data. The BAG algorithm is summarized as below:

(i) Take a random sample with replacement from the data.

(ii) Construct a regression tree.

(iii) Use the regression tree to make a forecast, $\hat f$.

(iv) Repeat steps (i) to (iii) for b = 1, …, B and obtain $\hat f^b$ for each b.

(v) Take a simple average of the B forecasts $\hat f_{\text{BAG}} = \frac {1}{B}\sum ^B_{b=1}\hat f^b $ and consider the averaged value $\hat f_{ \text{BAG}}$ as the final forecast.
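Steps (i) to (v) can be sketched as follows. This is an illustrative implementation with hypothetical function names and default settings; it uses a conventional bootstrap that resamples observations independently, so it ignores serial dependence in the data.

```python
import random

def ssr(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_tree(x_rows, y, min_leaf=2):
    """Grow a regression tree by greedy SSR-minimizing splits; leaves hold means."""
    best = None
    for k in range(len(x_rows[0])):
        for thr in sorted({r[k] for r in x_rows})[:-1]:
            left = [i for i, r in enumerate(x_rows) if r[k] <= thr]
            right = [i for i, r in enumerate(x_rows) if r[k] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            loss = sum(ssr([y[i] for i in side]) for side in (left, right))
            if best is None or loss < best[0]:
                best = (loss, k, thr, left, right)
    if best is None or best[0] >= ssr(y):
        return sum(y) / len(y)            # leaf: local constant forecast
    _, k, thr, left, right = best
    return (k, thr,
            fit_tree([x_rows[i] for i in left], [y[i] for i in left], min_leaf),
            fit_tree([x_rows[i] for i in right], [y[i] for i in right], min_leaf))

def predict_tree(tree, row):
    """Follow the splits down to a leaf and return its mean."""
    while isinstance(tree, tuple):
        k, thr, left, right = tree
        tree = left if row[k] <= thr else right
    return tree

def bagging_forecast(x_rows, y, new_row, n_boot=50, seed=0):
    """Average the forecasts of n_boot trees, each grown on a bootstrap sample."""
    rng = random.Random(seed)
    n = len(y)
    forecasts = []
    for _ in range(n_boot):                                   # step (iv)
        idx = [rng.randrange(n) for _ in range(n)]            # step (i)
        tree = fit_tree([x_rows[i] for i in idx], [y[i] for i in idx])  # step (ii)
        forecasts.append(predict_tree(tree, new_row))         # step (iii)
    return sum(forecasts) / n_boot                            # step (v)
```

For example, `bagging_forecast(x_rows, y, x_rows[-1])` averages 50 bootstrap trees to forecast from the most recent regressor vector.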

Forecast accuracy generally increases with the number of bootstrap samples in the training process. However, more bootstrap samples increase computational time. RF can be regarded as a less computationally intensive modification of BAG. Similar to BAG, RF also constructs B new trees with (conventional or moving block) bootstrap samples from the original dataset. With RF, at each node of every tree only a random sample (without replacement) of q predictors out of the total K (q < K) predictors is considered to make a split. This process is repeated and the remaining steps (iii)–(v) of the BAG algorithm are followed. In the limiting case q = K, RF is roughly equivalent to BAG. RF forecasts involve B trees like BAG, but these trees are less correlated with each other since fewer variables are considered for a split at each node. The final RF forecast is calculated as the simple average of forecasts from each of these B trees.
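The only change RF makes to the tree-growing step is the draw of candidate predictors at each node. A minimal sketch of that draw, with a hypothetical function name and illustrative values of K and q:

```python
import random

def candidate_predictors(n_predictors, q, rng):
    """Draw q of the K predictor indices, without replacement, for one node."""
    return sorted(rng.sample(range(n_predictors), q))

rng = random.Random(0)
K, q = 6, 2
# A fresh draw is made at every node, so different nodes see different predictors.
draws = [candidate_predictors(K, q, rng) for _ in range(3)]
all_predictors = candidate_predictors(K, K, rng)   # q = K: every predictor is a candidate
```

The split search at a node is then restricted to the drawn indices, which is what lowers both the cost per node and the correlation between trees.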

The RT method can respond to highly local features in the data and is quite flexible at capturing nonlinear relationships. The final machine learning strategy we consider refines how highly local features of the data are captured. This strategy is known as boosting trees and was introduced in [21, Chapter 10]. Observations responsible for the local variation are given more weight in the fitting process. If the algorithm continues to fit those observations poorly, we reapply the algorithm with increased weight placed on those observations.

We consider a simple least squares boosting that fits RT ensembles (BOOST). Regression trees partition the space of all joint predictor variable values into disjoint regions $R_j$, j = 1, 2, …, J, as represented by the terminal nodes of the tree. A constant $\gamma_j$ is assigned to each such region and the predictive rule is $\boldsymbol{X} \in R_j \Rightarrow f(\boldsymbol{X}) = \gamma_j$, where $\boldsymbol{X}$ is the matrix with tth component $x_{t-h}$. Thus, a tree can be formally expressed as $T( \boldsymbol {X},\varTheta )=\sum _{j=1}^{J}\gamma _{j}\mathbb {I}(\boldsymbol {X} \in R_{j}),$ with parameters $\varTheta =\{R_{j},\gamma _{j}\}_{j=1}^{J}$. The parameters are found by minimizing the risk

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Equh_HTML.png

where $\mathcal L(\cdot )$ is the loss function, for example, the sum of squared residuals (SSR).

The BOOST method is a sum of all trees:

$$\displaystyle \begin{aligned} f_M(\boldsymbol{X}) = \sum^M_{m=1}T(\boldsymbol{X};\varTheta_m) \end{aligned}$$

induced in a forward stagewise manner. At each step in the forward stagewise procedure, one must solve

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Equ13_HTML.png

(12)

for the region set and constants $\varTheta _m = \{R_{jm},\gamma _{jm}\}^{J_m}_1$ of the next tree, given the current model $f_{m-1}(\boldsymbol{X})$. For squared-error loss, the solution is quite straightforward. It is simply the regression tree that best predicts the current residuals $y_t - f_{m-1}(x_{t-h})$, and $\hat \gamma _{jm}$ is the mean of these residuals in each corresponding region.
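Under squared-error loss, the forward stagewise procedure reduces to repeatedly fitting a tree to the current residuals and adding it to the sum. The following is a minimal sketch with one-split trees on a single predictor; `fit_stump` and `boost` are illustrative names, and no shrinkage or other refinement is applied.

```python
from statistics import mean


def fit_stump(xs, rs):
    """One-split regression tree: region constants are the mean residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, rs) if x <= threshold]
        right = [r for x, r in zip(xs, rs) if x > threshold]
        ml, mr = mean(left), mean(right)
        ssr = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or ssr < best[0]:
            best = (ssr, threshold, ml, mr)
    if best is None:  # no valid split: use the overall mean
        m = mean(rs)
        return lambda x: m
    _, threshold, ml, mr = best
    return lambda x: ml if x <= threshold else mr


def boost(xs, ys, M=50):
    """Least squares boosting: f_M(x) is the sum of M trees."""
    trees = []
    residuals = list(ys)  # residuals of the empty model f_0 = 0
    for _ in range(M):
        tree = fit_stump(xs, residuals)  # tree that best predicts the residuals
        trees.append(tree)
        residuals = [r - tree(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(t(x) for t in trees)
```

Each added tree can only lower (or leave unchanged) the in-sample SSR, which is why the number of learning cycles M acts as a tuning parameter.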

A popular alternative to a tree-based procedure to solve regression problems developed in the machine learning literature is the support vector regression (SVR). SVR has been found in numerous applications including Lehrer and Xie [26] to perform well in settings where there are a small number of observations (< 500). Support vector regression is an extension of the support vector machine classification method of Vapnik [40]. The key feature of this algorithm is that it solves for a best fitting hyperplane using a learning algorithm that infers the functional relationships in the underlying dataset by following the structural risk minimization induction principle of Vapnik [40]. Since it looks for a functional relationship, it can find nonlinearities that many econometric procedures may miss using an a priori chosen mapping that transforms the original data into a higher dimensional space.

Support vector regression was introduced in [16]. The true data that one wishes to forecast are assumed to be generated as $y_t = f(x_t) + e_t$, where f is unknown to the researcher and $e_t$ is the error term. The SVR framework approximates $f(x_t)$ in terms of a set of basis functions $\{h_s(\cdot )\}^S_{s=1}$:

$$\displaystyle \begin{aligned} y_{t}=f({ {x}}_{t})+e_t =\sum_{s=1}^{S}\beta _{s}h_{s}({ {x}}_{t})+e_{t}, \end{aligned}$$

where $h_s(\cdot)$ is implicit and can be infinite-dimensional. The coefficients $\beta = [\beta_1, \cdots, \beta_S]^{\top}$ are estimated through the minimization of

$$\displaystyle \begin{aligned} H( {\beta })=\sum_{t=1}^{T}V_{\epsilon}\left( y_{t}-f( {x} _{t})\right) +{\lambda }\sum_{s=1}^{S}\beta _{s}^{2}, {} \end{aligned} $$

(13)

where the loss function

$$\displaystyle \begin{aligned} V_{\epsilon}(r)=\left\{ \begin{array}{cl} 0 & \text{if }|r|<\epsilon \\ |r|-\epsilon & \text{otherwise} \end{array} \right. \end{aligned}$$

is called an 𝜖-insensitive error measure that ignores errors of size less than 𝜖. The parameter 𝜖 is usually decided beforehand and λ can be estimated by cross-validation.
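The 𝜖-insensitive error measure and the penalized objective in (13) can be written out directly. This is a small sketch that evaluates the objective for given residuals and coefficients; it does not solve the minimization, and the function names are illustrative.

```python
def eps_insensitive(r, eps):
    """V_eps(r): zero inside the eps-tube, linear outside it."""
    return 0.0 if abs(r) < eps else abs(r) - eps


def svr_objective(residuals, beta, eps, lam):
    """H(beta) in (13): sum of V_eps over residuals plus lam * sum of beta_s^2."""
    loss = sum(eps_insensitive(r, eps) for r in residuals)
    penalty = lam * sum(b * b for b in beta)
    return loss + penalty


# A residual of 0.05 is ignored when eps = 0.1; a residual of -0.5 costs 0.4.
value = svr_objective([0.05, -0.5], [1.0, 2.0], eps=0.1, lam=0.5)
```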

Suykens and Vandewalle [38] proposed a modification to the classic SVR that eliminates the hyperparameter 𝜖 and replaces the original 𝜖-insensitive loss function with a least squares loss function. This is known as the least squares SVR (LSSVR). The LSSVR considers minimizing

$$\displaystyle \begin{aligned} H(\boldsymbol{\beta })=\sum_{t=1}^{T}\left( y_{t}-f( {x} _{t})\right) ^{2}+{\lambda }\sum_{s=1}^{S}\beta _{s}^{2}, {} \end{aligned} $$

(14)

where a squared loss function replaces $V_{\epsilon}(\cdot)$ for the LSSVR.

Estimating the nonlinear algorithms (13) and (14) requires a kernel-based procedure that can be interpreted as mapping the data from the original input space into a potentially higher-dimensional “feature space,” where linear methods may then be used for estimation. The use of kernels enables us to avoid paying the computational penalty implicit in the number of dimensions, since it is possible to evaluate the training data in the feature space through indirect evaluation of the inner products. As such, the kernel function is essential to the performance of SVR and LSSVR since it contains all the information available in the model and training data to perform supervised learning, with the sole exception of having measures of the outcome variable. Formally, we define the kernel function $K(\boldsymbol x, x_t) = h(\boldsymbol x)h(x_t)^{\top}$ as the linear dot product of the nonlinear mapping for any input variable $\boldsymbol x$. In our analysis, we consider the Gaussian kernel (sometimes referred to as “radial basis function” and “Gaussian radial basis function” in the support vector literature):

$$\displaystyle \begin{aligned} K(\boldsymbol x, { {x}}_{t})=\exp \left( -\frac{\Vert {\boldsymbol x - {x}}_{t}\Vert ^{2}}{2\sigma _{x}^{2}}\right), \end{aligned}$$

where $\sigma _{x}^{2}$ and γ are the hyperparameters.
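The Gaussian kernel can be evaluated from pairwise distances alone, without ever constructing the mapping h(⋅). A short sketch follows; the function names and the example value of $\sigma_x^2$ are illustrative assumptions.

```python
import math


def gaussian_kernel(x, z, sigma2):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma2))


def gram_matrix(X, sigma2):
    """Kernel matrix with entry (i, j) equal to K(x_i, x_j)."""
    return [[gaussian_kernel(xi, xj, sigma2) for xj in X] for xi in X]


X = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
K = gram_matrix(X, sigma2=1.0)
```

The resulting matrix is symmetric with ones on the diagonal; a smaller $\sigma_x^2$ makes the kernel more local.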

In our main analysis, we use a tenfold cross-validation to pick the tuning parameters for LASSO, SVR, and LSSVR. For tree-type machine learning methods, we set the basic hyperparameters of a regression tree at their default values. These include, but are not limited to: (1) the split criterion is SSR; (2) the maximum number of splits is 10 for BOOST and n − 1 for others; (3) the minimum leaf size is 1; (4) the number of predictors for a split is K∕3 for RF and K for others; and (5) the number of learning cycles is B = 100 for ensemble learning methods. We examine the robustness to different values for the hyperparameters in Sect. 7.3.

Substantial progress has been made in the machine learning literature on quickly converting text to data, generating real-time information on social media content. To measure social media sentiment, we selected an algorithm introduced in [17] that pre-trained a five-hidden-layer neural model on 124.6 million tweets containing emojis in order to learn better representations of the emotional context embedded in the tweet. This algorithm was developed to provide a means to learn representations of emotional content in texts and is available with pre-processing code, examples of usage, and benchmark datasets, among other features at http://www.github.com/bfelbo/deepmoji. The pre-training data is split into a training, validation, and test set, where the validation and test sets are randomly sampled in such a way that each emoji is equally represented. This data includes all English Twitter messages without URLs within the period considered that contained an emoji. The fifth layer of the algorithm focuses on attention and takes inputs from the prior layers, which use a multi-class learner to decode the text and the emojis themselves. See [17] for further details. Thus, an emoji is viewed as a labeling system for emotional content.

The construction of the algorithm began by acquiring a dataset of 55 billion tweets, of which all tweets with emojis were used to train a deep learning model. That is, the text in the tweet was used to predict which emoji was included with which tweet. The premise of this algorithm is that if it could understand which emoji was included with a given sentence in the tweet, then it has a good understanding of the emotional content of that sentence. The goal of the algorithm is to understand the emotions underlying the words that an individual tweets. The key feature of this algorithm compared to one that simply scores words themselves is that it is better able to detect irony and sarcasm. As such, the algorithm does not score individual emotion words in a Twitter message, but rather calculates a score based on the probability of each of 64 different emojis capturing the sentiment in the full Twitter message, taking the structure of the sentence into consideration. Thus, each emoji has a fixed score and the sentiment of a message is a weighted average of the type of mood being conveyed, since messages containing multiple words are translated to a set of emojis to capture the emotion of the words within.

In brief, for a random sample of 10% of all tweets every minute, the score is calculated as an equal-tweet-weighted average of the sentiment values of the words within them.¹³ That is, we apply the pre-trained classifier of Felbo et al. [17] to score each of these tweets and note that there are computational challenges related to data storage when using very large datasets to undertake sentiment analysis. In our application, the number of tweets in our 10% random sample generally varies between 120,000 and 200,000 per hour. We denote the minute-level sentiment index as the U.S. Sentiment Index (USSI).

In other words, if there are 10,000 tweets each hour, we first convert each tweet to a set of emojis. Then we convert the emojis to numerical values based on a fixed mapping related to their emotional content. For each of the 10,000 tweets posted in that hour, we next calculate the average of these scores as the emotion content or sentiment of that individual tweet. We then calculate the equal weighted average of these tweet-specific scores to gain an hourly measure. Thus, each tweet is treated equally irrespective of whether one tweet contains more emojis than the other. This is then repeated for each hour of each day in our sample providing us with a large time series.
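The two-stage averaging just described can be sketched as follows. The emoji labels and their numerical values in `EMOJI_SCORE` are hypothetical placeholders (the actual fixed mapping is not reported here), and the emoji sets are assumed to have already been produced by the classifier.

```python
from statistics import mean

# Hypothetical fixed mapping from emojis to sentiment values.
EMOJI_SCORE = {"joy": 1.0, "neutral": 0.0, "angry": -1.0}


def tweet_sentiment(emojis):
    """Average the fixed scores of the emojis assigned to one tweet."""
    return mean(EMOJI_SCORE[e] for e in emojis)


def hourly_index(tweets):
    """Equal-weighted average of the tweet-level scores within an hour."""
    return mean(tweet_sentiment(t) for t in tweets)


# Two tweets: one mapped to three emojis, one mapped to a single emoji.
tweets = [["joy", "joy", "joy"], ["angry"]]
index = hourly_index(tweets)  # 0.0: each tweet counts once
```

Pooling all four emojis instead would give 0.5, which shows why the number of emojis in a tweet does not affect its weight in the index.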

Similar to many other text mining tasks, this sentiment analysis was initially designed to deal with English text. It would be simple to apply an off-the-shelf machine translation tool in the spirit of Google Translate to generate pseudo-parallel corpora and then learn bilingual representations for downstream sentiment classification tasks of tweets that were initially posted in different languages. That said, due to the ubiquitous usage of emojis across languages and their functionality of expressing sentiment, alternative emoji-powered algorithms have been developed for other languages. These have smaller training datasets since most tweets are in English, and it is an open question as to whether they perform better than applying the [17] algorithm to pseudo-tweets.

Note that the way we construct USSI does not necessarily focus on sentiment related to cryptocurrency only as in [29]. Sentiment, in- and off-market, has been a major factor affecting the prices of financial assets [23]. Empirical works have documented that large national sentiment swings can cause large fluctuations in asset prices, for example, [5, 37]. It is therefore natural to assume that national sentiment can affect financial market volatility.

Data timing presents a serious challenge in using minutely measures of the USSI to forecast the daily Bitcoin RV. Since USSI is constructed at minute level, we convert the minute-level USSI to match the daily sampling frequency of Bitcoin RV using the heterogeneous mixed data sampling (H-MIDAS) method of Lehrer et al. [28].¹⁴ This allows us to transform 1,172,747 minute-level observations for the USSI variable via a step function to allow for heterogeneous effects of different high-frequency observations into 775 daily observations for the USSI at different forecast horizons. This step function produces a different weight on the hourly levels in the time series and can capture the relative importance of users’ emotional content across the day since the type of users varies in a manner that may be related to BTC volatility. The estimated weights used in the H-MIDAS transformation for our application are presented in Fig. 2.
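The idea of a step function over high-frequency lags can be illustrated with a simplified sketch. It only shows how weights that are constant within blocks of lags collapse many intra-day observations into one daily value; the block edges and weight levels below are made-up numbers, and the sketch is not the H-MIDAS estimator of [28], which estimates the weights from the data.

```python
def step_weights(edges, levels):
    """Build lag weights that are constant within each block of lags.

    edges = [0, 2, 6] with levels = [0.3, 0.1] assigns weight 0.3 to
    lags 0-1 and weight 0.1 to lags 2-5.
    """
    weights = []
    for k, level in enumerate(levels):
        weights.extend([level] * (edges[k + 1] - edges[k]))
    return weights


def aggregate(high_freq_obs, weights):
    """Weighted sum of the intra-day observations (most recent first)."""
    if len(high_freq_obs) != len(weights):
        raise ValueError("one weight is needed per high-frequency observation")
    return sum(w * x for w, x in zip(weights, high_freq_obs))


# Hypothetical day with m = 6 intra-day sentiment readings.
w = step_weights([0, 2, 6], [0.3, 0.1])
daily_value = aggregate([2.0, 1.5, 1.0, 0.5, 0.0, -0.5], w)
```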

Last, Table 1 presents the summary statistics for the RV data and p-values from both the Jarque–Bera test for normality and the Augmented Dickey–Fuller (ADF) tests for unit root. We consider the first half sample, the second half sample, and full sample. Each of the series exhibits tremendous variability and a large range across the sample period. Further, none of the series are normally distributed or nonstationary at the 5% level.

Table 1

Descriptive statistics

Statistics	RV (first half)	RV (second half)	RV (full sample)	USSI
Mean	43.4667	12.1959	27.8313	117.4024
Median	31.2213	7.0108	17.4019	125.8772
Maximum	197.6081	115.6538	197.6081	657.4327
Minimum	5.0327	0.5241	0.5241	−866.6793
Std. dev.	38.0177	15.6177	32.9815	179.1662
Skewness	2.1470	3.3633	2.6013	−0.8223
Kurtosis	7.8369	18.2259	11.2147	5.8747
Jarque–Bera (p-value)	0.0000	0.0000	0.0000	0.0000
ADF test (p-value)	0.0000	0.0000	0.0000	0.0000

Table 2

List of estimators

Panel A: conventional regression
(1)	AR(1)	A simple autoregressive model
(2)	HAR-Full	The HAR model proposed in [11] with l = [1, 2, …, 30], which is equivalent to AR(30)
(3)	HAR	The conventional HAR model proposed in [11] with l = [1, 7, 30]
(4)	HAR-J	The HAR model with jump component proposed in [4]
(5)	HAR-CJ	The HAR model with continuous jump component proposed in [4]
(6)	HAR-RS-I	The HAR model with semi-variance components (Type I) proposed in [34]
(7)	HAR-RS-II	The HAR model with semi-variance components (Type II) proposed in [34]
(8)	HAR-SJ-I	The HAR model with semi-variance and jump components (Type I) proposed in [34]
(9)	HAR-SJ-II	The HAR model with semi-variance and jump components (Type II) proposed in [34]
Panel B: machine learning strategy
(10)	LASSO	The least absolute shrinkage and selection operator by Tibshirani [39]
(11)	RT	The regression tree method proposed by Breiman et al. [10]
(12)	BOOST	The boosting tree method described in [21]
(13)	BAG	The bagging tree method proposed by Breiman [8]
(14)	RF	The random forest method proposed by Breiman [9]
(15)	SVR	The support vector machine for regression by Drucker et al. [16]
(16)	LSSVR	The least squares support vector regression by Suykens and Vandewalle [38]

6 Empirical Exercise

To examine the relative prediction efficiency of different HAR estimators, we conduct an h-step-ahead rolling window exercise of forecasting the BTC/USD RV for different forecasting horizons.¹⁵ Table 2 lists each estimator analyzed in the exercise. For all the HAR-type estimators in Panel A (except the HAR-Full model, which uses all the lagged covariates from 1 to 30), we set l = [1, 7, 30]. For the machine learning methods in Panel B, the input data include the same covariates as the HAR-Full model. Throughout the experiment, the window length is fixed at WL = 400 observations. Our conclusions are robust to other window lengths as discussed in Sect. 7.1.
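The mechanics of a fixed-length rolling window exercise can be sketched with the simplest estimator in Table 2, an AR(1) fit by least squares. The direct h-step regression of $y_t$ on $y_{t-h}$, the helper names, and the toy window length are assumptions made for illustration and need not match the exact implementation used for the reported results.

```python
def ols_line(x, y):
    """Intercept and slope from a least squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my - slope * mx, slope


def rolling_forecasts(y, window, h=1):
    """Return (actual, forecast) pairs from a fixed-length rolling window."""
    pairs = []
    for end in range(window, len(y) - h + 1):
        sample = y[end - window:end]               # most recent observations
        a, b = ols_line(sample[:-h], sample[h:])   # regress y_t on y_{t-h}
        forecast = a + b * sample[-1]              # forecast of y[end - 1 + h]
        pairs.append((y[end - 1 + h], forecast))
    return pairs


# Toy series; in the chapter's setting the window would be WL = 400.
series = [2.0 * t + 1.0 for t in range(20)]
pairs = rolling_forecasts(series, window=10, h=1)
```

Each window is re-estimated after dropping the oldest observation and adding the newest one, so the number of forecasts equals the sample size minus the window length minus h plus one.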

Table 3

Forecasting performance of strategies in the main exercise

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figb_HTML.png

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figc_HTML.png

The best result under each criterion is highlighted in boldface

To examine if the sentiment data extracted from social media improves forecasts, we contrasted the forecast from models that exclude the USSI to models that include the USSI as a predictor. We denote methods incorporating the USSI variable with ∗ symbol in each table. The results of the prediction experiment are presented in Table 3. The estimation strategy is listed in the first column and the remaining columns present alternative criteria to evaluate the forecasting performance. The criteria include the mean squared forecast error (MSFE), quasi-likelihood (QLIKE), mean absolute forecast error (MAFE), and standard deviation of forecast error (SDFE) that are calculated as

$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{MSFE}(h)& =&\displaystyle \frac{1}{V}\sum_{j=1}^{V}{e}^2_{T_{j},h}, {} \end{array} \end{aligned} $$

(15)

$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{QLIKE}(h)& =&\displaystyle \frac{1}{V}\sum_{j=1}^{V}\left(\log\hat{y}_{T_{j},h}+\frac{y_{T_{j},h}}{\hat{y}_{T_{j},h} }\right), \end{array} \end{aligned} $$

(16)

$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{MAFE}(h)& =&\displaystyle \frac{1}{V}\sum_{j=1}^{V}|e_{T_{j},h}| , {} \end{array} \end{aligned} $$

(17)

$$\displaystyle \begin{aligned} \begin{array}{rcl} \text{SDFE}(h)& =&\displaystyle \sqrt{\frac{1}{V-1}\sum_{j=1}^{V}\left( e_{T_{j},h}-\frac{1}{V} \sum_{k=1}^{V}e_{T_{k},h}\right) ^{2}}, {} \end{array} \end{aligned} $$

(18)

where $e_{T_{j},h}=y_{T_{j},h}-\hat {y}_{T_{j},h}$ is the forecast error and $\hat {y}_{T_{j},h}$ is the h-day ahead forecast with information up to $T_j$, which stands for the last observation in each of the V rolling windows. We also report the Pseudo $R^2$ of the Mincer–Zarnowitz regression [32] given by:

$$\displaystyle \begin{aligned} y_{T_{j},h}=a+b\hat{y}_{T_{j},h}+u_{T_{j}},\text{for }j=1,2,\ldots ,V, {} \end{aligned} $$

(19)

Each panel in Table 3 presents the result corresponding to a specific forecasting horizon. We consider various forecasting horizons h = 1, 2, 4, and 7.
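For concreteness, the evaluation criteria can be computed from a list of realized values and forecasts as in the sketch below. The function name `forecast_metrics` is illustrative; SDFE is computed as the sample standard deviation of the forecast errors, and the Pseudo $R^2$ is the $R^2$ of the Mincer–Zarnowitz regression of realized values on forecasts with an intercept. QLIKE requires strictly positive forecasts.

```python
import math


def forecast_metrics(actual, forecast):
    """MSFE, QLIKE, MAFE, SDFE and Mincer-Zarnowitz R^2 for V forecasts."""
    V = len(actual)
    errors = [y - f for y, f in zip(actual, forecast)]
    msfe = sum(e * e for e in errors) / V
    qlike = sum(math.log(f) + y / f for y, f in zip(actual, forecast)) / V
    mafe = sum(abs(e) for e in errors) / V
    mean_e = sum(errors) / V
    sdfe = math.sqrt(sum((e - mean_e) ** 2 for e in errors) / (V - 1))
    # R^2 of y = a + b * y_hat + u equals the squared sample correlation.
    mf, my = sum(forecast) / V, sum(actual) / V
    sff = sum((f - mf) ** 2 for f in forecast)
    syy = sum((y - my) ** 2 for y in actual)
    sfy = sum((f - mf) * (y - my) for f, y in zip(forecast, actual))
    r2 = sfy ** 2 / (sff * syy)
    return {"MSFE": msfe, "QLIKE": qlike, "MAFE": mafe, "SDFE": sdfe, "R2": r2}


metrics = forecast_metrics([2.0, 4.0, 3.0], [1.5, 4.5, 3.5])
```

Lower values of the first four criteria and a higher Pseudo $R^2$ indicate better forecasts.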

To ease interpretation, we focus on the following representative methods: HAR, HAR-CJ, HAR-RS-II, LASSO, RF, BAG, and LSSVR with and without the USSI variable. Comparison results between all methods listed in Table 2 are available upon request. We find a consistent ranking of modeling methods across all forecast horizons. The tree-based machine learning methods (BAG and RF) outperform all others in each panel. Moreover, methods with USSI (indicated by ∗) always dominate those without USSI, which indicates the importance of incorporating social media sentiment data. We also discover that the conventional econometric methods have unstable performance; for example, the HAR-RS-II model without USSI has the worst performance when h = 1, but its performance improves when h = 2. The mixed performance of the linear models implies that this restrictive formulation may not be robust to model the highly volatile BTC/USD RV process.

To examine if the improvement from the BAG and RF methods is statistically significant, we perform the modified Giacomini–White test [18] of the null hypothesis that the column method performs equally well as the row method in terms of MAFE. The corresponding p-values are presented in Table 4 for h = 1, 2, 4, 7. We see that the gains in forecast accuracy from BAG∗ and RF∗ relative to all other strategies are statistically significant, although results between BAG∗ and RF∗ are statistically indistinguishable.

Table 4

Giacomini–White test results

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figd_HTML.png

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Fige_HTML.png

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figf_HTML.png

p-values smaller than 5% are highlighted in boldface

Table 5

Forecasting performance by different window lengths (h = 1)

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figg_HTML.png

The best result under each criterion is highlighted in boldface

7 Robustness Check

In this section, we perform four robustness checks of our main results. We first vary the window length for the rolling window exercise in Sect. 7.1. We next consider different sample periods in Sect. 7.2. We explore the use of different hyperparameters for the machine learning methods in Sect. 7.3. Our final robustness check, in Sect. 7.4, examines if BTC/USD RV is correlated with other types of financial markets by including mainstream assets RV as additional covariates. Each of these robustness checks that are reported in the main text considers h = 1.¹⁶

7.1 Different Window Lengths

In the main exercise, we set the window length WL = 400. In this section, we also try other window lengths, WL = 300 and 500. Table 5 shows the forecasting performance of all the estimators for various window lengths. In all the cases, BAG∗ and RF∗ yield the smallest MSFE, MAFE, and SDFE and the largest Pseudo $R^2$. We examine the statistical significance of the improvement in forecasting accuracy in Table 6. The small p-values on testing BAG∗ and RF∗ against other strategies indicate that the forecasting accuracy improvement is statistically significant at the 5% level.

Table 6

Giacomini–White test results by different window lengths (h = 1)

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figh_HTML.png

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figi_HTML.png

p-values smaller than 5% are highlighted in boldface

7.2 Different Sample Periods

In this section, we partition the entire sample period in half: the first subsample period runs from May 20, 2015, to July 29, 2016, and the second subsample period runs from July 30, 2016, to Aug 20, 2017. We carry out a similar out-of-sample analysis with WL = 200 for the two subsamples in Table 7 Panels A and B, respectively. We also examine the statistical significance in Table 8. The previous conclusions remain basically unchanged under the subsamples.

Table 7

Forecasting performance by different sample periods (h = 1)

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figj_HTML.png

The best result under each criterion is highlighted in boldface

Table 8

Giacomini–White test results by different sample periods (h = 1)

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figk_HTML.png

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figl_HTML.png

p-values smaller than 5% are highlighted in boldface

7.3 Different Tuning Parameters

In this section, we examine the effect of different tuning parameters for the machine learning methods. We consider a different set of tuning parameters: B = 20 for RF and BAG, and λ = 0.5 for LASSO, SVR, and LSSVR. The machine learning methods with the second set of tuning parameters are labeled as RF2, BAG2, and LASSO2. We replicate the main empirical exercise in Sect. 6 and compare the performance of machine learning methods with different tuning parameters.

The results are presented in Tables 9 and 10. Changes in the considered tuning parameters generally have marginal effects on the forecasting performance, although the results for the second tuning parameters are slightly worse than those under the default setting. Last, social media sentiment data plays a crucial role in improving the out-of-sample performance in each of these exercises.

Table 9

Forecasting performance by different tuning parameters (h = 1)

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figm_HTML.png

The best result under each criterion is highlighted in boldface

Table 10

Giacomini–White test results by different tuning parameters (h = 1)

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Fign_HTML.png

p-values smaller than 5% are highlighted in boldface

7.4 Incorporating Mainstream Assets as Extra Covariates

In this section, we examine if the mainstream asset class has a spillover effect on BTC/USD RV. We include the RVs of the S&P and NASDAQ indices ETFs (ticker names: SPY and QQQ, respectively) and the CBOE Volatility Index (VIX) as extra covariates. For SPY and QQQ, we proxy daily spot variances by daily realized variance estimates. For the VIX, we collect the daily data from CBOE. The extra covariates are described in Table 11.

Table 11

Descriptive statistics

Statistics	SPY	QQQ	VIX
Mean	0.3839	0.7043	15.0144
Median	0.2034	0.3515	13.7300
Maximum	12.1637	70.6806	40.7400
Minimum	0.0143	0.0468	9.3600
Std. Dev.	0.6946	3.1108	4.5005
Skewness	10.1587	21.3288	1.6188
Kurtosis	158.5806	479.5436	6.3394
Jarque–Bera	0.0010	0.0010	0.0010
ADF Test	0.0010	0.0010	0.0010

The data range is from May 20, 2015, to August 18, 2017, with 536 total observations. Fewer observations are available since mainstream asset exchanges are closed on the weekends and holidays. We truncate the BTC/USD data accordingly. We compare forecasts from models with two groups of covariate data: one with only the USSI variable and the other which includes both the USSI variable and the mainstream RV data (SPY, QQQ, and VIX). Estimates that include the larger covariate set are denoted by the symbol ∗∗.

The rolling window forecasting results with WL = 300 are presented in Table 12. Comparing results across any strategy between Panels A and B, we do not observe obvious improvements in forecasting accuracy. This implies that mainstream asset market RV does not affect BTC/USD volatility, which reinforces the fact that crypto-assets are sometimes considered as a hedging device for many investment companies.¹⁷

Table 12

Forecasting performance

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figo_HTML.png

The best result under each criterion is highlighted in boldface

Last, we use the GW test to formally explore whether there are differences in forecast accuracy between the panels of Table 12, with results reported in Table 13. For each estimator, we present the p-values from different covariate groups in bold. Each of these p-values exceeds 5%, which supports our finding that mainstream asset RV data does not improve forecasts sharply, unlike the inclusion of social media data.

Table 13

Giacomini–White test results

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Figp_HTML.png

p-values smaller than 5% are highlighted in boldface

8 Conclusion

In this chapter, we compare the performance of numerous econometric and machine learning forecasting strategies to explain the short-term realized volatility of the Bitcoin market. Our results first complement a rapidly growing body of research that finds benefits from using machine learning techniques in the context of financial forecasting. Our application involves forecasting an asset that exhibits significantly more variation than much of the earlier literature which could present challenges in settings such as ours with fewer than 800 observations. Yet, our result further highlights that what drives the benefits of machine learning is the accounting for nonlinearities and there are much smaller gains from using regularization or cross-validation. Second, we find substantial benefits from using social media data in our forecasting exercise that hold irrespective of the estimator. These benefits are larger when we consider new econometric tools to more flexibly handle the difference in the timing of the sampling of social media and financial data.

Taken together, there are benefits from using both new data sources from the social web and predictive techniques developed in the machine learning literature for forecasting financial data. We suggest that the benefits from these tools will likely increase as researchers begin to understand why they work and what they measure. While our analysis suggests nonlinearities are important to account for, more work is needed to incorporate heterogeneity from heteroskedastic data in machine learning algorithms.¹⁸ We observe significant differences between SVR and LSSVR so the change in loss function can explain a portion of the gains within machine learning relative to econometric strategies, but not to the same extent as nonlinearities, which the tree-based strategies also account for and use a similar loss function based on SSR.

Our investigation focused on the performance of what are currently the most popular algorithms considered by social scientists. There have been many advances developing powerful algorithms in the machine learning literature including deep learning procedures which consider more hidden layers than the neural network procedures considered in the econometrics literature between 1995 and 2015. Similarly, among tree-based procedures, we did not consider eXtreme gradient boosting which applies more penalties in the boosting equation when updating trees and residual compared to the classic boosting method we employed. Both eXtreme gradient boosting and deep learning methods present significant challenges regarding interpretability relative to the algorithms we examined in the empirical exercise.

Further, machine learning algorithms were not developed for time series data and more work is needed to develop methods that can account for serial dependence, long memory, as well as the consequences of having heterogeneous investors.¹⁹ That is, while time series forecasting is an important area of machine learning (see [19, 30], for recent overviews that consider both one-step-ahead and multi-horizon time series forecasting), concepts such as autocorrelation and stationarity which pervade developments in financial econometrics have received less attention. We believe there is potential for hybrid approaches in the spirit of Lehrer and Xie [25] with group LASSO estimators. Further, developing machine learning approaches that consider interpretability appears crucial for many forecasting exercises whose results need to be conveyed to business leaders who want to make data-driven decisions. Last, given the random sample of Twitter users from which we measure sentiment, there is likely measurement error in our sentiment and our estimate should be interpreted as a lower bound.

Given the empirical importance of incorporating social media data in our forecasting models, there is substantial scope for further work that generates new insights with finer measures of this data. For example, future work could consider extracting Twitter messages that only capture the views of market participants rather than the entire universe of Twitter users. Work is also needed to clearly identify bots and consider how best to handle fake Twitter accounts. Similarly, research could strive to understand shifting sentiment for different groups on social media in response to news events. This can help improve our understanding of how responses to unexpected news lead investors to reallocate across asset classes.²⁰

In summary, we remain at the early stages of extracting the full set of benefits from machine learning tools used to measure sentiment and conduct predictive analytics. For example, the Bitcoin market is international but the tweets used to estimate sentiment in our analysis were initially written in English. Whether the findings are robust to the inclusion of Tweets posted in other languages represents an open question for future research. As our understanding of how to account for real-world features of data increases with these data science tools, the full hype of machine learning and data science may be realized.

Acknowledgements

We wish to thank Yue Qiu, Jun Yu, and Tao Zeng, seminar participants at Singapore Management University, for helpful comments and suggestions. Xie’s research is supported by the Natural Science Foundation of China (71701175), the Chinese Ministry of Education Project of Humanities and Social Sciences (17YJC790174), and the Fundamental Research Funds for the Central Universities. Contact Tian Xie (e-mail: xietian@shufe.edu.cn) for any questions concerning the data and/or codes. The usual caveat applies.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


Appendix: Data Resampling Techniques

Substantial progress has been made in the machine learning literature on quickly converting text to data, generating real-time information on social media content. In this study, we also explore the benefits of incorporating an aggregate measure of social media sentiment, the Wall Street Journal-IHS Markit US Sentiment Index (USSI), in forecasting the Bitcoin RV. However, data timing presents a serious challenge in using minutely measures of the USSI to forecast the daily Bitcoin RV. To convert the minutely USSI measure to match the sampling frequency of the Bitcoin RV, we introduce a few popular data resampling techniques.

Let $y_{t+h}$ be the target h-step-ahead low-frequency variable (e.g., the daily realized variance) that is sampled at periods denoted by a time index t for t = 1, …, n. Consider a higher-frequency predictor $\boldsymbol {X}^{hi}_t$ (e.g., the USSI) that is sampled m times within period t:

$$\displaystyle \begin{aligned} \boldsymbol{X}_{t}^{h}\equiv\left[X^{hi}_{t}, X^{hi}_{t-\frac{1}{m}}, \ldots, X^{hi}_{t-\frac{m-1}{m}}\right]^{\top }. {} \end{aligned} $$

(20)

A specific element among the high-frequency observations in $\boldsymbol {X} ^{hi}_t$ is denoted by $X^{hi}_{t-\frac {i}{m}}$ for i = 0, …, m − 1. Denoting $L^{i/m}$ as the lag operator, $X^{hi}_{t-\frac {i}{m}}$ can be re-expressed as $X^{hi}_{t-\frac {i}{m}} = L^{i/m} X^{hi}_t$ for i = 0, …, m − 1.

Since $\boldsymbol {X}_{t}^{h}$ and $y_{t+h}$ are measured at different frequencies, we need to convert the higher-frequency data to match the lower-frequency data. The simplest conversion is an average of the high-frequency observations $ \boldsymbol {X}_{t}^{h}$:

$$\displaystyle \begin{aligned} \bar{X}_{t}=\frac{1}{m}\sum_{i=0}^{m-1}L^{i/m}X_{t}^{h}, \end{aligned}$$

where $\bar {X}_{t}$ is likely the easiest way to estimate a low-frequency $X_{t}$ that can match the frequency of $y_{t+h}$. With the variables $y_{t+h}$ and $\bar {X}_{t}$ being measured in the same time domain, a regression approach is simply

$$\displaystyle \begin{aligned} y_{t+h}={\alpha }+\gamma \bar{X}_{t}+\epsilon _{t}={\alpha }+\frac{\gamma }{m }\sum_{i=0}^{m-1}L^{i/m}X_{t}^{h}+\epsilon _{t}, {} \end{aligned} $$

(21)

where α is the intercept and γ is the slope coefficient on the time-averaged $\bar {X}_{t}$. This approach assumes that each element in $ \boldsymbol {X}_{t}^{h}$ has an identical effect on explaining $y_{t+h}$.
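As a minimal sketch of the time-averaging regression in Eq. (21), the Python snippet below averages a simulated minutely predictor to the daily frequency and estimates α and γ by OLS. The data, the sample sizes, and the "true" coefficient values are all illustrative assumptions and are not taken from the chapter; only NumPy is used.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, h = 200, 1440, 1                 # days, minutes per day, horizon (illustrative)

# Simulated stand-in for the minutely predictor: row t holds the m intraday
# values of day t, with column i playing the role of X^{hi}_{t - i/m}.
X_hi = rng.normal(size=(n, m))

# Time-averaged regressor: X_bar[t] = (1/m) * sum_i X_hi[t, i].
X_bar = X_hi.mean(axis=1)

# Simulated daily target in which y[t + h] depends on X_bar[t]
# (true alpha = 0.5, true gamma = 2.0; both are made-up values).
y = np.zeros(n)
y[h:] = 0.5 + 2.0 * X_bar[:-h] + 0.001 * rng.normal(size=n - h)

# Align y_{t+h} with X_bar_t and estimate Equation (21) by OLS.
Z = np.column_stack([np.ones(n - h), X_bar[:-h]])
(alpha_hat, gamma_hat), *_ = np.linalg.lstsq(Z, y[h:], rcond=None)

# Implied weight on every single minutely observation: gamma / m.
per_obs_weight = gamma_hat / m
print(alpha_hat, gamma_hat, per_obs_weight)
```

The last line makes the homogeneity restriction explicit: every one of the m intraday observations receives the same weight γ/m, regardless of when in the day it was recorded.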

This homogeneity assumption may be quite strong in practice. One could instead assume that the slope coefficient for each element in $\boldsymbol {X }^{hi}_t$ is unique. Following Lehrer et al. [28], extending Model (21) to allow for heterogeneous effects of the high-frequency observations generates

$$\displaystyle \begin{aligned} y_{t+h}= {\alpha }+\sum_{i=0}^{m-1}\gamma _{i}L^{i/m}X_{t}^{hi}+\epsilon _{t}, {} \end{aligned} $$

(22)

where the $\gamma_{i}$, for i = 0, …, m − 1, are the slope coefficients on the high-frequency observations $X^{hi}_{t-\frac {i}{m}}$.

Since γ _i is unknown, estimating these parameters can be problematic when m is a relatively large number. The heterogeneous mixed data sampling (H-MIDAS) method by Lehrer et al. [28] uses a step function to allow for heterogeneous effects of different high-frequency observations on the low-frequency dependent variable. A low-frequency $\bar {X}_{t}^{\left ( l\right ) } $ can be constructed following

$$\displaystyle \begin{aligned} \bar{X}_{t}^{\left( l\right) } \equiv \frac{1}{l} \sum_{i=0}^{l-1}L^{i/m}X^{hi}_{t}=\frac{1}{l}\sum_{i=0}^{l-1}X^{hi}_{t-\frac{ i}{m}}, {} \end{aligned} $$

(23)

where l is a predetermined number and l ≤ m. Equation (23) implies that we compute the variable $\bar {X}_{t}^{\left ( l\right ) }$ as a simple average of the first l observations in $\boldsymbol {X}^{hi}_t$ and ignore the remaining observations. We consider different values of l and group all $\bar {X}_{t}^{\left ( l\right ) }$ into $\boldsymbol {\tilde {X}}_{t}$ such that

$$\displaystyle \begin{aligned} \boldsymbol{\tilde{X}}_{t}=\left[ \bar{X}_{t}^{\left( l_{1}\right) },\bar{X} _{t}^{\left( l_{2}\right) },\ldots ,\bar{X}_{t}^{\left( l_{p}\right) }\right] , \end{aligned}$$

where we set l ₁ < l ₂ < ⋯ < l _p. Consider a weight vector $ \boldsymbol {w=}\left [ w_{1},w_{2},\ldots ,w_{p}\right ] ^{{ }^{\top }}$ with $ \sum _{j=1}^{p}w_{j}=1$; we can construct regressor ${X}_{t}^{new}$ as ${X} _{t}^{new}=\boldsymbol {\tilde {X}}_{t} \boldsymbol {w}. $ The regression based on the H-MIDAS estimator can be expressed as

$$\displaystyle \begin{aligned} y_{t+h} =\beta {X}_{t}^{new}+\epsilon _{t} = \beta \sum_{s=1}^{p}\sum_{j=s}^{p}\frac{w_{j}}{l_{j}} \sum_{i=l_{s-1}}^{l_{s}-1}L^{i/m}X_{t}^{h}+\epsilon _{t} = \beta \sum_{s=1}^{p}\sum_{i=l_{s-1}}^{l_{s}-1}w_{s}^{\ast }L^{i/m}X_{t}^{h}+\epsilon _{t}, {} \end{aligned} $$

(24)

where l ₀ = 0 and $w_{s}^{\ast }=$ $\sum _{j=s}^{p}\frac {w_{j}}{l_{j}}$.
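The step-function construction in Eqs. (23) and (24) can be checked numerically. The sketch below builds the partial averages for a small, hypothetical lag index and weight vector, then confirms that weighting the averages by w gives the same regressor as applying the step weights $w_{s}^{\ast }$ directly to the high-frequency observations. All numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 12                          # illustrative sample sizes
lags = np.array([2, 5, 12])            # hypothetical l_1 < l_2 < l_3, each <= m
w = np.array([0.5, 0.3, 0.2])          # hypothetical weights summing to one

# Column i of X_hi plays the role of X^{hi}_{t - i/m}.
X_hi = rng.normal(size=(n, m))

# Equation (23): average of the first l observations, one column per l.
X_tilde = np.column_stack([X_hi[:, :l].mean(axis=1) for l in lags])
X_new = X_tilde @ w

# Equation (24): step weights w*_s = sum_{j >= s} w_j / l_j.
w_star = np.cumsum((w / lags)[::-1])[::-1]

# Expand to one weight per high-frequency observation: observations
# l_{s-1}, ..., l_s - 1 share w*_s; anything beyond l_p gets zero weight.
bounds = np.concatenate([[0], lags])   # l_0 = 0
per_obs = np.repeat(w_star, np.diff(bounds))
per_obs = np.pad(per_obs, (0, m - lags[-1]))
X_new_step = X_hi @ per_obs

print(np.allclose(X_new, X_new_step), per_obs.sum())
```

Because the weights sum to one, the implied per-observation weights also sum to one. They decline in steps as the lag i grows, which is how the construction lets more recent observations matter more without estimating m separate coefficients.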

The weights w play a crucial role in this procedure. We first estimate $\widehat {\beta \boldsymbol {w}}$ following

https://static-content.springer.com/image/chp%3A10.1007%2F978-3-030-66891-4_13/MediaObjects/491778_1_En_13_Equo_HTML.png

by any appropriate econometric method, where $\mathcal {W}$ is some predetermined weight set. Once $\widehat {\beta \boldsymbol {w}}$ is obtained, we estimate the weight vector $\hat {\boldsymbol {w}}$ by rescaling following

$$\displaystyle \begin{aligned} \hat{\boldsymbol{w}} = \frac{\widehat{\beta\boldsymbol{w}}}{\text{Sum}( \widehat{\beta\boldsymbol{w}})}, \end{aligned}$$

since the coefficient β is a scalar.
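The two-step estimation can be sketched as follows, assuming OLS is the chosen method and the weights are only required to sum to one. The data are simulated, the target is treated as already aligned with the regressors, and the values of β and w are hypothetical; this illustrates the rescaling logic and is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 500, 12                         # illustrative sample sizes
lags = np.array([2, 5, 12])            # hypothetical lag index
beta_true = 1.5                        # made-up scalar coefficient
w_true = np.array([0.5, 0.3, 0.2])     # made-up weights summing to one

# Partial averages of the high-frequency observations, as in Equation (23).
X_hi = rng.normal(size=(n, m))
X_tilde = np.column_stack([X_hi[:, :l].mean(axis=1) for l in lags])

# Simulated target, assumed to be already aligned with X_tilde.
y = beta_true * (X_tilde @ w_true) + 0.01 * rng.normal(size=n)

# Step 1: OLS of y on X_tilde (no intercept) estimates the product beta * w.
bw_hat, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)

# Step 2: since the weights sum to one, Sum(beta * w) = beta, so dividing
# by the sum recovers the weight vector.
beta_hat = bw_hat.sum()
w_hat = bw_hat / beta_hat

print(beta_hat, w_hat)
```

The rescaling works only because β is a scalar common to all elements of the product; the sum of the estimated vector then identifies β, and the ratio identifies w.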

Footnotes

See, for example, [25, 26], which conduct horse races between various strategies with data from the film industry. Medeiros et al. [31] use the random forest estimator to examine the benefits of machine learning for forecasting inflation. Last, Coulombe et al. [13] conclude that the benefits from machine learning over econometric approaches for macroeconomic forecasting arise since they capture important nonlinearities that arise in the context of uncertainty and financial frictions.

Traditional econometric approaches to modeling and forecasting volatility, such as the parametric GARCH or stochastic volatility models, include measures built on daily, weekly, and monthly frequency data. While these approaches are popular, empirical studies indicate that they fail to capture all information in high-frequency data; see [1, 7, 20], among others.

This phenomenon has been documented by Dacorogna et al. [15] and Andersen et al. [3] for the foreign exchange market and by Andersen et al. [2] for stock market returns.

Corsi et al. [12] provide a comprehensive review of the development of HAR-type models and their various extensions. The HAR model has an intuitive economic interpretation: agents with three frequencies of trading (daily, weekly, and monthly) perceive and respond to volatility differently, which changes the corresponding components of volatility. Müller et al. [33] refer to this idea as the Heterogeneous Market Hypothesis. Nevertheless, the suitability of such a specification has not been sufficiently verified. Craioveanu and Hillebrand [14] employ a parallel computing method to investigate all of the possible combinations of lags (chosen within a maximum lag of 250) for the last two terms in the additive model, and they compare their in-sample and out-of-sample fitting performance.

We note that the assumption of equal weight is strong. Mai et al. [29] find that social media sentiment is an important predictor in determining Bitcoin’s valuation, but not all social media messages are of equal impact. Yet, our measure of social media is collected from all Twitter users, a more diverse group than users of cryptocurrency forums in [29]. Thus, if we find any effect, it is likely a lower bound since our measure of social media sentiment likely has classical measurement error.

Mining is challenging, and miners who add new blocks are paid any transaction fees as well as a “subsidy” of newly created coins. For the new block to be considered valid, it must contain a proof of work that is verified by other Bitcoin nodes each time they receive a block. By downloading and verifying the blockchain, Bitcoin nodes are able to reach consensus about the ordering of events in Bitcoin. Any currency that is generated by a malicious user that does not follow the rules will be rejected by the network and thus is worthless. To make each new block more challenging to mine, the rate at which a new block can be found is recalculated every 2016 blocks, increasing the difficulty.

For example, the fund of Bill Miller, the legendary former Chief Investment Officer of Legg Mason, has been reported to have 50% exposure to crypto-assets. There is also a growing set of decentralized exchanges, including IDEX, 0x, etc., but their market shares remain low today. Furthermore, given the SEC’s recent charge against EtherDelta, a well-known Ethereum-based decentralized exchange, the future of decentralized exchanges faces significant uncertainties.

Apart from Bitcoin, there are more than 1600 other alternative coins, or cryptocurrencies, listed on over 200 different exchanges. However, Bitcoin still maintains roughly 50% market dominance. At the end of December 2018, the market capitalization of Bitcoin was roughly 65 billion USD, with a price of 3800 USD per token. On December 17, 2017, it reached a peak capitalization of 330 billion USD, with a price of almost 19,000 USD per Bitcoin, according to Coinmarketcap.com.

Using the log to transform the realized variance is standard in the literature. It avoids imposing positivity constraints and addresses heteroskedasticity in the residuals of the regression below that is related to the level of the process, as mentioned by Patton and Sheppard [34]. An alternative is to implement weighted least squares (WLS) on RV, which is not well suited to our purpose of using the least squares model averaging method.

For example, Gu et al. [19] perform a comparative analysis of machine learning methods for measuring asset risk premia. Ban et al. [6] adopt machine learning methods for portfolio optimization. Beyond academic research, the popularity of algorithm-based quantitative exchange-traded funds (ETF) has increased among investors, in part because, as LaFon [24] points out, they offer both lower management fees and lower volatility than traditional stock-picking funds.

This is an impossibility theorem that rules out the possibility that a general-purpose universal optimization strategy exists. As such, researchers should examine the sensitivity of their findings to alternative strategies.

A best split is determined by a given loss function, for example, the reduction of the sum of squared residuals (SSR). A simple regression will yield a sum of squared residuals, SSR₀. Suppose we can split the original sample into two subsamples such that n = n ₁ + n ₂. The RT method finds the best split of a sample to minimize the SSR from the two subsamples. That is, the SSR values computed from each subsample should follow: SSR₁ + SSR₂ ≤ SSR₀.
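The split search described in this note can be sketched in a few lines. The example assumes a single predictor and that each subsample is fitted by its own mean (a constant-only fit), which is one common choice for regression trees; the function names and the toy data are hypothetical.

```python
import numpy as np

def ssr(y):
    """Sum of squared residuals around the sample mean (constant-only fit)."""
    return float(((y - y.mean()) ** 2).sum()) if y.size else 0.0

def best_split(x, y):
    """Return (threshold, SSR_1 + SSR_2) for the split on x with the lowest total SSR."""
    best_thr, best_val = None, ssr(y)   # start from the unsplit sample, SSR_0
    for thr in np.unique(x)[:-1]:       # the largest value would leave one side empty
        left, right = y[x <= thr], y[x > thr]
        total = ssr(left) + ssr(right)
        if total < best_val:
            best_thr, best_val = thr, total
    return best_thr, best_val

# Toy data with an obvious break between x = 3 and x = 4.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.9])

ssr0 = ssr(y)
thr, total = best_split(x, y)
print(thr, total, ssr0)
```

On this toy sample the search selects the threshold at x = 3, and the combined SSR of the two subsamples is far below SSR₀, which is the inequality stated above.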

This is a 10% random sample of all tweets since the USSI was designed to measure the real-time mood of the nation and the algorithm does not restrict the calculations to Twitter accounts that either mention any specific stock or are classified as being a market participant.

We provide full details on this strategy in the appendix. In practice, we need to select the lag index l = [l ₁, …, l _p] and determine the weight set $\mathcal {W}$ before the estimation. In this study, we set $\mathcal {W}\equiv \{\boldsymbol {w}\in \mathbb {R} ^{p}:\sum _{j=1}^{p}w_{j}=1\}$ and use OLS to estimate $\widehat {\beta \boldsymbol {w}}$. We consider h = 1, 2, 4, and 7 as in the main exercise. For the lag index, we consider l = [1 : 5 : 1440], given there are 1440 minutes per day.

Additional results using both the GARCH(1, 1) and the ARFIMA(p, d, q) models are available upon request. These estimators performed poorly relative to the HAR model and as such are not included for space considerations.

Although not reported due to space considerations, we investigated other forecasting horizons and our main findings are robust.

PwC-Elwood [36] suggests that the capitalization of cryptocurrency hedge funds has increased at a steady pace since 2016.

Lehrer and Xie [26] point out that all of the machine learning algorithms considered in this paper assume homoskedastic data. In their study, they discuss the consequences of heteroskedasticity for these algorithms and the resulting predictions, and they propose alternatives for such data.

Lehrer et al. [27] considered the use of model averaging with HAR models to account for heterogeneous investors.

As an example, following the removal of Ivanka Trump’s fashion line from their stores, President Trump issued a statement via Twitter:

My daughter Ivanka has been treated so unfairly by @Nordstrom. She is a great person – always pushing me to do the right thing! Terrible!

The general public response to this Tweet was to disagree with President Trump’s stance on Nordstrom, so aggregate Twitter sentiment measures rose. The immediate negative effect of the Tweet on Nordstrom stock, a decline of 1% in the minute following the tweet, was fleeting: the stock closed the session posting a gain of 4.1%. See http://www.marketwatch.com/story/nordstrom-recovers-from-trumps-terrible-tweet-in-just-4-minutes-2017-02-08 for more details on this episode.

References

1. Andersen, T. G., & Bollerslev, T. (1998). Answering the skeptics: yes, standard volatility models do provide accurate forecasts. International Economic Review, 39(4), 885–905.

2. Andersen, T., Bollerslev, T., Diebold, F., & Ebens, H. (2001). The distribution of realized stock return volatility. Journal of Financial Economics, 61(1), 43–76.

3. Andersen, T. G., Bollerslev, T., Diebold, F. X., & Labys, P. (2001). The distribution of realized exchange rate volatility. Journal of the American Statistical Association, 96(453), 42–55.

4. Andersen, T. G., Bollerslev, T., & Diebold, F. X. (2007). Roughing it up: including jump components in the measurement, modelling, and forecasting of return volatility. The Review of Economics and Statistics, 89(4), 701–720.

5. Baker, M., & Wurgler, J. (2007). Investor sentiment in the stock market. Journal of Economic Perspectives, 21(2), 129–152.

6. Ban, G.-Y., Karoui, N. E., & Lim, A. E. B. (2018). Machine learning and portfolio optimization. Management Science, 64(3), 1136–1154.

7. Blair, B. J., Poon, S.-H., & Taylor, S. J. (2001). Forecasting S&P 100 volatility: the incremental information content of implied volatilities and high-frequency index returns. Journal of Econometrics, 105(1), 5–26.

8. Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.

9. Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.

10. Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees. New York: Chapman and Hall/CRC.

11. Corsi, F. (2009). A simple approximate long-memory model of realized volatility. Journal of Financial Econometrics, 7(2), 174–196.

12. Corsi, F., Audrino, F., & Renó, R. (2012). HAR modelling for realized volatility forecasting. In Handbook of volatility models and their applications (pp. 363–382). Hoboken: John Wiley & Sons.

13. Coulombe, P. G., Leroux, M., Stevanovic, D., & Surprenant, S. (2019). How is machine learning useful for macroeconomic forecasting? In Cirano Working Papers, CIRANO. https://economics.sas.upenn.edu/system/files/2019-03/GCLSS_MC_MacroFcst.pdf

14. Craioveanu, M., & Hillebrand, E. (2012). Why it is ok to use the har-rv (1, 5, 21) model. Technical Report 1201, University of Central Missouri. https://ideas.repec.org/p/umn/wpaper/1201.html

15. Dacorogna, M. M., Müller, U. A., Nagler, R. J., Olsen, R. B., & Pictet, O. V. (1993). A geographical model for the daily and weekly seasonal volatility in the foreign exchange market. Journal of International Money and Finance, 12(4), 413–438.

16. Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A. J., & Vapnik, V. (1996). Support vector regression machines. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems (Vol. 9, pp. 155–161). Cambridge: MIT Press.

17. Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., & Lehmann, S. (2017). Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 1615–1625). Stroudsburg: Association for Computational Linguistics.

18. Giacomini, R., & White, H. (2006). Tests of conditional predictive ability. Econometrica, 74(6), 1545–1578.

19. Gu, S., Kelly, B., & Xiu, D. (2020). Empirical asset pricing via machine learning. Review of Financial Studies, 33(5), 2223–2273. Society for Financial Studies.

20. Hansen, P. R., & Lunde, A. (2005). A forecast comparison of volatility models: does anything beat a garch(1,1)? Journal of Applied Econometrics, 20(7), 873–889.

21. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Springer series in statistics. New York, NY: Springer.

22. Huang, X., & Tauchen, G. (2005). The relative contribution of jumps to total price variance. Journal of Financial Econometrics, 3(4), 456–499.

23. Ke, Z. T., Kelly, B. T., & Xiu, D. (2019). Predicting returns with text data. In NBER Working Papers 26186. Cambridge: National Bureau of Economic Research, Inc.

24. LaFon, H. (2017). Should you jump on the smart beta bandwagon? https://money.usnews.com/investing/funds/articles/2017-08-24/are-quant-etfs-worth-buying

25. Lehrer, S. F., & Xie, T. (2017). Box office buzz: does social media data steal the show from model uncertainty when forecasting for hollywood? Review of Economics and Statistics, 99(5), 749–755.

26. Lehrer, S. F., & Xie, T. (2018). The bigger picture: Combining econometrics with analytics improve forecasts of movie success. In NBER Working Papers 24755. Cambridge: National Bureau of Economic Research.

27. Lehrer, S. F., Xie, T., & Zhang, X. (2019). Does adding social media sentiment upstage admitting ignorance when forecasting volatility? Technical Report, Queen’s University, NY. Available at: http://econ.queensu.ca/faculty/lehrer/mahar.pdf

28. Lehrer, S. F., Xie, T., & Zeng, T. (2019). Does high frequency social media data improve forecasts of low frequency consumer confidence measures? In NBER Working Papers 26505. Cambridge: National Bureau of Economic Research.

29. Mai, F., Shan, J., Bai, Q., Wang, S., & Chiang, R. (2018). How does social media impact bitcoin value? A test of the silent majority hypothesis. Journal of Management Information Systems, 35, 19–52.

30. Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). Statistical and machine learning forecasting methods: concerns and ways forward. PloS One, 13(3), Article No. e0194889. https://doi.org/10.1371/journal.pone.0194889

31. Medeiros, M. C., Vasconcelos, G. F. R., Veiga, Á., & Zilberman, E. (2019). Forecasting inflation in a data-rich environment: The benefits of machine learning methods. Journal of Business & Economic Statistics, 39(1), 98–119. https://doi.org/10.1080/07350015.2019.1637745

32. Mincer, J., & Zarnowitz, V. (1969). The evaluation of economic forecasts. In Economic forecasts and expectations: Analysis of forecasting behavior and performance (pp. 3–46). Cambridge: National Bureau of Economic Research, Inc.

33. Müller, U. A., Dacorogna, M. M., Davé, R. D., Pictet, O. V., Olsen, R. B., & Ward, J. (1993). Fractals and intrinsic time – a challenge to econometricians. Technical report SSRN 5370. https://ssrn.com/abstract=5370

34. Patton, A. J., & Sheppard, K. (2015). Good volatility, bad volatility: signed jumps and the persistence of volatility. The Review of Economics and Statistics, 97(3), 683–697.

35. Probst, P., Boulesteix, A., & Bischl, B. (2019). Tunability: Importance of hyperparameters of machine learning algorithms. Journal of Machine Learning Research, 20, 1–32.

36. PwC-Elwood. (2019). 2019 crypto hedge fund report. https://www.pwc.com/gx/en/financial-services/fintech/assets/pwc-elwood-2019-annual-crypto-hedge-fund-report.pdf

37. Schumaker, R. P., Zhang, Y., Huang, C.-N., & Chen, H. (2012). Evaluating sentiment in financial news articles. Decision Support Systems, 53(3), 458–464.

38. Suykens, J., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293–300.

39. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

40. Vapnik, V. N. (1996). The nature of statistical learning theory. New York, NY: Springer-Verlag.

41. Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.

42. Xie, T. (2019). Forecast bitcoin volatility with least squares model averaging. Econometrics, 7(3), 40:1–40:20.

Title: Do the Hype of the Benefits from Using New Data Science Tools Extend to Forecasting Extremely Volatile Assets?
Authors: Steven F. Lehrer, Tian Xie, Guanxi Yi
Publisher: Springer International Publishing
Book: Data Science for Economics and Finance
Print ISBN: 978-3-030-66890-7

Electronic ISBN: 978-3-030-66891-4

Copyright Year: 2021
DOI: https://doi.org/10.1007/978-3-030-66891-4_13
