Introduction
- The phenomenon of big data and the emergence of big data analytics are introduced, motivating the need for new machine learning algorithms that can take advantage of this huge amount of data.
- A general overview of related works on the application of machine learning to financial time series is provided; in particular, the use of HMMs is discussed, followed by the main approaches for the optimization of HMMs.
- We propose a new big data optimization method based on the constraint satisfaction problem (CSP), a graph-based approach that enhances the learning and prediction tasks of HMMs.
- We experimentally evaluate the proposed approach on a real-world dataset and compare it to the conventional HMM and to reference models in terms of computational complexity, running time, and prediction accuracy (MAPE and other metrics).
Problem formulation
- reject the nodes that cannot be reached under certain constraints and thereby reduce the dimension of the state space.
- delete unnecessary arcs between nodes and thereby reduce the size of the transition probability matrix.
- speed up the learning process using the Baum-Welch algorithm while maintaining a very high level of accuracy (a sketch of this pruning idea follows the list).
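A minimal sketch of this pruning idea, assuming the HMM is stored as a NumPy transition matrix and the CSP constraints are summarized by a boolean mask over the states (the function name, the mask, and the arc-pruning threshold are illustrative, not the paper's implementation):

```python
import numpy as np

def prune_state_space(A, allowed, eps=1e-12):
    """Reject the states excluded by the constraints, delete the arcs that
    disappear with them, and return the reduced transition matrix.

    A       : (N, N) transition matrix, A[i, j] = P(q_{t+1}=s_j | q_t=s_i)
    allowed : boolean mask of length N, True for states satisfying the constraints
    """
    keep = np.flatnonzero(allowed)            # indices of the surviving states
    A_red = A[np.ix_(keep, keep)].copy()      # drop rejected nodes and their arcs
    A_red[A_red < eps] = 0.0                  # remove negligible arcs between nodes
    row_sums = A_red.sum(axis=1, keepdims=True)
    A_red = np.divide(A_red, row_sums, out=np.zeros_like(A_red), where=row_sums > 0)
    return keep, A_red                        # Baum-Welch then runs on the N' x N' matrix

# Toy usage: 4 states, the constraints reject the last one.
A = np.array([[0.7, 0.2, 0.1, 0.0],
              [0.1, 0.6, 0.2, 0.1],
              [0.3, 0.3, 0.4, 0.0],
              [0.0, 0.5, 0.5, 0.0]])
keep, A_red = prune_state_space(A, allowed=np.array([True, True, True, False]))
```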
Related work
Background
Hidden Markov models
Element | Description |
---|---|
\(\lambda \) | DHMM model, \(\lambda = (A, B, \Pi )\) or CHMM model, \(\lambda = (A, c_{jm}, \mu _{jm}, \Sigma _{jm}, \Pi )\). |
S | The set of hidden states of the HMM \(S = \{s_{1}, s_{2}, ..., s_{N}\}\), (N states). |
V | The set of possible observation values \(V = \{v_{1}, v_{2}, ..., v_{M} \}\), (M observation symbols). |
O | The observation sequence \(O = \{o_{1}, o_{2}, ..., o_{T} \}\). |
Q | The hidden state sequence \(Q = \{q_{1}, q_{2}, ..., q_{T} \}\), \(q_{t} \in S\). |
A | The transition matrix \(A=\{a_{ij} \}\), \(a_{ij}=P(q_{t+1}=s_{j} \mid q_{t}=s_{i})\), where \(1\le i,j \le N\). \(q_{t}\) is the state at time t. \(a_{ij}\) is the transition probability from state \(s_{i}\) to state \(s_{j}\). For every state \(s_{i}\), \(\sum _{j=1}^N a_{ij} =1\) and \(a_{ij}\ge 0\). |
B | The observation matrix \(B = \{b_{j}(o_{t})\}\), where \(b_{j}(o_{t})=P(o_{t} \mid q_{t}=s_{j})\) is the probability of the \(t^{th}\) observation being emitted in state \(s_{j}\). For continuous observations, \(b_{j}(o_{t})\) is a probability density function (pdf) or a mixture of continuous pdfs: \(b_{j}(o_{t})=\sum _{m=1}^{M} c_{jm}N(o_{t}, \mu _{jm},\Sigma _{jm})\), where \(N(o_{t}, \mu _{jm},\Sigma _{jm})= \frac{1}{((2\pi )^{d}|\Sigma _{jm}|)^{1/2}}\exp \left( -\frac{1}{2}(o_{t}-\mu _{jm}) \Sigma _{jm}^{-1}(o_{t}-\mu _{jm})^T \right) \). \(o_{t}\) is the observation feature vector recorded at time t, M is the total number of mixture components and d is the dimension of \(o_{t}\). For every state \(s_{j}\), \(\sum _{k=1}^M b_{j}(v_{k}) =1\) in the discrete case; in the continuous case, each mixture density integrates to 1. |
\(\Pi \) | The stochastic initial distribution vector \(\Pi =\{\pi _{i} \}\), where \(\pi _{i}\) is the probability of \(s_{i}\) being the first state of a state sequence. \(\pi _{i} = P(q_{1}=s_{i})\), \(1\le i \le N\). \(\sum _{i=1}^N \pi _{i} =1\). |
\(P(O\mid \lambda )\) | The probability that a given observation sequence \(O = \{o_{1}, o_{2}, ..., o_{T} \}\) is generated by the model \(\lambda \). |
\(\alpha _t(i)\) | The forward variable, defined as \(\alpha _t(i)=P(o_{1}, o_{2}, ..., o_{t},q_{t}=s_{i} \mid \lambda )\).
\(\beta _t(i)\) | Backward variable, defined as \(\beta _{t}(i)=P(o_{t+1}o_{t+2}...o_{T}\mid q_{t}=s_{i},\lambda )\). |
\(\gamma _t(i)\) | The probability of being in state \(s_{i}\) at time t given \(\lambda \) and O. \(\gamma _t(i)=P(q_{t}=s_{i} \mid O, \lambda )\). |
\(\xi _t(i,j)\) | The probability of being in state \(s_{i}\) at time t and in state \(s_{j}\) at time \(t+1\) given the model parameter \(\lambda \) and the observation sequence O. \(\xi _{t}(i,j)=P(q_{t}=s_{i},q_{t+1}=s_{j} \mid O, \lambda )\). |
\(\gamma _t(j,m)\) | The probability that, given the model \(\lambda \) and the observation sequence O, the observation \(o_{t}\) is generated from state \(s_{j}\) and accounted for by the \(m^{th}\) component of the Gaussian mixture density of state \(s_{j}\).
\(\mu _{jm}\) | The mean of the \(m^{th}\) mixture in state \(s_{j}\). |
\(\Sigma _{jm}\) | The covariance matrix of the \(m^{th}\) mixture in state \(s_{j}\). |
\(c_{jm}\) | The weight of the \(m^{th}\) mixture component in state \(s_{j}\), where \(\sum _{m=1}^{M} c_{jm}=1\) and \(c_{jm}\ge 0\).
\(\delta _t(i)\) | The likelihood score of the optimal (most likely) sequence of hidden states of length t (ending in state \(s_{i}\)) that produces the first t observations for the given model. \(\delta _{t}(i)=\underset{q_{1},q_{2},...,q_{t-1}}{\max }P(q_{1},q_{2},...,q_{t-1},q_{t}=s_{i},o_{1},o_{2},...,o_{t}\mid \lambda )\).
\(\psi _t(i)\) | The array of back pointers; \(\psi _{t}(i)\) stores the predecessor state on the most probable path ending in state \(s_{i}\) at time t.
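As an illustration of the continuous emission density \(b_{j}(o_{t})\) defined in the table above, a minimal NumPy sketch, assuming full covariance matrices and an observation stored as a flat vector (variable names are illustrative):

```python
import numpy as np

def emission_prob(o_t, c_j, mu_j, Sigma_j):
    """b_j(o_t) = sum_m c_jm * N(o_t, mu_jm, Sigma_jm) for a single state s_j.

    o_t     : (d,) observation feature vector at time t
    c_j     : (M,) mixture weights of state s_j (summing to 1)
    mu_j    : (M, d) mixture means
    Sigma_j : (M, d, d) mixture covariance matrices
    """
    d = o_t.shape[0]
    b = 0.0
    for c, mu, Sigma in zip(c_j, mu_j, Sigma_j):
        diff = o_t - mu
        norm = ((2 * np.pi) ** d * np.linalg.det(Sigma)) ** -0.5   # 1 / ((2*pi)^d |Sigma|)^(1/2)
        expo = -0.5 * diff @ np.linalg.inv(Sigma) @ diff           # Mahalanobis term
        b += c * norm * np.exp(expo)
    return b
```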
- Evaluation: Given a hidden Markov model \(\lambda \) and an observation sequence O, find \(P(O \mid \lambda )\), the probability that the sequence O was generated by the model \(\lambda \) (see the sketch after this list).
- Decoding (searching for the most likely path): Given a hidden Markov model \(\lambda \) and an observation sequence O, find the hidden state sequence Q that best explains the observations, i.e., the sequence that maximizes \(P(Q \mid O, \lambda )\).
- Learning: Given a hidden Markov model \(\lambda \) with unspecified transition/emission probabilities and a set of observation sequences, find the parameters A and B of the hidden Markov model that maximize the probability of these sequences, \(P(O \mid \lambda )\).
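A hedged sketch of the evaluation problem for a discrete HMM: the forward algorithm accumulates \(\alpha _t(i)\) and returns \(P(O \mid \lambda )\). This is the textbook recursion, not necessarily the paper's implementation:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Return P(O | lambda) for a discrete HMM by the forward recursion.

    A   : (N, N) transition matrix
    B   : (N, M) observation matrix, B[j, k] = P(o_t = v_k | q_t = s_j)
    pi  : (N,) initial state distribution
    obs : sequence of observation indices o_1, ..., o_T
    """
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(o_1)
    for o_t in obs[1:]:
        alpha = (alpha @ A) * B[:, o_t]  # alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) * b_j(o_{t+1})
    return alpha.sum()                   # P(O | lambda) = sum_i alpha_T(i)
```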
Big data
Features of big data
Big data challenges
- accelerating data processing and analysis as much as possible to reduce the computation time.
- developing new methods that can scale to big data and deal with heterogeneous, loosely structured data and high computational complexity.
- performing efficient synchronization between different systems and overcoming bottlenecks at system binding points.
- proposing effective security solutions to increase the level of security and protect the privacy and information of individuals and companies, as security vulnerabilities can multiply with data proliferation.
- improving the efficiency of algorithms in managing the large amount of data storage required.
- producing real-time results by enabling real-time analysis of data flows from different sources, which gives more value to knowledge.
- improving the management and manipulation of multidimensional data to overcome the problem of incomplete or noisy information.
- reinforcing big data analytics.
- exploiting advances in computer hardware in the development of increasingly rich classes of models.
Metaheuristics and big data optimization
Optimization problem
Metaheuristics concepts
- Representation/encoding: The encoding of possible solutions to an optimization problem is a very important concept when designing metaheuristics. For each problem, suitable operators and optimization functions must be chosen so that the encoding is efficient, and the encoding must then be checked against a set of properties to ensure it is feasible.
- Constraint satisfaction: The solution of an optimization problem can be described by assigning values to the variables of the problem and defining a set of constraints to be respected. A solution is therefore feasible if it respects all these constraints, which are generally difficult to formulate and depend on the studied problem and the chosen optimization criterion.
- Optimization criterion/objective function: To formulate a data mining task as an optimization problem, it is necessary to identify the optimization criterion and to carefully define the objective function. Choosing these two basic concepts correctly ensures the development of an efficient optimization method and guarantees high-quality solutions.
- Performance analysis: Another fundamental concept that requires careful study is the performance analysis of metaheuristics. The objectives of the experiments must be set, appropriate performance measures chosen, and finally, depending on the purpose of the experiments, the indicators identified and computed in order to evaluate the quality of the solutions.
Constraint satisfaction problem
Research methodology
Link between the HMM and CSP
- Variables (states or nodes): \(\{s_{t_{1}}, s_{t_{2}}, ..., s_{t_{p}}\}\), where \(t_{1}, t_{2}, ..., t_{p}\) are fixed time steps. \(t_{1}\) is the current time, and \(t_{p}\) is the time for which we want to predict the state of the system.
- Domains (state values): \(\{D_{1}, D_{2}, ..., D_{p}\}\), where \(D_{k}=\{s_{1}, s_{2}, ..., s_{N}\}\) for \(k=1, 2, ..., p\).
- Constraints (state transitions, arcs or edges): \(\sum _{j=1}^N Pr\{s_{t+1}=s_{j} \mid s_{t}=s_{i}\}=1\) and \(Pr\{s_{t+1}=s_{j} \mid s_{t}=s_{i}\}\ge 0\) for \(i,j=1, 2, ..., N\), matching the rows of the transition matrix A (a sketch of this mapping follows the list).
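A minimal sketch of this HMM-to-CSP mapping in plain Python: the variables are the p time steps, the domains are the N hidden states, and a binary constraint keeps an arc between consecutive variables only when its transition probability is high enough. The pruning threshold and the helper names are assumptions for illustration:

```python
import numpy as np

def build_csp(A, p, threshold=0.0):
    """Map an N-state HMM onto a CSP over p time steps.

    Variables  : s_t1, ..., s_tp (one per time step)
    Domains    : the N hidden states, for every variable
    Constraints: an arc (i, j) between consecutive variables is kept only
                 when the transition probability a_ij exceeds `threshold`
                 (the threshold value is an illustrative assumption).
    """
    N = A.shape[0]
    variables = [f"s_t{k}" for k in range(1, p + 1)]
    domains = {v: list(range(N)) for v in variables}
    allowed_arcs = {(i, j) for i in range(N) for j in range(N) if A[i, j] > threshold}
    return variables, domains, allowed_arcs

def consistent(assignment, allowed_arcs):
    """Check the binary constraints between consecutive time steps.
    `assignment` is a list of state indices, one per variable, in time order."""
    return all((assignment[k], assignment[k + 1]) in allowed_arcs
               for k in range(len(assignment) - 1))
```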
Resolution algorithm
Data extraction
Feature selection
Problem modeling
HMM state space optimization using the CSP approach
Learning step using the Baum-Welch Algorithm
Prediction step using the Viterbi Algorithm
Model evaluation
Experiments and results
Case study
- States: very small rise (0 \(\le \) Variation < 0.1%), small rise (0.1% \(\le \) Variation < 1%), large rise (1% \(\le \) Variation < 2%), very large rise (Variation \(\ge \) 2%), very small drop (−0.1% < Variation \(\le \) 0), small drop (−1% < Variation \(\le \) −0.1%), large drop (−2% < Variation \(\le \) −1%), and very large drop (Variation \(\le \) −2%), where Variation = Close price(t+1) − Close price(t) (a sketch mapping Variation to these states follows this list).
- Observations: djia-close-price, djia-high, djia-low, djia-variation, djia-volume, nasdaq-high, nasdaq-low, nasdaq-variation, nasdaq-volume, s&p500-high, s&p500-low, s&p500-variation, and s&p500-volume.
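A minimal sketch of the mapping from a daily variation (in %) to the eight hidden states listed above; the tie at exactly 0 is resolved in favour of "very small rise", and the function name is illustrative:

```python
def variation_to_state(variation):
    """Map a DJIA close-price variation (in %) to one of the eight hidden states."""
    if variation >= 2:    return "very large rise"
    if variation >= 1:    return "large rise"
    if variation >= 0.1:  return "small rise"
    if variation >= 0:    return "very small rise"   # a variation of exactly 0 falls here
    if variation > -0.1:  return "very small drop"
    if variation > -1:    return "small drop"
    if variation > -2:    return "large drop"
    return "very large drop"
```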
Test data
Date | Open | High | Low | Close | Adjusted close | Volume |
---|---|---|---|---|---|---|
Jan 20, 2017 | 19795.0605 | 19843.9394 | 19759.1406 | 19827.2500 | 19827.2500 | 435260000 |
Jan 23, 2017 | 19794.7890 | 19833.9804 | 19732.3593 | 19799.8496 | 19799.8496 | 326690000 |
Jan 16, 2020 | 29131.9492 | 29300.3203 | 29131.9492 | 29297.6406 | 29297.6406 | 252110000 |
Jan 17, 2020 | 29313.3105 | 29373.6191 | 29289.9101 | 29348.0996 | 29348.0996 | 321820000 |
Experimental setup
Computational complexity
Algorithm | Computational complexity (original algorithm) | Computational complexity (CSP-Optimized algorithm) |
---|---|---|
Baum-Welch | \(O (N^2(T-1))\) | \(O ({N^\prime }^2(T-1))\) |
Viterbi | \(O (N^2(T-1))\) | \(O ({N^\prime }^2(T-1))\) |
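Since both algorithms are quadratic in the number of states, the expected gain of the CSP-optimized versions is roughly \((N/N^{\prime })^{2}\). A tiny illustration with hypothetical values of N and \(N^{\prime }\) (not figures from the paper):

```python
def theoretical_speedup(N, N_prime):
    """Ratio of the original cost O(N^2 (T-1)) to the reduced cost O(N'^2 (T-1))."""
    return (N / N_prime) ** 2

# Hypothetical example (not the paper's values): pruning 8 states down to 6
# would cut the per-iteration work by a factor of about 1.78.
print(theoretical_speedup(8, 6))
```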
Performance evaluation
Running time
Algorithm | Original algorithm running time (s) | CSP-Optimized algorithm running time (s) |
---|---|---|
Viterbi | 0.1 | 0.08 |
Number of iterations | Original Baum-Welch running time (s) | CSP-Optimized Baum-Welch running time (s) |
---|---|---|
100000 | 2580 | 2520 |
10000 | 2054 | 1987 |
1000 | 1976 | 1933 |
600 | 1820 | 1768 |
Quality of prediction
Number of iterations | Standard HMM accuracy (%) | CSP-Optimized HMM accuracy (%) |
---|---|---|
100000 | 96.15 | 96.86 |
10000 | 94.36 | 94.36 |
1000 | 91.05 | 89.45 |
600 | 87.08 | 85.12 |
Number of iterations | Standard HMM recall (%) | CSP-Optimized HMM recall (%) |
---|---|---|
100000 | 91.85 | 90.93 |
10000 | 87.30 | 86.60 |
1000 | 87.40 | 87.95 |
600 | 84.50 | 86.20 |
Number of iterations | Standard HMM precision (%) | CSP-Optimized HMM precision (%) |
---|---|---|
100000 | 87.40 | 88.30 |
10000 | 85.90 | 86.07 |
1000 | 86.40 | 86.16 |
600 | 78.60 | 76.83 |
Number of iterations | Standard HMM F-measure (%) | CSP-Optimized HMM F-measure (%) |
---|---|---|
100000 | 89.56 | 89.59 |
10000 | 86.59 | 86.33 |
1000 | 86.89 | 87.04 |
600 | 81.44 | 81.24 |
Number of iterations | Standard HMM MAPE (%) | CSP-Optimized HMM MAPE (%) |
---|---|---|
100000 | 3.85 | 3.14 |
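For reference, the MAPE values reported above follow the standard definition, which can be computed as in this sketch (not the paper's code):

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in %."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs((actual - predicted) / actual))
```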
Result validation
 | CSP-Optimized HMM Correct Predictions | CSP-Optimized HMM Incorrect Predictions |
---|---|---|
Standard HMM Correct Predictions | a = 714 (number of Yes/Yes) | b = 12 (number of Yes/No) |
Standard HMM Incorrect Predictions | c = 25 (number of No/Yes) | d = 3 (number of No/No) |
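The contingency table above compares the two models on the same test set; if its discordant counts b and c are used in a McNemar-type test (an assumption, the text does not name the test), the computation would look like this sketch:

```python
def mcnemar_statistic(b, c):
    """Chi-square statistic of McNemar's test (with continuity correction).

    b : cases predicted correctly by the standard HMM only
    c : cases predicted correctly by the CSP-optimized HMM only
    """
    return (abs(b - c) - 1) ** 2 / (b + c)

# With b = 12 and c = 25 from the table, the statistic is about 3.89, to be
# compared with the chi-square critical value at 1 degree of freedom (3.84 at the 5% level).
print(mcnemar_statistic(12, 25))
```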