1 Introduction
BuildHON
for extracting higher-order dependencies from sequential data to build the Higher-Order Network (HON) representation. BuildHON
, although accurate, faced the challenge of computational complexity as well as parameter dependency. In this work, we address these limitations by proposing a scalable and parameter-free algorithm, BuildHON+
, for accurate extraction of higher-order dependencies from sequential data. Given BuildHON+
, we are also interested in downstream network analysis tasks, adn we focus on the following question in this paper that has not been addressed in prior HON work: Does incorporating higher-order dependencies improve the performance of existing network-based methods for detecting anomalous signals in the sequential data?- We develop a scalable and parameter-free algorithm for higher-order network representation,
BuildHON+
, building on our prior work [2]. We demonstrate the efficiency ofBuildHON+
through comprehensive complexity and performance analysis on the global ship movement data, which is known to exhibit dependencies beyond the fifth order. - We showcase the performance of
BuildHON+
in the task of network-based anomaly detection on a real-world taxi trajectory data. We explain why the parameter dependency in our prior work can be limiting for efficient network construction and as a result, anomaly detection. - Using a large-scale synthetic taxi movement data with 11 billion taxi movements, we show how multiple existing anomaly detection methods that depend on FON collectively fail to capture anomalous navigation behaviors beyond first-order, and how
BuildHON+
can solve the problem.
2 Related work
3 Methods
BuildHON+
. We then show how this new approach enables more accurate anomaly detection (compared to using FON) by incorporating several different network distance measures. Our previous algorithm, BuildHON
required two parameters that had to be specified experimentally, depending on the data set. Furthermore, it uses an exhaustive search for extracting the dependency rules and constructing the network, which becomes impractical for various network analysis tasks, including anomaly detection. It needs two parameters in addition to the detection threshold: a MaxOrder parameter which governs how many orders of dependencies the algorithm will consider in HON, and a MinSupport parameter that discards infrequent observations. These limitations mitigate its applicability to Big Data.3.1 BuildHON+
: building HON from big data
BuildHON+
, a parameter-free algorithm that constructs HON from big data sets. BuildHON+
is a practical approach that preserves higher-order signals in the network representation step (\(S_{i} \rightarrow G_{i}\)) which is essential for anomaly detection. The difference between BuildHON
and BuildHON+
is similar to the difference between pruning and early stopping in decision trees. BuildHON
first builds a HON of all orders from first-order to MaxOrder and then selects branches showing significant higher-order dependencies. BuildHON+
reduces the search space beforehand by checking in each step if increasing the order may produce significant dependencies. Furthermore, BuildHON
can only discover dependencies up to MaxOrder. BuildHON+
however, finds the appropriate dependency order hidden in the raw data and is not limited by MaxOrder. Therefore, the output network resulting from BuildHON+
is a more reliable and accurate representation of the raw data, which is essential for the task of anomaly detection.BuildHON
is the dependency rule extraction step, which answers whether higher-order dependencies exist in the raw sequential data, and how high the orders are. The dependency rules extracted are then converted to higher-order nodes and edges as the building blocks of HON. Rather than deriving a fixed order of dependency for the whole network, the method allows for variable orders of dependencies for more compact representation. Figure 2 illustrates the dependency rule extraction step. BuildHON
first counts the observed n-grams in the raw data (step ), then compute probability distributions for the next steps given the current and previous steps (step ). Finally test if knowing one more previous step significantly changes the distribution for the next step—if so, higher-order dependency exists for the path (step ); this procedure (“rule growing”) is iterated recursively until a pre-defined MaxOrder (shown here \(\mathit{MaxOrder}=3\)). In this example, the probability distribution of the next steps from C changes significantly if the previous step (coming to C from A or B) is known (step ), but knowing more previous steps (coming to C from \(E \rightarrow A\) or \(D\rightarrow B\)) does not make a difference (step ); therefore, paths \(C|A \rightarrow D\) and \(C|A \rightarrow E\) demonstrate second-order dependencies.
3.1.1 Eliminating all parameters
BuildHON
is to set a hard stop for the rule growing process, otherwise, it will iterate indefinitely and keep extending \(\mathcal{S}\). However, we show that we can pre-determine if extending \(\mathcal{S}\) will not produce significantly different distributions, which forms an important basis for BuildHON+
.BuildHON+
no longer requires a MinSupport parameter. Recall that using \(\mathit{MinSupport} >1\) in BuildHON
helps reduce the search space as a crude form of early stopping, with the risk of losing valid higher-order patterns. In BuildHON+
, the dynamic threshold takes care of early stopping without requiring any extra parameter (MinSupport) to limit the search space. This parameter is left in the algorithm only for backward compatibility and is set to 1 by default, but does not serve any initial seeding purpose. In other words, MinSupport is not used in BuildHON+
.3.1.2 Scalability for higher-orders
BuildHON
builds all observations and distributions up to MaxOrder ahead of the rule growing process (Fig. 2 left). This procedure becomes prohibitively expensive for big data with high orders of dependencies: to extract sparse tenth order dependencies, BuildHON
needs to enumerate n-grams from first-order to tenth order and compare probability distributions, which already exceeds a personal computer’s capacity using a typical real-world data set (see Sect. 4).BuildHON+
, on the other hand, uses a lazy construction of observations and distributions that has a much smaller search space, and can easily scale to arbitrarily high order of dependency. Specifically, BuildHON+
does not require the counting of the occurrences of n-grams or calculating the distribution of the next steps, until the rule growing step explicitly asks for such information.BuildHON+
first builds all first-order observations and distributions (Fig. 2 right step –). Given that \(A\rightarrow C\), \(B\rightarrow C\), \(D\rightarrow B\), \(E\rightarrow A\) all have single deterministic options for the next step with \(P=1\), according to \(-\log _{2}(\min (P_{\mathrm{Distr}}(i))) = 0 < \delta \), BuildHON+
knows no higher-order dependencies can possibly exist by extending these bigrams (step ). Only the two paths \(C\rightarrow D\) and \(C\rightarrow E\) will be extended; since the corresponding second-order observations and distributions are not known yet, BuildHON+
selectively derives the necessary information from the raw data (Fig. 2 right step –), and finds that the second-order distributions show significant changes. At this point, both \(C|A\rightarrow D\) and \(C|B\rightarrow E\) have single deterministic options for the next step, so again, BuildHON+
determines no dependencies beyond second-order can exist (step ), so the rule growing procedure stops, without the need for further generation and comparison of distributions.BuildHON
and BuildHON+
.BuildHON
. Suppose the size of raw sequential data is L, and there are \(D_{i}\) distinct n-grams of order of i. All first-order observations (bigrams) take \(\varTheta (2D_{2})\) space, second order observations (trigrams) take \(\varTheta (3D_{3})\) space, and so on; building observations and distributions up to \(k^{\mathrm{th}}\) order takes \(\varTheta (2D_{2}+3D_{3}+\cdots +kD_{k})\) storage, with k being the maximum order allowed, because BuildHON
always keeps raising order until k is reached, while keeping all the breadth-first search results for lower orders. with \(D_{3}\geq D_{2}\), \(D_{4} \geq D_{3}\), resulting in a complexity of \(\varTheta (k^{2}D_{2})\).BuildHON+
. Assume there are \(R_{i}\) distinct n-grams that are exactly of order i. By definition, we have \(R_{i} \leq D_{i}\). Therefore, BuildHON+
’s space complexity is \(\varTheta (2R_{2}+3R_{3}+\cdots +tR_{t})\) (including observations, distributions, and the indexing cache) where \(R_{k}\) is the exact number of higher-order dependency rules for order k. Note that, \(R_{k}\leq L\), but it is not necessarily \(R_{(i+1)}\geq R_{(i)}\). Also \(t\leq k\).BuildHON+
different from BuildHON
is its sensitivity to the underlying data. If the dataset contains very few non-significant n-grams up to maximum specified order, the space complexity of BuildHON+
would not be very different from BuildHON
. However, for very noisy data \((D_{i} \gg R_{i})\) or data with an actual order much smaller than the specified maximum order \((t \ll k)\), the space complexity of BuildHON+
would be significantly smaller than BuildHON
. The same applies to time complexity: while BuildHON
has \(\varTheta (Nk^{2}D_{2})\), BuildHON+
has \(\varTheta (N(2R_{1}+3R_{2}+\cdots ))\). A side-by-side comparison between BuildHON
and BuildHON+
in running time and memory consumption on a real-world data set is provided in Sect. 4.3.2 Higher-order anomaly detection
BuildHON+
, keeps all structures of FON, and when higher-order dependencies exist in the raw sequential data, it splits a node into multiple nodes representing previous steps. We show that certain types of anomalies will remain undetected for all existing network-based anomaly detection methods using FON, but can be revealed by using HON.3.2.1 Distance metrics
BuildHON+
) we apply five network distance measures to detect anomalies. 4 Results
BuildHON+
with BuildHON
in terms of running time and memory consumption on real-world data of various sizes and multiple orders of dependency. Next, we present the anomaly detection results.4.1 Scalability analysis: performance improvement of BuildHON+
over BuildHON
BuildHON+
, instead of the taxi data or the synthetic data (which demonstrates up to third order of dependency), we use the same shipping trajectories data as used in the HON paper [2]. This data was shown to demonstrate dependencies of more than the fifth-order due to ships’ cyclic movement patterns. It consists of up to three years of shipping data (between May \({1^{\mathrm{st}}}\), 1997 and April \({30^{\mathrm{th}}}\), 2003), aggregated over 3-months intervals. The smallest and largest data contains 372,500 and 4,721,936 voyages, respectively.BuildHON+
and BuildHON
. Both implementations run single-threaded on the same Linux machine (Intel Quad 16-core @ 2.10 GHz, 128 GB RAM). BuildHON+
is parameter-free (no limit to the maximum order, optional \(\mathit{MinSupport} = 1\)) and does not require further configuration. We set \(\mathit{MinSupport} = 1\) and \(\mathit{MaxOrder}= 15\) for BuildHON
. We start with the first 3-months of the data and aggregate the trajectories over the next 6 months, 9 months, and so on. Figure 4 illustrate the time and memory usage of both algorithms as the size of the data increases. We observe that BuildHON
is highly sensitive to the size of the data. For the maximum data size, BuildHON
requires approximately 7.2 times more memory than BuildHON+
and takes 4.5 times longer to run.
BuildHON
, and gradually increase MaxOrder from the first-order. Same as above, BuildHON+
does not require further configuration. BuildHON+
was able to find up to \(11^{\mathrm{th}}\) order within 2 minutes, with a peak memory usage less than 5 GB, as the reference lines displayed in Fig. 5. In comparison, BuildHON
already exceeds the running time and memory consumption of BuildHON+
at \(6^{\mathrm{th}}\) order, reaches the physical memory limit at \(8^{\mathrm{th}}\) order, and would need about 22 GB memory and 6 minutes (3× time and 5× memory than BuildHON+
) to achieve the same results as BuildHON+
can. Both implementations run single-threaded on the same laptop (Intel i7-6600U @ 2.60 GHz, 16 GB RAM, SSD).
4.2 Anomaly detection: large-scale synthetic taxi movements
4.2.1 Results
BuildHON+
also creates additional higher-order nodes and edges for higher-order dependencies; (3) capture the six additional cases where higher-order movement patterns are changed but first-order traffic remains the same. Here the topological changes of HON are best reflected with weight distance and spectrum distance (detecting \(10/10\) anomalies); MCS weight method misses the addition of higher-order nodes and edges (\(t=400, 700, 900\)) because those topological changes are excluded from common subgraphs; entropy method misses changes in edge weights (\(t=200, 500, 800, 1000\)), also because by definition a swap in edge weights do not change a graph’s entropy. Nevertheless, all these distance metrics are able to identify more types of anomalous signals simply by using HON instead of FON, with no changes to these distance metrics. In other words, BuildHON+
can be plugged into existing network-based anomaly detection methods directly, and extend their ability in detecting higher-order anomalies.4.3 Anomaly detection: real world taxi data
BuildHON+
with a fixed maximum order of 2 and BuildHON+
with a variable higher-order (discovered to be 3 by the algorithm). We show that BuildHON+
when allowed to discover the maximum order, results in the highest indication of potential anomalies.
4.3.1 Graph distance analysis
BuildHON+
(indicated as HON-2), and no MaxOrder constraint on BuildHON+
(indicated as HON+) in Fig. 10. Our goal is to see the improvements afforded by allowing BuildHON+
to automatically discover the requisite higher-order for a given data, versus specifying the maximum order of 2 using BuildHON
and the FON representation. We compute the graph distances (using weight distance) for neighboring time windows. The histograms of graph distances for each network is shown in Fig. 10 (a), (b), and (c). We also compute the running average and standard deviation using the graph distances in the previous 10 weeks, with the null hypothesis as “the network is not significantly different if the graph distance does not deviate more than 2σ from the mean”. Variation of graph distances for FON, HON-2 and HON+ is shown in Fig. 10 (d), (e), and (f) respectively. While the trend of HON+ resembles that of HON-2 and FON, the graph distances in weeks 43 and 44 are particularly more significant in HON+ than HON-2 and FON (HON-2 offers more significance over FON as well). Such differences are also indicated in the histograms of graph distances in Fig. 10 (a), (b), and (c), where the red circles highlight the correct anomalous signals, which is observable in HON+, while it is not as significant in FON and even HON-2.
BuildHON+
in discovering the appropriate orders given the data. Depending on the data, the MaxOrder value required for accurate detection of anomalies can be different. BuildHON+
removes this dependency, ensuring accurate detection of changes in the network.4.4 Robustness to noise
FON before noise | FON after noise | HON+ before noise | HON+ after noise | |||||
---|---|---|---|---|---|---|---|---|
Week | Value | Week | Value | Week | Value | Week | Value | |
24 | 0.008 | 17 | 0.006 | 24 | 0.0001 | 41 | 0.003 | |
26 | 0.003 | 18 | 0.008 | 43 | 0.052 | 43 | 0.044 | |
41 | 0.009 | 26 | 0.002 | 44 | 0.026 | 44 | 0.025 | |
42 | 0.002 | 41 | 0.020 | – | – | – | – | |
43 | 0.010 | 42 | 0.004 | – | – | – | – | |
44 | 0.008 | 44 | 0.003 | – | – | – | – | |
TP;FP | 2;4 | 1;5 | 2;1 | 2;1 | ||||
Precision | 0.333 | 0.167 | 0.667 | 0.667 | ||||
Recall | 1 | 0.5 | 1 | 1 |
5 Conclusion
BuildHON+
is scalable and parameter-free and automates the process of discovering the appropriate variable and higher-order dependencies for each of the nodes in a network. The complexity analysis of BuildHON+
, as well as running time and memory consumption benchmarking results, demonstrates the scalability of BuildHON+
to large-scale networks.BuildHON+
can accurately capture such anomalies and also work seamlessly with existing anomaly detection methods to enable more accurate detection of anomalies in comparison to using FON.BuildHON+
, as the resulting HON representation does not impose a change in the network analysis method.