14-01-2023 | Regular Paper

A meta-level analysis of online anomaly detectors

Authors: Antonios Ntroumpogiannis, Michail Giannoulis, Nikolaos Myrtakis, Vassilis Christophides, Eric Simon, Ioannis Tsamardinos

Published in: The VLDB Journal | Issue 4/2023

Abstract

Real-time detection of anomalies in streaming data is receiving increasing attention as it allows us to raise alerts, predict faults, and detect intrusions or threats across industries. Yet, little attention has been given to comparing the effectiveness and efficiency of anomaly detectors for streaming data (i.e., of online algorithms). In this paper, we present a qualitative, synthetic overview of major online detectors from different algorithmic families (i.e., distance, density, tree or projection based) and highlight their main ideas for constructing, updating and testing detection models. Then, we provide a thorough analysis of the results of a quantitative experimental evaluation of online detection algorithms along with their offline counterparts. The behavior of the detectors is correlated with the characteristics of different datasets (i.e., meta-features), thereby providing a meta-level analysis of their performance. Our study addresses several missing insights from the literature such as (a) how reliable detectors are against a random classifier and what dataset characteristics make them perform randomly; (b) to what extent online detectors approximate the performance of their offline counterparts; (c) which sketch strategy and update primitives of detectors are best for detecting anomalies visible only within a feature subspace of a dataset; (d) what the trade-offs are between the effectiveness and the efficiency of detectors belonging to different algorithmic families; (e) which specific characteristics of datasets lead an online algorithm to outperform all others.

In this paper, we use the terms outlier, novelty and anomaly detection interchangeably.

We call “sample” an element (i.e., observation, measurement) of a data stream.

A hyper-parameter cannot be estimated from the data.

The recently proposed distance-based detector NETS [73] consumes fewer resources than MCOD but does not report any improvement in terms of effectiveness.

Nearest neighbors are distinguished according to maximum and average distance.

\(2^{h+1} - 1\) is the number of nodes in a perfect binary tree.
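
The count can be checked by summing the nodes level by level. This is a short derivation under the assumption that h denotes the height of the tree with the root at level 0 (the footnote does not define h explicitly), so that level i holds \(2^{i}\) nodes:

```latex
\[
\sum_{i=0}^{h} 2^{i} \;=\; \frac{2^{h+1} - 1}{2 - 1} \;=\; 2^{h+1} - 1
\]
```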

https://aws.amazon.com/kinesis/

\(2 n - 1\) is the number of nodes of a full binary tree with n leaves.
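
As an illustration of this count, the following minimal sketch builds a full binary tree (every node has either zero or two children) and counts its nodes. The `Node` class and both helper functions are hypothetical names introduced only for this example; they are not taken from the paper or from any detector implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_full_tree(n_leaves: int) -> Node:
    """Build one full binary tree (each node has 0 or 2 children) with n_leaves leaves."""
    if n_leaves < 1:
        raise ValueError("a tree needs at least one leaf")
    if n_leaves == 1:
        return Node()
    # Deliberately unbalanced split: the node count does not depend on the shape.
    return Node(left=build_full_tree(1), right=build_full_tree(n_leaves - 1))


def count_nodes(node: Optional[Node]) -> int:
    """Count all nodes (internal nodes and leaves) below and including `node`."""
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)
```

A tree with n leaves has n - 1 internal nodes, since each internal node merges two subtrees into one, which gives n + (n - 1) = 2n - 1 nodes in total.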

We consider the coordinates of each point as (F2, F1).

In our implementation, we consider the simple forgetting mechanism rather than the time-decaying mechanism.

Authors report that the value \(\gamma =0.01\) is optimal for most datasets.

https://github.com/Waikato/moa

https://infolab.usc.edu/Luan/Outlier/CountBasedWindow/DODDS/src/outlierdetection/

https://github.com/kaist-dmlab/STARE

https://github.com/tranvanluan2/cpod

https://github.com/cmuxstream/cmuxstream-core

Online version: http://agents.fel.cvut.cz/stegodata/tools/; offline version: https://github.com/yzhao062/pyod

https://github.com/bedanta01/Subspace-Outlier-Detection

https://scikit-learn.org/

https://github.com/ngoix/OCRF

After removing null values and categorical features not treated by our anomaly detectors.

https://www.ipd.kit.edu/~muellere/HiCS/

There are also some cases of unknown anomalies that may appear in any of those categories.

The value of k is automatically computed during training.

When p is stored in a leaf of size larger than one, the value of !t(p) is adjusted to c(size).

In most experimental studies [15, 24, 25], detectors were executed using the default values of their hyper-parameters as “recommended by their authors.”

RS does not require computing the gradient of the function to be optimized and hence can be used on functions that are not continuous or differentiable [31].
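
To make the gradient-free property concrete, here is a minimal sketch that assumes RS denotes random search and uses its simplest variant (independent uniform sampling). The function names, the bounds, and the step-shaped objective are hypothetical choices for illustration only; they do not reproduce the tuning protocol of the paper.

```python
import random
from typing import Callable, Dict, Tuple


def random_search(
    objective: Callable[[Dict[str, float]], float],
    bounds: Dict[str, Tuple[float, float]],
    n_trials: int = 200,
    seed: int = 0,
) -> Tuple[Dict[str, float], float]:
    """Maximize `objective` by sampling candidates uniformly within `bounds`."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        candidate = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        score = objective(candidate)  # only function values are used, never a gradient
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


def step_objective(params: Dict[str, float]) -> float:
    """Hypothetical piecewise-constant objective: discontinuous, with zero gradient elsewhere."""
    return 1.0 if 0.4 <= params["x"] <= 0.6 else 0.0


best, score = random_search(step_objective, {"x": (0.0, 1.0)})
```

A gradient-based optimizer would receive no signal from `step_objective`, whereas random search only needs to evaluate it at sampled points.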

https://en.wikipedia.org/wiki/Coefficient_of_variation
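
The footnote only links to the definition; as a reminder, the coefficient of variation is the ratio of the standard deviation to the mean. The sketch below assumes the population standard deviation and a non-zero mean. The function name is a hypothetical helper for this example, not code from the paper.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Ratio of the standard deviation to the mean (population std, an assumption here)."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.pstdev(values) / mean


# Mean 5 and population standard deviation 2 give a ratio of 0.4.
cv = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
```

Because the ratio is dimensionless, it allows the dispersion of features measured on different scales to be compared.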

1. Aggarwal, C.: An Introduction to Outlier Analysis, pp. 1–40 (2013)

2. Aggarwal, C., Hinneburg, A., Keim, A.: On the surprising behavior of distance metrics in high dimensional spaces. In: ICDT (2001)

3. Aggarwal, C., Sathe, S.: Theoretical foundations and algorithms for outlier ensembles. SIGKDD Explor. 17(1), 74 (2015)

4. Aggarwal, C., Sathe, S.: Outlier Ensembles: An Introduction. Springer, Berlin (2017)

5. Akidau, T., Bradshaw, R., Chambers, C., Chernyak, S., Fernández-Moctezuma, R., Lax, R., McVeety, S., Mills, D., Perry, F., Schmidt, E., Whittle, S.: The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. PVLDB 8(12), 68 (2015)

6. Alcobaça, E., Siqueira, F., Rivolli, A., Garcia, L., Oliva, J., de Carvalho, A.: MFE: towards reproducible meta-feature extraction. JMLR 21(111), 1–5 (2020)

7. Bailis, P., Gan, E., Madden, S., Narayanan, D., Rong, K., Suri, S.: MacroBase: prioritizing attention in fast data. In: SIGMOD (2017)

8. Ben-Haim, Y., Tom-Tov, E.: A streaming parallel decision tree algorithm. JMLR 11, 994 (2010)

9. Bergmeir, C., Benítez, M.: On the use of cross-validation for time series predictor evaluation. Inf. Sci. 7, 191 (2012)

10. Birge, L., Rozenholc, Y.: How many bins should be put in a regular histogram. In: ESAIM: Probability and Statistics, pp. 24–45 (2006)

11. Blázquez-García, A., Conde, A., Mori, U., Lozano, J.: A review on outlier/anomaly detection in time series data. ACM Comput. Surv. 54(3), 98 (2021)

12. Braei, M., Wagner, S.: Anomaly detection in univariate time-series: a survey on the state-of-the-art. CoRR, arxiv:2004.00433 (2020)

13. Branco, P., Torgo, L., Ribeiro, R.: A survey of predictive modelling under imbalanced distributions. CoRR, arxiv:1505.01658 (2015)

14. Breunig, M., Kriegel, H., Ng, R., Sander, J.: LOF: identifying density-based local outliers. SIGMOD Rec. 29(2), 799 (2000)

15. Campos, G., Zimek, A., Sander, J., Campello, R., Micenková, B., Schubert, E., Assent, I., Houle, M.: On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min. Knowl. Discov. 30(4), 891–927 (2016)

16. Cao, L., Yang, D., Wang, Q., Yu, Y., Wang, J., Rundensteiner, E.: Scalable distance-based outlier detection over high-volume data streams (2014)

17. Carbone, P., Fragkoulis, M., Kalavri, V., Katsifodimos, A.: Beyond analytics: the evolution of stream processing systems. In: SIGMOD (2020)

18. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 96 (2009)

19. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection for discrete sequences: a survey. IEEE TKDE 24(5), 119 (2012)

20. Choudhary, D., Kejariwal, A., Orsini, F.: On the runtime-efficacy trade-off of anomaly detection techniques for real-time streaming data. CoRR, arxiv:1710.04735 (2017)

21. Cook, A., Mısırlı, G., Fan, Z.: Anomaly detection for IoT time-series data: a survey. IEEE IoT J. 7(7), 88 (2020)

22. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: ICML’06, pp. 233–240 (2006)

23. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. JMLR 7, December (2006)

24. Domingues, R., Filippone, M., Michiardi, P., Zouaoui, J.: A comparative evaluation of outlier detection algorithms. Pattern Recogn. 74(C), 406–421 (2018)

25. Domingues, R., Filippone, M., Michiardi, P., Zouaoui, J.: A comparative evaluation of outlier detection algorithms: experiments and analyses. Pattern Recognit. 74, 478 (2018)

26. Dua, D., Graff, C.: UCI Machine Learning Repository. University of California, School of Information and Computer Sciences, Irvine (2017)

27. Dudani, S.: The distance-weighted k-nearest-neighbor rule. IEEE Trans. SMC 6(4), 325–327 (1976)

28. Emmott, A., Das, S., Dietterich, T., Fern, A., Wong, W.: A meta-analysis of the anomaly detection problem. CoRR, arxiv:1503.01158 (2015)

29. Goix, N., Drougard, N., Brault, R., Chiapino, M.: One class splitting criteria for random forests. In: ACML (2017)

30. Goldstein, M., Uchida, S.: A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS One 11(4), 887 (2016)

31. Granichin, O.N., Volkovich, Z., Toledano-Kitai, D.: Randomized Algorithms in Automatic Control and Data Mining, vol. 67. Springer, Berlin (2015)

32. Guha, S., Mishra, N., Roy, G., Schrijvers, O.: Robust random cut forest based anomaly detection on streams. In: ICML’16, pp. 2712–2721 (2016)

33. Gupta, M., Gao, J., Aggarwal, C., Han, J.: Outlier detection for temporal data: a survey. IEEE TKDE 26(9), 83 (2014)

34. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, Berlin (2009)

35. Herbold, S.: Autorank: a python package for automated ranking of classifiers. J. Open Source Softw. 3, 2173 (2020)

36. Hodge, V., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22(2), 69 (2004)

37. Jacob, V., Song, F., Stiegler, A., Rad, B., Diao, Y., Tatbul, N.: Exathlon: a benchmark for explainable anomaly detection over time series. PVLDB 14(11), 58 (2021)

38. Keller, F., Müller, E., Böhm, K.: HiCS: high contrast subspaces for density-based outlier ranking. In: ICDE, pp. 1037–1048 (2012)

39. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI (1995)

40. Kontaki, M., Gounaris, A., Papadopoulos, A., Tsichlas, K., Manolopoulos, Y.: Continuous monitoring of distance-based outliers over data streams. In: ICDE (2011)

41. Li, J., Maier, D., Tufte, K., Papadimos, V., Tucker, P.: Semantics and evaluation techniques for window aggregates in data streams. In: SIGMOD (2005)

42. Lindner, G., Studer, R.: AST: support for algorithm selection with a CBR approach. In: Principles of Data Mining and Knowledge Discovery, pp. 418–423 (1999)

43. Liu, T., Ting, K.M., Zhou, Z.: Isolation forest. In: ICDM, pp. 413–422 (2008)

44. Lobo, J., Jiménez-Valverde, A., Real, R.: AUC: a misleading measure of the performance of predictive distribution models. Global Ecol. Biogeogr. 17(2), 9008 (2008)

45. Manzoor, E., Lamba, H., Akoglu, L.: xStream: outlier detection in feature-evolving data streams. In: KDD (2018)

46. Na, G.S., Kim, D., Yu, H.: DILOF: effective and memory efficient local outlier detection in data streams. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1993–2002 (2018)

47. Orair, G., Teixeira, C., Meira, W., Wang, Y., Parthasarathy, S.: Distance-based outlier detection: consolidation and renewed bearing. PVLDB 3(2), 788 (2010)

48. Pang, G., Shen, C., Cao, L., Hengel, A.: Deep learning for anomaly detection: a review. ACM Comput. Surv. 54(2), 89 (2021)

49. Pevný, T.: Loda: lightweight on-line detector of anomalies. Mach. Learn. 102(2), 116 (2016)

50. Qin, X., Cao, L., Rundensteiner, E.A., Madden, S.: Scalable kernel density estimation-based local outlier detection over large data streams. In: EDBT (2019)

51. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets, pp. 427–438 (2000)

52. Rastrigin, L.A.: The convergence of the random search method in the extremal control of a many parameter system. Autom. Remote Control 4, 1337–1342 (1963)

53. Rogers, J., Gunn, S.: Identifying feature relevance using a random forest. In: SLSFS, pp. 173–184, Bohinj, Slovenia (2005)

54. Roy, S.N.: On a heuristic method of test construction and its use in multivariate analysis. Ann. Math. Stat. 6, 220–238 (1953)

55. Sadik, S., Gruenwald, L.: Research issues in outlier detection for data streams. SIGKDD Explor. Newsl. 15(1), 78 (2014)

56. Saito, T., Rehmsmeier, M.: The precision-recall plot is more informative than ROC when evaluating binary classifiers on imbalanced datasets. PLoS One 10(3), 708 (2015)

57. Sathe, S., Aggarwal, C.: Subspace histograms for outlier detection in linear time. KAIS 56(3), 68 (2018)

58. Silva, J., Faria, E., Barros, R., Hruschka, E., de Carvalho, A., Gama, J.: Data stream clustering: a survey. ACM Comput. Surv. 46(1), 9114 (2013)

59. Somol, P., Grim, J., Filip, J., Pudil, P.: On stopping rules in dependency-aware feature ranking. In: CIARP (2013)

60. Tan, C., Ting, M., Liu, T.: Fast anomaly detection for streaming data. In: IJCAI (2011)

61. Tatbul, N., Lee, T.J., Zdonik, S., Alam, M., Gottschlich, J.: Precision and recall for time series. In: NIPS (2018)

62. Ting, K.M., Washio, T., Wells, J.R., Aryal, S.: Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors. Mach. Learn. 5, 55–91 (2017)

63. Tran, L., Fan, L., Shahabi, C.: Distance-based outlier detection in data streams. PVLDB 9(12), 96 (2016)

64. Tran, L., Mun, M., Shahabi, C.: Real-time distance-based outlier detection in data streams. PVLDB 14(2), 7006 (2020)

65. van Stein, B., van Leeuwen, M., Bäck, T.: Local subspace-based outlier detection using global neighbourhoods. CoRR, arxiv:1611.00183 (2016)

66. Vanschoren, J.: Meta-Learning, pp. 35–61 (2019)

67. Vanschoren, J., van Rijn, J., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. SIGKDD Explor. 15(2), 96 (2013)

68. Wang, H., Bah, J., Hammad, M.: Progress in outlier detection techniques: a survey. IEEE Access 7, 998 (2019)

69. Wu, R., Keogh, E.: Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. In: IEEE TKDE (2021)

70. Xia, S., Xiong, Z., Luo, Y., WeiXu, Z.G.: Effectiveness of the Euclidean distance in high dimensional spaces. Optik 4, 5614–5619 (2015)

71. Yang, J., Rahardja, S., Fränti, P.: Outlier detection: how to threshold outlier scores? In: AIIPCC (2019)

72. Yoon, S., Lee, J., Lee, B.: Ultrafast local outlier detection from a data stream with stationary region skipping. In: KDD (2020)

73. Yoon, S., Lee, J., Lee, B.S.: NETS: extremely fast outlier detection from a data stream via set-based processing. PVLDB 12(11), 998 (2019)

74. Zhang, E., Zhang, Y.I.: Average precision. In: Encyclopedia of Database Systems (2009)

75. Zhao, Y., Rossi, A., Akoglu, L.: Automating outlier detection via meta-learning. CoRR, arxiv:2009.10606 (2020)

76. Zimek, A., Filzmoser, P.: There and back again: outlier detection between statistical reasoning and data mining algorithms. Int. Rev. Data Min. Knowl. Discov. 8(6), 66 (2018)

77. Zimek, A., Gaudet, M., Campello, R., Sander, J.: Subsampling for efficient and effective unsupervised outlier detection ensembles. In: KDD (2013)

78. Zimek, A., Schubert, E., Kriegel, H.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5(5), 997 (2012)

Title: A meta-level analysis of online anomaly detectors
Authors: Antonios Ntroumpogiannis, Michail Giannoulis, Nikolaos Myrtakis, Vassilis Christophides, Eric Simon, Ioannis Tsamardinos
Publication date: 14-01-2023
Publisher: Springer Berlin Heidelberg
Published in: The VLDB Journal / Issue 4/2023
Print ISSN: 1066-8888
Electronic ISSN: 0949-877X
DOI: https://doi.org/10.1007/s00778-022-00773-x
