Published in: International Journal of Machine Learning and Cybernetics 4/2021

02-11-2020 | Original Article

usfAD: a robust anomaly detector based on unsupervised stochastic forest

Authors: Sunil Aryal, K.C. Santosh, Richard Dazeley

Abstract

In real-world applications, data can be represented using different units/scales. For example, weight can be expressed in kilograms or pounds, and fuel efficiency in km/l or l/100 km. One unit can be a linear or non-linear scaling of another. The variation in metrics due to non-linear scaling makes Anomaly Detection (AD) challenging. Most existing AD algorithms rely on distance- or density-based functions, which makes them sensitive to how data is expressed; that is, they are representation dependent. To avoid this problem, we introduce a new anomaly detection method, which we call ‘usfAD: Unsupervised Stochastic Forest-based Anomaly Detector’. Our empirical evaluation on synthetic and real-world cybersecurity (spam detection, malicious URL detection and intrusion detection) datasets shows that our approach is more robust to variation in the units/scales used to express data. It produces more consistent and better results than five state-of-the-art AD methods, namely local outlier factor, one-class support vector machine, isolation forest, nearest neighbor in a random subsample of data, and a simple histogram-based probabilistic method.
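The representation dependence described in the abstract can be illustrated with a small sketch. It is not the usfAD method. It is a plain nearest-neighbour distance score, written here only for illustration, applied to made-up fuel-efficiency values expressed first in km/l and then in l/100 km. The function name `nn_distance_scores` and the data are assumptions and do not come from the paper.

```python
import numpy as np

def nn_distance_scores(values):
    """Score each 1-D point by the distance to its nearest other point."""
    x = np.asarray(values, dtype=float)
    d = np.abs(x[:, None] - x[None, :])  # pairwise absolute differences
    np.fill_diagonal(d, np.inf)          # ignore each point's distance to itself
    return d.min(axis=1)

# Hypothetical fuel-efficiency readings for seven cars, in km/l.
km_per_l = np.array([5.0, 10.0, 11.0, 12.0, 13.0, 14.0, 30.0])
# The same cars in l/100 km: a non-linear (reciprocal) rescaling.
l_per_100km = 100.0 / km_per_l

# Index of the point with the highest anomaly score in each representation.
top_km = int(np.argmax(nn_distance_scores(km_per_l)))
top_l = int(np.argmax(nn_distance_scores(l_per_100km)))

# The ordering of the cars is unchanged (only reversed) by the rescaling.
same_order = np.array_equal(np.argsort(km_per_l), np.argsort(-l_per_100km))

print("top anomaly in km/l:", top_km)
print("top anomaly in l/100 km:", top_l)
print("ordering preserved:", same_order)
```

On this toy data the distance-based score flags the 30 km/l car when the data is in km/l, but flags the 5 km/l car when the same data is in l/100 km, although the relative ordering of the cars is identical in both representations.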

Metadata
Title
usfAD: a robust anomaly detector based on unsupervised stochastic forest
Authors
Sunil Aryal
K.C. Santosh
Richard Dazeley
Publication date
02-11-2020
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 4/2021
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-020-01225-0
