Skip to main content
Erschienen in:
Buchtitelbild

2017 | OriginalPaper | Buchkapitel

Concentration Free Outlier Detection

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

We present a novel notion of outlier, called Concentration Free Outlier Factor (CFOF), having the peculiarity to resist concentration phenomena that affect other scores when the dimensionality of the feature space increases. Indeed we formally prove that \(\hbox {CFOF}\) does not concentrate in intrinsically high-dimensional spaces. Moreover, \(\hbox {CFOF}\) is adaptive to different local density levels and it does not require the computation of exact neighbors in order to be reliably computed. We present a very efficient technique, named \({\textit{fast-}\hbox {CFOF}}\), for detecting outliers in very large high-dimensional datasets. The technique is efficiently parallelizable, and we provide a MIMD-SIMD implementation. Experimental results witness for scalability and effectiveness of the technique and highlight that \(\hbox {CFOF}\) exhibits state of the art detection performances.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
Notice that k (or k / n), representing a neighborhood width, denotes the output of \(\hbox {CFOF}\), while the other outlier definitions employ k as an input parameter. We warn the reader that, in order to make more intelligible the comparison of \(\hbox {CFOF}\) with other outlier techniques, sometimes we will refer to k as an input parameter (the use will be clear from the context). Moreover, in order to avoid confusion and to maintain analogy with the input parameter \(\varrho \), we also refer to \(\varrho \) as \(k_\varrho \).
 
2
It is generally recognized that this cost can be reduced to \(O(d n \log n)\) in low dimensional spaces.
 
3
Alternatively, by exploiting the Normal approximation of the Binomial distribution, a suitable value for \(k_{up}(x,y)\) is given by \(k_{up}(x,y) = n \widehat{p}(x,y) + c \sqrt{n \widehat{p}(x,y) (1-\widehat{p}(x,y))}\) with \(c\in [0,3]\).
 
Literatur
1.
Zurück zum Zitat Aggarwal, C.C., Yu, P.S.: Outlier detection for high dimensional data. In: Proceedings of the International Conference on Managment of Data (SIGMOD) (2001) Aggarwal, C.C., Yu, P.S.: Outlier detection for high dimensional data. In: Proceedings of the International Conference on Managment of Data (SIGMOD) (2001)
3.
Zurück zum Zitat Angiulli, F.: On the behavior of intrinsically high-dimensional spaces: distance distributions, direct and reverse nearest neighbors, and hubness. Manuscript submitted for publication to an international journal (2017). Available at the author’s site Angiulli, F.: On the behavior of intrinsically high-dimensional spaces: distance distributions, direct and reverse nearest neighbors, and hubness. Manuscript submitted for publication to an international journal (2017). Available at the author’s site
4.
Zurück zum Zitat Angiulli, F., Fassetti, F.: Dolphin: an efficient algorithm for mining distance-based outliers in very large datasets. ACM Trans. Knowl. Disc. Data, 3(1), Article 4 (2009) Angiulli, F., Fassetti, F.: Dolphin: an efficient algorithm for mining distance-based outliers in very large datasets. ACM Trans. Knowl. Disc. Data, 3(1), Article 4 (2009)
5.
Zurück zum Zitat Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets. IEEE Trans. Knowl. Data Eng. 2(17), 203–215 (2005)CrossRefMATH Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets. IEEE Trans. Knowl. Data Eng. 2(17), 203–215 (2005)CrossRefMATH
7.
Zurück zum Zitat Breunig, M.M., Kriegel, H., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the International Conference on Managment of Data (SIGMOD) (2000) Breunig, M.M., Kriegel, H., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the International Conference on Managment of Data (SIGMOD) (2000)
8.
Zurück zum Zitat Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 15:1–15:58 (2009)CrossRef Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 15:1–15:58 (2009)CrossRef
9.
Zurück zum Zitat Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection for discrete sequences: a survey. IEEE Trans. Knowl. Data Eng. 24(5), 823–839 (2012)CrossRef Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection for discrete sequences: a survey. IEEE Trans. Knowl. Data Eng. 24(5), 823–839 (2012)CrossRef
10.
Zurück zum Zitat Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (2001)CrossRef Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in metric spaces. ACM Comput. Surv. 33(3), 273–321 (2001)CrossRef
11.
Zurück zum Zitat Han, J., Kamber, M.: Data Mining, Concepts and Technique. Morgan Kaufmann, San Francisco (2001) Han, J., Kamber, M.: Data Mining, Concepts and Technique. Morgan Kaufmann, San Francisco (2001)
12.
Zurück zum Zitat Hautamaki, V., Karkkainen, I., Franti, P.: Outlier detection using k-nearest neighbour graph. In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 430–433 (2004) Hautamaki, V., Karkkainen, I., Franti, P.: Outlier detection using k-nearest neighbour graph. In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 430–433 (2004)
13.
Zurück zum Zitat Hodge, V., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22(2), 85–126 (2004)CrossRefMATH Hodge, V., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22(2), 85–126 (2004)CrossRefMATH
14.
15.
Zurück zum Zitat Jin, W., Tung, A.K.H., Han, J.: Mining top-n local outliers in large databases. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD) (2001) Jin, W., Tung, A.K.H., Han, J.: Mining top-n local outliers in large databases. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD) (2001)
16.
Zurück zum Zitat Knorr, E., Ng, R.: Algorithms for mining distance-based outliers in large datasets. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 392–403 (1998) Knorr, E., Ng, R.: Algorithms for mining distance-based outliers in large datasets. In: Proceedings of the International Conference on Very Large Databases (VLDB), pp. 392–403 (1998)
17.
Zurück zum Zitat Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A.: Interpreting and unifying outlier scores. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp. 13–24 (2011) Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A.: Interpreting and unifying outlier scores. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp. 13–24 (2011)
18.
Zurück zum Zitat Kriegel, H.-P., Schubert, M., Zimek, A.: Angle-based outlier detection in high-dimensional data. In: Proceedings of the International Confernce on Knowledge Discovery and Data Mining (KDD), pp. 444–452 (2008) Kriegel, H.-P., Schubert, M., Zimek, A.: Angle-based outlier detection in high-dimensional data. In: Proceedings of the International Confernce on Knowledge Discovery and Data Mining (KDD), pp. 444–452 (2008)
19.
Zurück zum Zitat Liu, F.T., Ting, K.M., Zhou, Z.-H.: Isolation-based anomaly detection. TKDD 6(1), 3:1–3:39 (2012) Liu, F.T., Ting, K.M., Zhou, Z.-H.: Isolation-based anomaly detection. TKDD 6(1), 3:1–3:39 (2012)
20.
Zurück zum Zitat Papadimitriou, S., Kitagawa, H., Gibbons, P.B., Faloutsos, C.: LOCI: fast outlier detection using the local correlation integral. In: Proceedings of the International Conference on Data Enginnering (ICDE), pp. 315–326 (2003) Papadimitriou, S., Kitagawa, H., Gibbons, P.B., Faloutsos, C.: LOCI: fast outlier detection using the local correlation integral. In: Proceedings of the International Conference on Data Enginnering (ICDE), pp. 315–326 (2003)
21.
Zurück zum Zitat Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Hubs in space: popular nearest neighbors in high-dimensional data. J. Mach. Learn. Res. 11, 2487–2531 (2010)MathSciNetMATH Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Hubs in space: popular nearest neighbors in high-dimensional data. J. Mach. Learn. Res. 11, 2487–2531 (2010)MathSciNetMATH
22.
Zurück zum Zitat Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Reverse nearest neighbors in unsupervised distance-based outlier detection. IEEE Trans. Knowl. Data Eng. 27(5), 1369–1382 (2015)CrossRef Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Reverse nearest neighbors in unsupervised distance-based outlier detection. IEEE Trans. Knowl. Data Eng. 27(5), 1369–1382 (2015)CrossRef
23.
Zurück zum Zitat Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proceedings of the International Conference on Management of Data (SIGMOD), pp. 427–438 (2000) Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proceedings of the International Conference on Management of Data (SIGMOD), pp. 427–438 (2000)
24.
25.
Zurück zum Zitat Zimek, A., Schubert, E., Kriegel, H.-P.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5(5), 363–387 (2012)MathSciNetCrossRef Zimek, A., Schubert, E., Kriegel, H.-P.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5(5), 363–387 (2012)MathSciNetCrossRef
Metadaten
Titel
Concentration Free Outlier Detection
verfasst von
Fabrizio Angiulli
Copyright-Jahr
2017
DOI
https://doi.org/10.1007/978-3-319-71249-9_1