Published in: Data Mining and Knowledge Discovery 1/2017

29.03.2016

Outlying property detection with numerical attributes

Authors: Fabrizio Angiulli, Fabio Fassetti, Giuseppe Manco, Luigi Palopoli

Abstract

The outlying property detection problem (OPDP) is the problem of discovering the properties that distinguish a given object, known in advance to be an outlier in a database, from the other database objects. This problem has recently been analyzed with a focus on categorical attributes only. However, numerical attributes are highly relevant and widely used in databases. Therefore, in this paper we analyze the OPDP in a setting where numerical attributes are also taken into account, a relevant case left open in the literature. As major contributions, we present an efficient parameter-free algorithm to compute the measure of object exceptionality we introduce, and we propose a unified framework for mining exceptional properties in the presence of both categorical and numerical attributes.

Footnotes
1
For the sake of simplicity and without loss of generality, we are assuming that an arbitrary ordering of the attributes in A has been fixed.
 
2
We point out that \(k_\theta\) has a twofold function: it lets the analyst control the complexity of the mined patterns, and it speeds up the execution of the algorithm. Setting \(k_\theta\) to m nevertheless allows the algorithm to detect explanations of any length, while the support threshold still prunes the search space and avoids overfitting.
 
3
In practice, if a tuple is detected as an outlier in a given iteration, it gets a positive score. Scores are then summarized in the combine function, and tuples are sorted according to the scores.
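The scoring scheme described in this footnote can be illustrated with a minimal Python sketch. The footnote does not say how the combine function summarizes scores, so summation is used here purely as an assumption; the names `combine` and `rank_tuples` and the input layout (one dictionary of positive scores per iteration) are likewise hypothetical and not taken from the paper.

```python
from collections import defaultdict


def combine(scores):
    # Assumption: summarize a tuple's per-iteration scores by summing them.
    # The footnote only states that scores are "summarized".
    return sum(scores)


def rank_tuples(iterations):
    """Rank tuples by their combined outlier score.

    iterations: list of dicts, one per iteration, mapping the id of each
    tuple detected as an outlier in that iteration to its positive score.
    Tuples that were not detected in an iteration are simply absent.
    """
    per_tuple = defaultdict(list)
    for flagged in iterations:
        for tuple_id, score in flagged.items():
            per_tuple[tuple_id].append(score)

    combined = {tid: combine(s) for tid, s in per_tuple.items()}
    # Sort tuples from highest to lowest combined score.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank_tuples([{"t1": 1.0, "t2": 0.5}, {"t1": 2.0}, {"t3": 1.0}])
```

Under the summation assumption, a tuple flagged in several iterations accumulates a larger score and therefore ranks higher than one flagged only once with a comparable score.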
 
4
Experiments were performed on an Intel Core i7 2.3 GHz based computer by using the Java programming language.
 
Metadata
Title
Outlying property detection with numerical attributes
Authors
Fabrizio Angiulli
Fabio Fassetti
Giuseppe Manco
Luigi Palopoli
Publication date
29.03.2016
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 1/2017
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-016-0458-x
