Homophily outlier detection in non-IID categorical data

Authors: Guansong Pang, Longbing Cao, Ling Chen

Published in: Data Mining and Knowledge Discovery | Issue 4/2021

Publication date: 01-04-2021

Abstract

Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications, where the outlierness of different entities depends on each other and/or is drawn from different probability distributions (non-IID). As a result, important outliers that are too subtle to identify without considering the non-IID nature may go undetected. The issue intensifies in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and two instances of it that identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines distribution-sensitive outlier factors and incorporates them, together with their interdependence, into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is used to capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10–28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection by two different existing detectors.
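The propagation idea described in the abstract can be illustrated with a small Python sketch. It is not the authors' algorithm. It assumes a simple frequency-based initial outlierness (`delta`), edge weights derived from value co-occurrence, a damped random walk with hypothetical parameters `alpha` and `iters`, and an object score equal to the sum of its values' scores. All function names are illustrative, and the data set is a toy example.

```python
from collections import Counter
from itertools import combinations


def value_outlierness(rows, alpha=0.95, iters=200):
    """Toy outlierness propagation over a value-value graph (illustrative only)."""
    d = len(rows[0])
    # Nodes are (feature index, value) pairs; freq holds their counts.
    freq = Counter((j, row[j]) for row in rows for j in range(d))
    nodes = list(freq)
    # Co-occurrence counts between values of different features.
    co = Counter()
    for row in rows:
        for a, b in combinations(range(d), 2):
            u, v = (a, row[a]), (b, row[b])
            co[(u, v)] += 1
            co[(v, u)] += 1
    # ASSUMPTION: initial outlierness grows as a value gets rarer than
    # the mode of its feature; values lie in [0.5, 1).
    mode = [max(c for (j, _), c in freq.items() if j == f) for f in range(d)]
    delta = {u: 1.0 - 0.5 * freq[u] / mode[u[0]] for u in nodes}
    # Edge u -> v is biased towards neighbours with high initial outlierness.
    weights = {u: {} for u in nodes}
    for (u, v), c in co.items():
        weights[u][v] = delta[v] * c / freq[v]
    # Damped power iteration; pi always sums to 1.
    pi = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - alpha) / len(nodes) for u in nodes}
        for u, nbrs in weights.items():
            total = sum(nbrs.values())
            for v, w in nbrs.items():
                nxt[v] += alpha * pi[u] * w / total
        pi = nxt
    return pi


def object_scores(rows, pi):
    """ASSUMPTION: an object's score is the sum of its values' scores."""
    return [sum(pi[(j, v)] for j, v in enumerate(row)) for row in rows]


# Toy data: eight common objects, one mixed object, one rare object.
rows = [("a", "x")] * 8 + [("a", "y"), ("b", "y")]
pi = value_outlierness(rows)
scores = object_scores(rows, pi)
```

In this toy data the rare values `b` and `y` co-occur, so the walk concentrates on them and the last object receives the highest score. The sketch needs at least two features and no missing values.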

Excessive thirst, weight loss and frequent urination are abnormal concurrent symptoms in diagnosing type-2 diabetes, according to https://www.diabetesaustralia.com.au/.

We have ignored features with \(freq(m)=1\), as those features contain no useful information relevant to outlier detection.
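A minimal Python sketch of this filtering step follows. It assumes that \(m\) denotes the mode of a feature and that \(freq(m)\) is its relative frequency, so \(freq(m)=1\) means every object takes the same value. This reading is an assumption, and the function name `drop_constant_features` is hypothetical.

```python
from collections import Counter


def drop_constant_features(rows):
    """Return the kept feature indices and the rows restricted to them.

    ASSUMPTION: a feature is dropped when its mode covers every object,
    i.e. the relative frequency of the mode equals 1.
    """
    n = len(rows)
    d = len(rows[0])
    keep = []
    for j in range(d):
        mode_count = max(Counter(row[j] for row in rows).values())
        if mode_count < n:  # mode frequency below 1, so the feature varies
            keep.append(j)
    return keep, [tuple(row[j] for j in keep) for row in rows]


# Toy data: the middle feature is constant and is removed.
rows = [("a", "k", "x"), ("b", "k", "x"), ("a", "k", "y")]
kept, filtered = drop_constant_features(rows)
```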

\(\delta\) is normalized into the range \((0, 1)\) so that it works well with \(\eta\).

The source code of the CBRW/SDRW-based outlier detection (or feature selection) algorithms is publicly available at https://sites.google.com/site/gspangsite/sourcecode.

Since CompreX was implemented in a different programming language from the other methods, its runtime is not directly comparable with theirs. Instead, we compare the methods in terms of runtime ratio, i.e., the runtime on a larger/higher-dimensional data set divided by that on a smaller/lower-dimensional data set, for a fairer comparison. Since the data size and the increasing factor of dimensionality are fixed, the runtime ratio is comparable across methods implemented in different programming languages.
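The runtime ratio can be sketched in a few lines of Python. The timings below are placeholder numbers, not measurements from the paper, and the method names are hypothetical. The example only shows that, if a language difference acts roughly as a constant multiplicative overhead, that overhead cancels in the ratio.

```python
def runtime_ratio(runtime_large: float, runtime_small: float) -> float:
    """Runtime on the larger data set divided by runtime on the smaller one."""
    if runtime_small <= 0:
        raise ValueError("runtime on the smaller data set must be positive")
    return runtime_large / runtime_small


# Placeholder timings in seconds as (smaller data set, larger data set).
# method_b is imagined to run with a constant 10x language overhead.
timings = {
    "method_a": (2.0, 8.0),
    "method_b": (20.0, 80.0),
}
ratios = {
    name: runtime_ratio(large, small)
    for name, (small, large) in timings.items()
}
print(ratios)
```

Both hypothetical methods scale by the same factor even though their absolute runtimes differ by an order of magnitude.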

Title: Homophily outlier detection in non-IID categorical data
Authors: Guansong Pang, Longbing Cao, Ling Chen
Publication date: 01-04-2021
Publisher: Springer US
Published in: Data Mining and Knowledge Discovery / Issue 4/2021
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI: https://doi.org/10.1007/s10618-021-00750-y
