survey

Anomaly Detection Methods for Categorical Data: A Review

Authors:
Ayman Taha

Faculty of Computers and Information, Cairo University, Giza, Egypt

Ali S. Hadi

American University in Cairo, Egypt, and Cornell University, Ithaca, NY, USA

ACM Computing Surveys, Volume 52, Issue 2, Article No. 38, pp. 1–35. https://doi.org/10.1145/3312739

Published: 30 May 2019

Abstract

Anomaly detection has numerous applications in diverse fields. For example, it has been widely used for discovering network intrusions and malicious events. It has also been used in numerous other applications such as identifying medical malpractice or credit fraud. Detection of anomalies in quantitative data has received considerable attention in the literature and has a venerable history. By contrast, and despite the widespread availability and use of categorical data in practice, anomaly detection in categorical data has received relatively little attention as compared to quantitative data. This is because detection of anomalies in categorical data is a challenging problem. Some anomaly detection techniques depend on identifying a representative pattern and then measuring distances between objects and this pattern. Objects that are far from this pattern are declared as anomalies. However, identifying patterns and measuring distances are not easy in categorical data compared with quantitative data. Fortunately, several papers focusing on the detection of anomalies in categorical data have been published in the recent literature. In this article, we provide a comprehensive review of the research on the anomaly detection problem in categorical data. Previous review articles focus on either the statistics literature or the machine learning and computer science literature. This review article combines both literatures. We review 36 methods for the detection of anomalies in categorical data in both literatures and classify them into 12 different categories based on the conceptual definition of anomalies they use. For each approach, we survey anomaly detection methods and then show the similarities and differences among them. We emphasize two important issues: the number of parameters each method requires and its time complexity. The first issue is critical, because the performance of these methods is sensitive to the choice of these parameters. The time complexity is also very important in real applications, especially big data applications. We report the time complexity if it is reported by the authors of the methods. If it is not, then we derive it ourselves and report it in this article. In addition, we discuss the common problems and future directions of anomaly detection in categorical data.
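The idea mentioned in the abstract, of identifying a representative pattern and then measuring how far each object lies from it, can be illustrated with a minimal sketch. This is not one of the 36 methods reviewed in the article. It assumes, purely for illustration, that the pattern is the per-attribute mode and that distance is the number of mismatching attributes; the function names and the toy records are hypothetical.

```python
from collections import Counter


def modal_pattern(records):
    """Return the most frequent value of each attribute (column).

    Assumption for this sketch: the 'representative pattern' is the
    vector of per-attribute modes. Other definitions are possible.
    """
    columns = zip(*records)
    return tuple(Counter(col).most_common(1)[0][0] for col in columns)


def mismatch_distance(record, pattern):
    """Count the attributes on which the record differs from the pattern."""
    return sum(1 for value, ref in zip(record, pattern) if value != ref)


def rank_by_distance(records):
    """Return (distance, index) pairs, farthest from the pattern first."""
    pattern = modal_pattern(records)
    scored = [(mismatch_distance(r, pattern), i) for i, r in enumerate(records)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))


# Hypothetical toy data: (protocol, service, status)
records = [
    ("tcp", "http", "ok"),
    ("tcp", "http", "ok"),
    ("tcp", "ftp", "ok"),
    ("tcp", "http", "ok"),
    ("udp", "dns", "reject"),
]

pattern = modal_pattern(records)
ranking = rank_by_distance(records)
print("pattern:", pattern)
for distance, index in ranking:
    print(index, records[index], "distance =", distance)
```

In this toy example the last record differs from the modal pattern on every attribute and is ranked first. The sketch makes one pass to count values and one pass to score, so its cost grows linearly with the number of records times the number of attributes. Turning the ranking into a decision still requires choosing a cutoff, which is the kind of parameter sensitivity the abstract highlights.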

References

Abror Abduvaliyev, Al-Sakib Khan Pathan, Jianying Zhou, Rodrigo Roman, and Wai-Choong Wong. 2013. On the vital areas of intrusion detection systems in wireless sensor networks. IEEE Commun. Surveys Tutor. 15, 3 (2013), 1223--1237.Google ScholarCross Ref
Hala Abukhalaf, Jianxin Wang, and Shigeng Zhang. 2015. Outlier detection techniques for localization in wireless sensor networks: A survey. Int. J. Future Gen. Commun. Netw. 8, 6 (2015), 99--114.Google Scholar
Charu C. Aggarwal. 2017. Outlier Analysis, 2nd ed. Springer, Cham. Google ScholarDigital Library
Charu C. Aggarwal and Philip S. Yu. 2001. Outlier detection for high dimensional data. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’01). 37--46. Google ScholarDigital Library
Charu C. Aggarwal, Yuchen Zhao, and Philip S. Yu. 2011. Outlier detection in graph streams. In Proceedings of the ACM IEEE International Conference on Data Engineering (ICDE’11). 399--409. Google ScholarDigital Library
Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of International Conference on Very Large Data Bases (VLDB’94). 487--499. Google ScholarDigital Library
A. Agresti. 2010. Analysis of Ordinal Categorical Data (2nd ed.). John Wiley 8 Sons, New York, NY.Google Scholar
A. Agresti. 2013. Categorical Data Analysis (3rd ed.). John Wiley 8 Sons, New York, NY.Google Scholar
Malik Agyemang, Ken Barker, and Rada Alhajj. 2006. A comprehensive survey of numeric and symbolic outlier mining techniques. Intell. Data Anal. 10(6) (2006), 521--538. Google ScholarDigital Library
Mohiuddin Ahmed, Abdun Naser Mahmood, and Jiankun Hu. 2016. A survey of network anomaly detection techniques. Netw. Comput. Appl. 60 (2016), 19--31. Google ScholarDigital Library
Mohiuddin Ahmed, Abdun Naser Mahmood, and Md. Rafiqul Islam. 2016. A survey of anomaly detection techniques in financial domain. Future Gen. Comput. Syst. 55 (2016), 278--288. Google ScholarDigital Library
P. Ajitha and E. Chandra. 2015. A survey on outliers detection in distributed data mining for big data. J. Basic Appl. Sci. Res. 5, 2 (2015), 31--38.Google Scholar
Leman Akoglu, Mary Mcglohon, and Christos Faloutsos. 2010. OddBall: Spotting anomalies in weighted graphs. In Proceedings of the Pacific Asia Knowledge Discovery and Data Mining (PAKDD’10). 420--431. Google ScholarDigital Library
Leman Akoglu, Hanghang Tong, and Danai Koutra. 2015. Graph-based anomaly detection and description: A survey. Data Min. Knowl. Discov. 29, 3 (2015), 626--688. Google ScholarDigital Library
Leman Akoglu, Hanghang Tong, Jilles Vreeken, and Christos Faloutsos. 2012. Fast and reliable anomaly detection in categorical data. In Proceedings of the ACM International Conference on Information and Knowledge Management, (CIKM’12). 415--424. Google ScholarDigital Library
Fabrizio Angiulli, Stefano Basta, and Clara Pizzuti. 2006. Distance-based detection and prediction of outliers. IEEE Trans. Knowl. Data Eng. 18(2) (2006), 145--160. Google ScholarDigital Library
Fabrizio Angiulli and Fabio Fassetti. 2002. Fast outlier detection in high dimensional spaces. In Proceedings of the European Conference on the Principles of Data Mining and Knowledge Discovery. 19--26. Google ScholarDigital Library
Yagnik N. Ankur and Ajay Shanker Singh. 2014. Oulier analysis using frequent pattern mining: A review. Int. J. Comput. Sci. Info. Technol. 5, 1 (2014), 47--50.Google Scholar
N. Archana and S. S. Pawar. 2014. Survey on outlier pattern detection techniques for time-series data. Int. J. Sc. Res. 1, 1 (2014), 1852--1856.Google Scholar
Tony Bailetti, Mahmoud Gad, and Ahmed Shah. 2016. Intrusion learning: An overview of an emergent discipline. Technol. Innovat. Manage. Rev. 6, 2 (2016), 15--20.Google ScholarCross Ref
U. A. B. U. A. Bakar, Hemant Ghayvat, S. F. Hasanm, and S. C. Mukhopadhyay. 2016. Activity and anomaly detection in smart home: A survey. In Next Generation Sensors and Systems, Subhas Chandra Mukhopadhyay (Ed.). Springer, New York, NY, Chapter 9, 191--220.Google Scholar
Zuriana Abu Bakar, Rosmayati Mohemad, Akbar Ahmad, and Mustafa Mat Deris. 2006. A comparative study for outlier detection techniques in data mining. In Proceedings of IEEE International Conference on Cybernetics and Intelligent Systems. 1--6.Google ScholarCross Ref
V. Barnett and T. Lewis. 1994. Outliers in Statistical Data (3rd ed.). John Wiley 8 Sons, New York, NY.Google Scholar
S. Bay and M. Schwabacher. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, SIGKDD. 29--38. Google ScholarDigital Library
Eric J. Beh. 2008. Simple correspondence analysis of nominal-ordinal contingency tables.J. Appl. Math. Decis. Sci. 228 (2008), 1--17.Google ScholarCross Ref
Alka P. Beldar and Vinod S. Wadne. 2015. The detail survey of anomaly/outlier detection methods in data mining. Int. J. Multidisc. Curr. Res. 3 (2015), 462--472.Google Scholar
Clauber Gomes Bezerra, Bruno Sielly Jales Costa, Luiz Affonso Guedes, and Plamen Parvanov Angelov. 2015. A comparative study of autonomous learning outlier detection methods applied to fault detection. In Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE’15). 1--7.Google ScholarDigital Library
Kanishka Bhaduri, Bryan L. Matthews, and Chris R. Giannella. 2011. Algorithms for speeding up distance-based outlier detection. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, (SIGKDD’11). 895--867. Google ScholarDigital Library
Umale Bhagyashree and M. Nilav. 2014. Overview of k-means and expectation maximization algorithm for document clustering. In Proceedings of the International Conference on Quality Up-gradation in Engineering, Science and Technology (ICQUEST’14). 5--8.Google Scholar
N. Billor, Ali S. Hadi, and P. Velleman. 2000. Blocked adaptive computationally-efficient outlier nominators. Comput. Stat. Data Anal. 34 (2000), 279--298. Google ScholarDigital Library
Christian Böhm, Katrin Haegler, Nikola S Müller, and Claudia Plant. 2009. CoCo: Coding cost for parameter-free outlier detection. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, (SIGKDD’09). 149--158. Google ScholarDigital Library
Shyam Boriah, Varun Chandola, and Vipin Kumar. 2008. Similarity measures for categorical data: A comparative evaluation. In Proceedings of the International SIAM Data Mining Conference (SDM’08). 243--254.Google ScholarCross Ref
Mohamed Bouguessa. 2014. A mixture model-based combination approach for outlier detection. Int. J. Artific. Intell. Tools 23, 4 (2014), 1--21.Google Scholar
Mohamed Bouguessa. 2015. A practical outlier detection approach for mixed-attribute data. Expert Syst. Appl. 42 (2015), 8637--8649. Google ScholarDigital Library
M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander. 2000. LOF: Identifying density--based local outliers. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’00). 93--104. Google ScholarDigital Library
Guilherme O Campos, Arthur Zimek, Jörg Sander, Ricardo JGB Campello, Barbora Micenková, Erich Schubert, Ira Assent, and Michael E Houle. 2016. On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study. Data Min. Knowl. Discov. 30, 4 (2016), 891--927. Google ScholarDigital Library
E. Castillo, J. M. Gutiérrez, and A. S. Hadi. 1997. Expert Systems and Probabilistic Network Models. Springer-Verlag, New York, NY. Google ScholarDigital Library
V. Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. ACM Comput. Surveys 41(3) (2009), 1--58. Google ScholarDigital Library
V. Chandola, A. Banerjee, and V. Kumar. 2012. Anomaly detection for discrete sequences: A survey. Trans. Knowl. Data Eng. 24(5) (2012), 823--839. Google ScholarDigital Library
V. Chandola, S. Boriah, and V. Kumar. 2008. Understanding Categorical Similarity Measures for Outlier Detection. Technical Report. University of Minnesota, Department of Computer Science and Engineering, 1-46.Google Scholar
V. Chandola, S. Boriah, and V. Kumar. 2009. A framework for exploring categorical data. In Proceedings of the International SIAM Data Mining Conference (SDM’09). 187--198.Google Scholar
S. Chatterjee and Ali S. Hadi. 1986. Influential observations, high leverage points, and outliers in regression. Stat. Sci. 1 (1986), 379--416.Google ScholarCross Ref
S. Chatterjee and Ali S. Hadi. 1988. Sensitivity Analysis in Linear Regression. John Wiley 8 Sons, New York, NY. Google ScholarDigital Library
Sanjay Chawla and Pei Sun. 2006. SLOM: A new measure for local spatial outliers. Knowl. Info. Syst. 9 (2006), 412--429.Google ScholarDigital Library
Haibin Cheng, Pang-Ning Tan, Christopher Potter, and Steven A. Klooster. 2009. Detection and characterization of anomalies in multivariate time series. In Proceedings of the SIAM International Conference on Data Mining (SDM’09). 413--424.Google Scholar
HyungJun Cho and Soo-Heang Eo. 2016. Outlier detection for mass spectrometric data. In Statistical Analysis in Proteomics, Klaus Jung (Ed.). Springer, New York, NY, Chapter 5, 91--102.Google Scholar
Gregory F. Cooper. 1990. The computational complexity of probabilistic inference using Bayesian belief networks. Artific. Intell. 42 (1990), 393--405. Google ScholarDigital Library
Denis Cousineau and Sylvain Chartier. 2015. Outliers detection and treatment: A review. Int. J. Psychol. Res. 3, 1 (2015), 58--67.Google ScholarCross Ref
J. Vijay Daniel, S. Joshna, and P. Manjula. 2013. A survey of various intrusion detection techniques in wireless sensor networks. Int. J. Comput. Sci. Mobile Comput. 2, 9 (2013), 235--246.Google Scholar
K. Das and J. Schneider. 2007. Detecting anomalous records in categorical datasets. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’07). 220--229. Google ScholarDigital Library
K. Das, J. Schneider, and D. B. Neill. 2008. Anomaly pattern detection in categorical datasets. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’08). 169--176. Google ScholarDigital Library
Dhwani Dave and Tanvi Varma. 2014. A review of various statistical methods for outlier detection. Int. J. Comput. Sci. Eng. Technol. 5, 2 (2014), 137--140.Google Scholar
Herv Debar, Marc Dacier, and Andreas Wespi. 1999. Towards a taxonomy of intrusion-detection systems. Comput. Netw. 31, 9 (1999), 805--822. Google ScholarDigital Library
Alfonso Iodice D’Enza and Michael Greenacre. 2012. Multiple correspondence analysis for the quantification and visualization of large categorical data sets. In Advanced Statistical Methods for the Analysis of Large Data-Sets, Agostino Di Ciaccio, Mauro Coli, and Jose Miguel Angulo Ibañez (Eds.). Springer, 453--463.Google Scholar
Mr. Mukesh K. Deshmukh and A. S. Kapse. 2016. A survey on outlier detection technique in streaming data using data clustering approach. Int. J. Engineering and Computer Science 5, 1 (2016), 15453--15456.Google Scholar
Christian Desrosiers and George Karypis. 2011. A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook. Springer-Verlag New York, NY, 107--144.Google Scholar
R. Lakshmi Devi and R. Amalraj. 2015. Hubness in unsupervised outlier detection techniques for high dimensional data--A survey. Int. J. Comput. Appl. Technol. Res. 4, 11 (2015), 797--801.Google Scholar
Jiten Harishbhai Dhimmar and Raksha Chauhan. 2014. A survey on profile-injection attacks in recommender systems using outlier analysis. Int. J. Adv. Res. Comput. Sci. Manage. Studies 2, 12 (2014), 356--359.Google Scholar
Xuemei Ding, Yuhua Li, Ammar Belatreche, and Liam P. Maguire. 2014. An experimental evaluation of novelty detection methods. Neurocomputing 135 (2014), 313--327. Google ScholarDigital Library
K. T. Divya and N. S. Kumaran. 2016. Survey on outlier detection techniques using categorical data. Int. Res. J. Eng. Technol. 3 (2016), 899--904.Google Scholar
Paul Dokas, Levent Ertoz, Vipin Kumar, Aleksandar Lazarevic, Jaideep Srivastava, and Pang-Ning Tan. 2002. Data mining for network intrusion detection. In Proceedings of the NSF Workshop on Next Generation Data Mining. 21--30.Google Scholar
Jin Du, Qinghua Zheng, Haifei Li, and Wenbin Yuan. 2005. The research of mining association rules between personality and behavior of learner under web-based learning environment. In Proceedings of the the International Conference on Advances in Web-Based Learning (ICWL’05). 15--26. Google ScholarDigital Library
David Ebdon. 1991. Statistics in Geography: A Practical Approach-Revised with 17 Programs. Wiley-Blackwell, Hoboken, NJ.Google Scholar
Syed Masum Emran and Nong Ye. 2001. Robustness of Canberra metric in computer intrusion detection. In Proceedings of the IEEE Workshop on Information Assurance and Security. New York, NY, 80--84.Google Scholar
Hadi Fanaee-T and João Gama. 2016. Tensor-based anomaly detection: An interdisciplinary survey. Knowl-Based Syst. 98 (2016), 130--147. Google ScholarDigital Library
Elaine R. Faria, Isabel J. C. R. Goncalves, A. C. P. L. F. de Carvalho, and J. Gama. 2015. Novelty detection in data streams. Artific. Intell. Rev. 45, 2 (2015), 235--269. Google ScholarDigital Library
E. W. Forgy. 1965. Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics 21 (1965), 768--780.Google Scholar
A. Frank and A. Asuncion. 2018. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml/datasets.html.Google Scholar
Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, and Jiawei Han. 2010. On community outliers and their efficient detection in information networks. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’10). 813--822. Google ScholarDigital Library
Pedro Garcia-Teodoro, J. Diaz-Verdejo, Gabriel Maciá-Fernández, and Enrique Vázquez. 2009. Anomaly-based network intrusion detection: Techniques, systems and challenges. Comput. Security 28, 1 (2009), 18--28. Google ScholarDigital Library
Yong Ge, Hui Xiong, Zhi-Hua Zhou, Hasan Ozdemir, Jannite Yu, and K. C. Lee. 2010. TOP-EYE: Top-k evolving trajectory outlier detection. In Proceedings of the ACM Conference on Information and Knowledge Management, (CIKM’10). 1--4. Google ScholarDigital Library
Dhiren Ghosh and Andrew Vogt. 2012. Outliers: An evaluation of methodologies. In Proceedings of the Joint Statistical Meetings. American Statistical Association, 3455--3460.Google Scholar
A. Ghoting, M. E. Otey, and S. Parthasarathy. 2004. Loaded: Link-based outlier and anomaly detection in evolving data sets. In Proceedings of the IEEE International Conference on Data Mining (ICDM’04). 387--390. Google ScholarDigital Library
Amol Ghoting, Srinivasan Parthasarathy, and Matthew Eric Otey. 2008. Fast mining of distance-based outliers in high dimensional datasets. Data Min. Knowl. Discov. J. 16(3) (2008), 349--364. Google ScholarDigital Library
Prasanta Gogoi, D. K. Bhattacharyya, Bhogeswar Borah, and Jugal K. Kalita. 2011. A survey of outlier detection methods in network anomaly identification. Comput. J. 54, 4 (2011), 570--588. Google ScholarDigital Library
Gene H. Golub and Charles F. van Loan. 2012. Matrix Computations, 3rd ed. John Hopkins University Press. Google ScholarDigital Library
Geoffrey Grimmett and David Stirzaker. 2001. Probability and Random Processes, 3rd ed. Oxford University Press, Oxford, UK.Google Scholar
V. Gunamani and M. Abarna. 2013. A survey on intrusion detection using outlier detection techniques. Int. J. Sci. Eng. Technol. Res. 2, 11 (2013), 2063 --2068.Google Scholar
Manish Gupta, Jing Gao, Charu C. Aggarwal, and Jiawei Han. 2014. Outlier detection for temporal data. Synth. Lect. Data Min. Knowl. Discov. 5, 1 (2014), 1--129.Google ScholarDigital Library
Manish Gupta, Jing Gao, Charu C. Aggarwal, and Jiawei Han. 2014. Outlier detection for temporal data: A survey. IEEE Trans. Knowl. Data Eng. 26, 9 (2014), 2250--2267.Google ScholarCross Ref
Ali S. Hadi. 1992. Identifying multiple outliers in multivariate data. J. Roy. Stat. Soc., Ser. B 54 (1992), 761--771.Google Scholar
Ali S. Hadi. 1992. A new measure of overall potential influence in linear regression. Comput. Stat. Data Anal. 14 (1992), 1--27. Google ScholarDigital Library
Ali S. Hadi. 1994. A modification of a method for the detection of outliers in multivariate samples. J. Roy. Stat. Soc., Ser. B 56 (1994), 393--396.Google Scholar
Ali S. Hadi, A. H. M. Rahmatullah Imon, and Mark Werner. 2009. Detection of outliers. Wiley Interdisc. Rev.: Comput. Stat. 1 (2009), 57--70.Google ScholarDigital Library
Ali S. Hadi and J. S. Simonoff. 1993. Procedure for the identification of outliers in linear models. J. Amer. Stat. Assoc. 88 (1993), 1264--1272.Google ScholarCross Ref
Xiaojuan Han, Yong Yan, Cheng Cheng, Yueyan Chen, and Yanglin Zhu. 2014. Monitoring of oxygen content in the flue gas at a coal-fired power plant using cloud modeling techniques. IEEE Trans. Instrument. Measure. 63, 4 (2014), 953--963.Google ScholarCross Ref
Z. He, X. Xu, and S. Deng. 2005. An optimization model for outlier detection in categorical data. In Proceedings of the International Conference on Advances in Intelligent Computing. 400--409. Google ScholarDigital Library
Z. He, X. Xu, and S. Deng. 2006. A fast greedy algorithm for outlier mining. In Proceedings of the Pacific Asia Knowledge Discovery and Data Mining (PAKDD’06). Singapore, 567--576. Google ScholarDigital Library
Z. He, X. Xu, J. Z. Huang, and S. Deng. 2005. FP-outlier: Frequent pattern based outlier detection. Comput. Sci. Info. Syst. 2 (2005), 726--732.Google Scholar
S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. 2011. Statistical outlier detection using direct density ratio estimation. Knowl. Info. Syst. 26, 2 (2011), 309--336.Google ScholarDigital Library
V. J Hodge and J. Austin. 2004. A survey of outlier detection methodologies. Artific. Intell. Rev. 22 (2004), 85--126. Google ScholarDigital Library
Zhexue Huang. 1997. A fast clustering algorithm to cluster very large categorical data sets in data mining. In Proceedings of the International Data Mining and Knowledge Discovery (DMKM’97), Workshop at the ACM International Conference on Mangagement of Data (SIGKDD). 1--8.Google Scholar
Z. Huang and M. K. Ng. 1999. A fuzzy k-modes algorithm for clustering categoircal data. IEEE Trans. Fuzzy Syst. 7 (1999), 446--452. Google ScholarDigital Library
Dino Ienco, Ruggero G. Pensa, and Rosa Meo. 2012. From context to distance: Learning dissimilarity for categorical data clustering. ACM Trans. Knowl. Discov. Data 6, 1 (2012), 1--12. Google ScholarDigital Library
Dino Ienco, Ruggero G. Pensa, and Rosa Meo. 2017. A semisupervised approach to the detection and characterization of outliers in categorical data. IEEE Trans. Neural Netw. Learn. 28, 5 (2017), 1017--1029.Google ScholarCross Ref
Francesca Ieva and Anna Maria Paganoni. 2015. Detecting and visualizing outliers in provider profiling via funnel plots and mixed effect models. Health Care Manage. Sci. 18, 2 (2015), 166--172.Google Scholar
ShengYi Jiang, Xiaoyu Song, Hui Wang, Jian-Jun Han, and Qing-Hua Li. 2006. A clustering-based method for unsupervised intrusion detections. Pattern Recogn. Lett. 27 (2006), 802--810. Google ScholarDigital Library
Vineet Joshi and Raj Bhatnagar. 2014. CBOF: Cohesiveness-based outlier factor a novel definition of outlier-ness. In Proceedings of the International Workshop on Machine Learning and Data Mining in Pattern Recognition (MLDM’14). 175--189.Google ScholarCross Ref
Hossein Joudaki, Arash Rashidian, Behrouz Minaei-Bidgoli, Mahmood Mahmoodi, Bijan Geraili, Mahdi Nasiri, and Mohammad Arab. 2015. Using data mining to detect health care fraud and abuse: A review of literature. Global J. Health Sci. 7, 1 (2015), 194--202.Google Scholar
Leonid Kalinichenko, Ivan Shanin, and Ilia Taraban. 2014. Methods for anomaly detection: A survey. In Proceedings of the All-Russian Conference Digital Libraries: Advanced Methods and Technologies, Digital Collections (RCDL’14). 20--25.Google Scholar
V. Kathiresan and N. A. Vasanthi. 2015. A survey on outlier detection techniques useful for financial card fraud detection. Int. J. Innovat. Eng. Technol. 6, 1 (2015), 226--235.Google Scholar
Ravneet Kaur and Sarbjeet Singh. 2015. A survey of data mining and social network analysis based anomaly detection techniques. Egypt. Info. J. 39 (2015), 1--18.Google Scholar
E. M. Knorr, R. T. Ng, and V. Tucakov. 2000. Distance-based outliers: Algorithms and applications. VLDB J. 8 (2000), 237--253. Google ScholarDigital Library
Edwin M. Knorr and Raymond T. Ng. 1997. A unified approach for mining outliers. In Proceedings of the International Conference of the Centre for Advanced Studies on Collaborative Research (CASCON’97). 236--248. Google ScholarDigital Library
A. Koufakou, M. Georgiopoulos, and G. Anagnostopoulos. 2008. Detecting outliers in high-dimensional datasets with mixed attributes. In Proceedings of the International Conference on Data Mining (DMIN’08).Google Scholar
A. Koufakou, E. Ortiz, M. Georgiopoulos, G. Anagnostopoulos, and K. Reynolds. 2007. A scalable and efficient outlier detection strategy for categorical data. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence (ICTAI’07). 210--217. Google ScholarDigital Library
Anna Koufakou, Jimmy Secretan, and Michael Georgiopoulos. 2011. Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data. Knowl. Info. Syst. 29, 3 (2011), 697--725. Google ScholarDigital Library
Aleksandar Lazarevic, Levent Ertöz, Vipin Kumar, Aysel Ozgur, and Jaideep Srivastava. 2003. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the SIAM International Conference on Data Mining (SDM’03). 25--36.Google ScholarCross Ref
Dajiang Lei, Liping Zhang, and Lisheng Zhang. 2013. Cloud model-based outlier detect algorithm for categorical data. Int. J. Database Theory Appl. 6, 14 (2013), 199--213.Google Scholar
Deyi Li. 2000. Uncertainty in knowledge representation. Chinese Eng. Sci. 2, 10 (2000), 73--79.Google Scholar
Jingchao Li and Jian Guo. 2015. A new feature extraction algorithm based on entropy cloud characteristics of communication signals. Math. Problems Eng. 2015 (2015), 1--8.Google Scholar
Junli Li, Jifu Zhang, Ning Pang, and Xiao Qin. 2018. Weighted outlier detection of high-dimensional categorical data using feature grouping. IEEE Trans. Syst. Man Cybernet.: Syst. (2018), 1--14.Google Scholar
Shuxin Li, Robert Lee, and Sheau-Dong Lang. 2007. Mining distance-based outliers from categorical data. In Proceedings of the IEEE International Conference on Data Mining Workshops (ICDM’07). 225--230. Google ScholarDigital Library
J. Y. Liang, K. S. Chin, and C. Y. Dang. 2002. A new method for measuring uncertainty and fuzziness in rough set theory. Int. J. Gen. Syst. 31 (2002), 331--342.Google ScholarCross Ref
Song Lin and Donald E. Brown. 2006. An outlier-based data association method for linking criminal incidents. Decis. Support Syst. 41 (2006), 604--615. Google ScholarDigital Library
Wei Liu, Yu Zheng, Sanjay Chawla, Jing Yuan, and Xing Xie. 2011. Discovering spatio-temporal causal interactions in traffic data streams. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’11). 1010--1018. Google ScholarDigital Library
Xutong Liu, Feng Chen, and Chang-Tien Lu. 2014. On detecting spatial categorical outliers. GeoInformatica 18, 3 (2014), 501--536. Google ScholarDigital Library
Arunanshu Mahapatro and Pabitra Mohan Khilar. 2013. Fault diagnosis in wireless sensor networks: A survey. IEEE Commun. Surveys Tutor. 15, 4 (2013), 2000--2026.Google ScholarCross Ref
Kamal Malik, H. Sadawarti, and G. S. Kalra. 2014. Comparative analysis of outlier detection techniques. Int. J. Comput. Appl. 97, 8 (2014), 12--21.Google Scholar
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK. Google Scholar
José Marinho, Jorge Granjal, and Edmundo Monteiro. 2015. A survey on security attacks and countermeasures with primary user detection in cognitive radio networks. EURASIP J. Info. Secur. 2015, 1 (2015), 1--14.Google ScholarCross Ref
Markos Markou and Sameer Singh. 2003. Novelty detection: A review-part 1: Statistical approaches. Signal Process. 83 (2003), 2481--2497. Google ScholarDigital Library
Markos Markou and Sameer Singh. 2003. Novelty detection: A review-part 2: Neural network based approaches. Signal Process. 83 (2003), 2499--2521. Google ScholarDigital Library
Manoj Mishra and Nitesh Gupta. 2015. To detect outlier for categorical data streaming. Int. J. Sci. Eng. Res. 6, 5 (2015), 1--5.Google Scholar
Andrew Moore, Mary Soon Lee, and Brigham Anderson. 1998. Cached sufficient statistics for efficient machine learning with large datasets. J. Artific. Intell. Res. 8 (1998), 67--91. Google ScholarDigital Library
Andrew Moore and W. K. Wong. 2003. Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. In Proceedings of the 20th International Conference on Machine Learning. 552--559. Google ScholarDigital Library
Kazuyo Narita and Hiroyuki Kitagawa. 2008. Detecting outliers in categorical record databases based on attribute associations. In Progress in WWW Research and Development. Springer, Berlin, 111--123. Google ScholarDigital Library
K. Noto, C. Brodley, and D. Slonim. 2010. Anomaly detection using an ensemble of feature models. In Proceedings of the IEEE International Conference on Data Mining (ICDM’10). 953--958. Google ScholarDigital Library
K. Noto, C. Brodley, and D. Slonim. 2012. FRaC: A feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Min. Knowl. Discov. 25, 1 (2012), 109--133. Google ScholarDigital Library
Colin O’Reilly, Alexander Gluhak, Muhammad Ali Imran, and Sutharshan Rajasegarar. 2014. Anomaly detection in wireless sensor networks in a non-stationary environment. IEEE Commun. Surveys Tutor. 16, 3 (2014), 1413--1432.Google ScholarCross Ref
M. E. Otey, A. Ghoting, and S. Parthasarathy. 2006. Fast distributed outlier detection in mixed-attribute data sets. Data Min. Knowl. Discov. 12, 2--3 (May 2006), 203--228. Google ScholarDigital Library
Matthew Eric Otey, Srinivasan Parthasarathy, and Amol Ghoting. 2005. An empirical comparison of outlier detection algorithms. In Proceedings of the International Workshop on Data Mining Methods for Anomaly Detection at ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’05). 1--8.Google Scholar
Guansong Pang, Longbing Cao, and Ling Chen. 2016. Outlier detection in complex categorical data by modeling the feature value couplings. In Proceedings of the 25th International Joint Conference on Artificial Intelligence. 1902--1908. Google ScholarDigital Library
Animesh Patcha and Jung-Min Park. 2007. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Comput. Netw. 51(12) (2007), 3448--3470. Google ScholarDigital Library
M. S. Pawar, D. Amruta, and S. N. Tambe. 2014. A survey on outlier detection techniques for credit card fraud detection. IOSR J. Comput. Eng. 16, 2 (2014), 44--48.Google ScholarCross Ref
Zdzisław Pawlak. 1982. Rough sets. Int. J. Comput. Info. Sci. 11, 5 (1982), 341--356.Google ScholarCross Ref
C. Phua, D. Alahakoon, and V. Lee. 2004. Minority report in fraud detection: Classification of skewed data. ACM SIGKDD Explor. Newslett. 6, 1 (2004), 50--59. Google ScholarDigital Library
Clifton Phua, Vincent C. S. Lee, Kate Smith-Miles, and Ross W. Gayler. 2010. A comprehensive survey of data mining-based fraud detection research. Retrieved from http://arxiv.org/abs/1009.6119.Google Scholar
Marco AF Pimentel, David A Clifton, Lei Clifton, and Lionel Tarassenko. 2014. A review of novelty detection. Signal Process. 99 (2014), 215--249. Google ScholarDigital Library
Srijoni Saha Pradip, Jesica Fernandes Robert, and Jasmine Faujdar Hamza. 2015. Information-theoretic outlier detection for large-scale categorical data. Int. J. Comput. Sci. Mobile Comput. 4, 4 (2015), 873--881.Google Scholar
Raghav M. Purankar and Pragati Patil. 2015. A survey paper on an effective analytical approaches for detecting outlier in continuous time variant data stream. Int. J. Eng. Comput. Sci. 4, 11 (2015), 14946--14949.Google Scholar
Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. 2000. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM International Conference on Management of Data (SIGMOD’00). 427--438. Google ScholarDigital Library
Stephen Ranshous, Shitian Shen, Danai Koutra, Steve Harenberg, Christos Faloutsos, and Nagiza F. Samatova. 2015. Anomaly detection in dynamic networks: A survey. Wiley Interdisc. Rev.: Comput. Stat. 7, 3 (2015), 223--247.
Lida Rashidi, Sattar Hashemi, and Ali Hamzeh. 2011. Anomaly detection in categorical datasets using Bayesian networks. In Proceedings of the 3rd International Conference on Artificial Intelligence and Computational Intelligence, Part II (AICI’11). 610--619.
Murad A. Rassam, M. A. Maarof, and Anazida Zainal. 2012. A survey of intrusion detection schemes in wireless sensor networks. Amer. J. Appl. Sci. 9, 10 (2012), 1636--1652.
Murad A. Rassam, Anazida Zainal, and Mohd Aizaini Maarof. 2013. Advancements of data anomaly detection research in wireless sensor networks: A survey and open issues. Sensors 13, 8 (2013), 10087--10122.
D. Lakshmi Sreenivasa Reddy, B. Raveendra Babu, and A. Govardhan. 2013. Outlier analysis of categorical data using navf. Informat. Econom. 17, 1 (2013), 1--5.
Abdolazim Rezaei, Zarinah M. Kasirun, Vala Ali Rohani, and Touraj Khodadadi. 2013. Anomaly detection in online social networks using structure-based technique. In Proceedings of the International Conference for Internet Technology and Secured Transactions (ICITST’13). 619--622.
Ritika, Tarun Kumar, and Amandeep Kaur. 2013. Outlier detection in WSN: A survey. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 3, 7 (2013), 609--617.
N. Rokhman, Subanar, and E. Winarko. 2016. Improving the performance of outlier detection methods for categorical data by using weighting function. J. Theor. Appl. Info. Technol. 83 (2016), 327--336.
Peter J. Rousseeuw and Katrien Van Driessen. 1998. A fast algorithm for the minimum covariance determinant estimator. Technometrics 41 (1998), 212--223.
Ashwini G. Sagade and Ritesh Thakur. 2014. Excess entropy based outlier detection in categorical data set. Int. J. Adv. Comput. Eng. Netw. 2, 8 (2014), 56--61.
Aiman Moyaid Said, Dhanapal Durai Dominic, and Brahim Belhaouari Samir. 2013. Outlier detection scoring measurements based on frequent pattern technique. Res. J. Appl. Sci. Eng. Technol. 6, 8 (2013), 1340--134.
Arif Sari. 2015. A review of anomaly detection systems in cloud networks and survey of cloud security measures in cloud storage applications. J. Info. Secur. 6, 2 (2015), 142--154.
Debajit Sen Sarma and Samar Sen Sarma. 2015. A survey on different graph based anomaly detection techniques. Indian J. Sci. Technol. 8, 31 (2015), 1--7.
David Savage, Xiuzhen Zhang, Xinghuo Yu, Pauline Chou, and Qingmai Wang. 2014. Anomaly detection in online social networks. Soc. Netw. 39 (2014), 62--70.
Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. 2001. Estimating the support of a high-dimensional distribution. Neural Comput. 13, 7 (2001), 1443--1471.
Junhee Seok and Yeong Seon Kang. 2015. Mutual information between discrete variables with many categories using recursive adaptive partitioning. Sci. Rep. 5 (2015), 1--10.
Nauman Shahid, Ijaz Haider Naqvi, and Saad Bin Qaisar. 2015. Characteristics and classification of outlier detection techniques for wireless sensor networks in harsh environments: A survey. Artific. Intell. Rev. 43, 2 (2015), 193--228.
Claude Elwood Shannon. 1948. A mathematical theory of communication. Bell Tele. Syst. Techn. Publ. 27, 3 (1948), 379--423.
Deep Shikha Shukla, Avinash Chandra Pandey, and Ankur Kulhari. 2014. Outlier detection: A survey on techniques of WSNs involving event and error based outliers. In Proceedings of the International Conference of Innovative Applications of Computational Intelligence on Power, Energy and Controls with their Impact on Humanity (CIPECH’14). 113--116.
M. Shyu, K. Sarinnapakorn, I. Kuruppu-Appuhamilage, S. Chen, L. W. Chang, and T. Goldring. 2005. Handling nominal features in anomaly intrusion detection problems. In Proceedings of the International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications. 55--62.
Karanjit Singh and Shuchita Upadhyaya. 2012. Outlier detection: Applications and techniques. Int. J. Comput. Sci. Iss. 9, 1 (2012), 307--323.
Koen Smets and Jilles Vreeken. 2011. The odd one out: Identifying and characterising anomalies. In Proceedings of the SIAM International Conference on Data Mining (SDM’11). 804--815.
Angela A. Sodemann, Matthew P. Ross, and Brett J. Borghetti. 2012. A review of anomaly detection in automated surveillance. IEEE Trans. Syst. Man Cybernet., Part C: Appl. Rev. 42, 6 (2012), 1257--1272.
Garule Supriya and Sharmila M. Shinde. 2015. Outliers detection using subspace method: A survey. Int. J. Comput. Appl. 112, 16 (2015), 20--22.
N. N. R. R. Suri, M. N. Murty, and G. Athithan. 2012. An algorithm for mining outliers in categorical data through ranking. In Proceedings of the 12th IEEE International Conference on Hybrid Intelligent Systems (HIS’12). 247--252.
N. N. R. R. Suri, M. N. Murty, and G. Athithan. 2013. A rough clustering algorithm for mining outliers in categorical data. In Proceedings of the 4th International Conference on Pattern Recognition and Machine Intelligence (PReMI’13). 170--175.
N. N. R. R. Suri, M. N. Murty, and G. Athithan. 2014. A ranking-based algorithm for detection of outliers in categorical data. Int. J. Hybrid Intell. Syst. 11 (2014), 1--11.
N. N. R. R. Suri, M. N. Murty, and G. Athithan. 2016. Detecting outliers in categorical data through rough clustering. Nat. Comput. 15 (2016), 385--394.
Ayman Taha and Ali S. Hadi. 2013. A general approach for automating outliers identification in categorical data. In Proceedings of the ACS/IEEE International Conference on Computer Systems and Applications (AICCSA’13). 1--8.
Ayman Taha and Ali S. Hadi. 2016. Pair-wise association for categorical and mixed attributes. Info. Sci. 346 (2016), 73--89.
Ayman Taha and Osman Hegazy. 2010. A proposed outliers identification algorithm for categorical data sets. In Proceedings of International Conference on Informatics and Systems (INFOS’10). 1--5.
Yun Wang. 2008. Statistical Techniques for Network Security: Modern Statistically-Based Intrusion Detection and Protection. IGI Global, New York, NY.
Yibo Wang and Wei Xu. 2018. Leveraging deep learning with LDA-based text analytics to detect automobile insurance fraud. Decis. Support Syst. 105 (2018), 87--95.
Li Wei, Weining Qian, Aoying Zhou, Wen Jin, and Jeffrey X. Yu. 2003. Hypergraph-based outlier test for categorical data. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD’03). 399--410.
David J. Weller-Fahy, Brett J. Borghetti, and Angela A. Sodemann. 2015. A survey of distance and similarity measures used within network intrusion anomaly detection. IEEE Commun. Surveys Tutor. 17, 1 (2015), 70--91.
Jarrod West and Maumita Bhattacharya. 2016. Intelligent financial fraud detection: A comprehensive review. Comput. Secur. 57 (2016), 47--66.
Shu Wu and Shengrui Wang. 2011. Parameter-free anomaly detection for categorical data. Machine Learning and Data Mining in Pattern Recognition. Lecture Notes in Computer Science 6871 (2011), 112--126.
Shu Wu and Shengrui Wang. 2013. Information-theoretic outlier detection for large-scale categorical data. IEEE Trans. Knowl. Data Eng. 25, 3 (2013), 589--602.
Warusia Yassin, Nur Izura Udzir, Zaiton Muda, and Nasir Sulaiman. 2013. Anomaly-based intrusion detection through k-means clustering and naives Bayes classification. In Proceedings of the International Conference on Computing and Informatics (ICOCI’13). 298--303.
Jeffrey Xu Yu, Weining Qian, Hongjun Lu, and Aoying Zhou. 2006. Finding centric local outliers in categorical/numerical spaces. Knowl. Info. Syst. 9 (2006), 309--338.
Rose Yu, Huida Qiu, Zhen Wen, Ching-Yung Lin, and Yan Liu. 2016. A survey on social media anomaly detection. Retrieved from http://arxiv.org/pdf/1601.01102.
Ji Zhang. 2013. Advancements of outlier detection: A survey. ICST Trans. Scal. Info. Syst. 13, 1 (2013), 1--26.
Yang Zhang, Nirvana Meratnia, and Paul Havinga. 2010. Outlier detection techniques for wireless sensor networks: A survey. IEEE Commun. Surveys Tutor. 12, 2 (2010), 159--170.
Xingwang Zhao, Jiye Liang, and Fuyuan Cao. 2014. A simple and effective outlier detection algorithm for categorical data. Int. J. Mach. Learn. Cybernet. 5 (2014), 469--477.
Wobbe P. Zijlstra, L. Andries van der Ark, and Klaas Sijtsma. 2011. Outliers in questionnaire data: Can they be detected and should they be removed. J. Edu. Behav. Stat. 36 (2011), 186--212.
Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. 2012. A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5, 5 (2012), 363--387.

Index Terms

Anomaly Detection Methods for Categorical Data: A Review
1. Information systems
  1. Information systems applications
    1. Data mining

Published in
ACM Computing Surveys Volume 52, Issue 2
March 2020
770 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3320149
Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering
Copyright © 2019 ACM
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 30 May 2019
- Revised: 1 January 2019
- Accepted: 1 January 2019
- Received: 1 January 2017

Author Tags
Computational complexity
Shannon entropy
data mining
holo entropy
intrusion detection systems
mixed data
nominal data
novelty detection
outliers detection
semi-supervised learning
supervised learning
unsupervised learning
Qualifiers
- survey
- Research
- Refereed