Abstract
Ensemble methods for unsupervised outlier detection are an emerging topic that has been neglected for a surprisingly long time, although there are reasons why building them is more difficult than building supervised ensembles or even clustering ensembles. Aggarwal recently discussed algorithmic patterns of outlier detection ensembles, identified traces of the idea in the literature, and remarked on which concepts from supervised ensembles are likely, and which unlikely, to transfer in future work. Complementary to his points, we focus here on the core ingredients for building an outlier ensemble, discuss the first steps taken in the literature, and identify challenges for future research.
Index Terms
- Ensembles for unsupervised outlier detection: challenges and research questions a position paper