Ensembles for unsupervised outlier detection: challenges and research questions [position paper]

Published: 17 March 2014

Abstract

Ensembles for unsupervised outlier detection is an emerging topic that has been neglected for a surprisingly long time (although there are reasons why this is more difficult than supervised ensembles or even clustering ensembles). Aggarwal recently discussed algorithmic patterns of outlier detection ensembles, identified traces of the idea in the literature, and remarked on potential as well as unlikely avenues for future transfer of concepts from supervised ensembles. Complementary to his points, here we focus on the core ingredients for building an outlier ensemble, discuss the first steps taken in the literature, and identify challenges for future research.



Published in

ACM SIGKDD Explorations Newsletter, Volume 15, Issue 1
June 2013
50 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/2594473

Copyright © 2014 Authors

Publisher

Association for Computing Machinery
New York, NY, United States
