skip to main content
research-article

Local search methods for k-means with outliers

Published:01 March 2017Publication History
Skip Abstract Section

Abstract

We study the problem of k-means clustering in the presence of outliers. The goal is to cluster a set of data points to minimize the variance of the points assigned to the same cluster, with the freedom of ignoring a small set of data points that can be labeled as outliers. Clustering with outliers has received a lot of attention in the data processing community, but practical, efficient, and provably good algorithms remain unknown for the most popular k-means objective.

Our work proposes a simple local search-based algorithm for k-means clustering with outliers. We prove that this algorithm achieves constant-factor approximate solutions and can be combined with known sketching techniques to scale to large data sets. Using empirical evaluation on both synthetic and large-scale real-world data, we demonstrate that the algorithm dominates recently proposed heuristic approaches for the problem.

References

  1. A. Aggarwal, A. Deshpande, and R. Kannan. Adaptive sampling for k-means clustering. In RANDOM, pages 15--28, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. C. Aggarwal and P. Yu. Outlier detection for high dimensional data. In SIGMOD, pages 37--46, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications. Chapman and Hall/CRC, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Arning, R. Agrawal, and P. Raghavan. A linear method for deviation detection in large databases. In KDD, pages 164--169, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, pages 1027--1035, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-means++. PVLDB, 5(7):622--633, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In KDD, pages 29--38, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. P. Berkhin. Survey of clustering data mining techniques. In J. Kogan, C. K. Nicholas, and M. Teboulle, editors, Grouping Multidimensional Data: Recent Advances in Clustering. Springer, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  9. C. Bohm, C. Faloutsos, and C. Plant. Outlier-robust clustering using independent components. In SIGMOD, pages 185--198, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD Record, 29(2):93--104, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan. Algorithms for facility location problems with outliers. In SODA, pages 642--651, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Chawla and A. Gionis. k-means-: A unified approach to clustering and outlier detection. In ICDM, pages 189--197, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  13. K. Chen. A constant factor approximation algorithm for k-median clustering with outliers. In SODA, pages 826--835, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In AAAI, pages 226--231, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. F. Farnstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explor. Newsl., 2:51--57, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In SOCG, pages 11--18, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. TKDE, 15(3):515--528, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In SIGMOD, pages 73--84, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. S. Guinepain and L. Gruenwald. Research issues in automatic database clustering. SIGMOD Record, 34(1):33--38, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. Gupta and K. Tangwongsan. Simpler analyses of local search algorithms for facility location. CoRR, abs/0809.2554, 2008.Google ScholarGoogle Scholar
  21. S. Har-Peled and A. Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3--19, 2007.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In KDD, pages 58--65, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31:264--323, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2--3):89--112, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In VLDB, pages 392--403, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. E. M. Knorr and R. T. Ng. A unified notion of outliers: Properties and computation. In KDD, pages 19--22, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1+ϵ)-approximation algorithm for k-means clustering in any dimensions. In FOCS, pages 454--462, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. M. Lichman. UCI machine learning repository, 2013.Google ScholarGoogle Scholar
  29. S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129--136, 1982. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. K. E. M. and N. R. T. Finding intensional knowledge of distance-based outliers. In VLDB, pages 211--222, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. G. Malkomes, M. Kusner, W. Chen, K. Weinberger, and B. Moseley. Fast distributed k-center clustering with outliers on massive data. In NIPS, pages 1063--1071, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. R. M. McCutchen and S. Khuller. Streaming algorithms for k-center clustering with outliers and with anonymity. In APPROX, pages 165--178, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. C. Ordonez. Integrating k-means clustering with a relational DBMS using SQL. TKDE, 18:188--201, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. C. Ordonez and P. Cereghini. SQLEM: Fast clustering in SQL using the EM algorithm. In SIGMOD, pages 559--570, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. C. Ordonez and E. Omiecinski. Efficient disk-based k-means clustering for relational databases. TKDE, 16:909--921, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. L. Ott, L. Pang, F. T. Ramos, and S. Chawla. On integrated clustering and outlier detection. In NIPS, pages 1359--1367, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In ICDE, pages 315--326, 2003.Google ScholarGoogle ScholarCross RefCross Ref
  38. S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In SIGMOD, pages 427--438, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst., 14:1--37, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Local search methods for k-means with outliers
    Index terms have been assigned to the content through auto-classification.

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image Proceedings of the VLDB Endowment
      Proceedings of the VLDB Endowment  Volume 10, Issue 7
      March 2017
      132 pages
      ISSN:2150-8097
      Issue’s Table of Contents

      Publisher

      VLDB Endowment

      Publication History

      • Published: 1 March 2017
      Published in pvldb Volume 10, Issue 7

      Qualifiers

      • research-article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader