Abstract
We study the problem of k-means clustering in the presence of outliers. The goal is to cluster a set of data points to minimize the variance of the points assigned to the same cluster, with the freedom of ignoring a small set of data points that can be labeled as outliers. Clustering with outliers has received a lot of attention in the data processing community, but practical, efficient, and provably good algorithms remain unknown for the most popular k-means objective.
Our work proposes a simple local search-based algorithm for k-means clustering with outliers. We prove that this algorithm achieves constant-factor approximate solutions and can be combined with known sketching techniques to scale to large data sets. Using empirical evaluation on both synthetic and large-scale real-world data, we demonstrate that the algorithm dominates recently proposed heuristic approaches for the problem.
- A. Aggarwal, A. Deshpande, and R. Kannan. Adaptive sampling for k-means clustering. In RANDOM, pages 15--28, 2009. Google ScholarDigital Library
- C. Aggarwal and P. Yu. Outlier detection for high dimensional data. In SIGMOD, pages 37--46, 2001. Google ScholarDigital Library
- C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications. Chapman and Hall/CRC, 2013. Google ScholarDigital Library
- A. Arning, R. Agrawal, and P. Raghavan. A linear method for deviation detection in large databases. In KDD, pages 164--169, 1996. Google ScholarDigital Library
- D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, pages 1027--1035, 2007. Google ScholarDigital Library
- B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-means++. PVLDB, 5(7):622--633, 2012. Google ScholarDigital Library
- S. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In KDD, pages 29--38, 2003. Google ScholarDigital Library
- P. Berkhin. Survey of clustering data mining techniques. In J. Kogan, C. K. Nicholas, and M. Teboulle, editors, Grouping Multidimensional Data: Recent Advances in Clustering. Springer, 2006.Google ScholarCross Ref
- C. Bohm, C. Faloutsos, and C. Plant. Outlier-robust clustering using independent components. In SIGMOD, pages 185--198, 2008. Google ScholarDigital Library
- M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD Record, 29(2):93--104, 2000. Google ScholarDigital Library
- M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan. Algorithms for facility location problems with outliers. In SODA, pages 642--651, 2001. Google ScholarDigital Library
- S. Chawla and A. Gionis. k-means-: A unified approach to clustering and outlier detection. In ICDM, pages 189--197, 2013.Google ScholarCross Ref
- K. Chen. A constant factor approximation algorithm for k-median clustering with outliers. In SODA, pages 826--835, 2008. Google ScholarDigital Library
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In AAAI, pages 226--231, 1996. Google ScholarDigital Library
- F. Farnstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explor. Newsl., 2:51--57, 2000. Google ScholarDigital Library
- D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In SOCG, pages 11--18, 2007. Google ScholarDigital Library
- S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. TKDE, 15(3):515--528, 2003. Google ScholarDigital Library
- S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In SIGMOD, pages 73--84, 1998. Google ScholarDigital Library
- S. Guinepain and L. Gruenwald. Research issues in automatic database clustering. SIGMOD Record, 34(1):33--38, 2005. Google ScholarDigital Library
- A. Gupta and K. Tangwongsan. Simpler analyses of local search algorithms for facility location. CoRR, abs/0809.2554, 2008.Google Scholar
- S. Har-Peled and A. Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3--19, 2007.Google ScholarDigital Library
- A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In KDD, pages 58--65, 1998. Google ScholarDigital Library
- A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31:264--323, 1999. Google ScholarDigital Library
- T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2--3):89--112, 2004. Google ScholarDigital Library
- E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In VLDB, pages 392--403, 1998. Google ScholarDigital Library
- E. M. Knorr and R. T. Ng. A unified notion of outliers: Properties and computation. In KDD, pages 19--22, 1997. Google ScholarDigital Library
- A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1+ϵ)-approximation algorithm for k-means clustering in any dimensions. In FOCS, pages 454--462, 2004. Google ScholarDigital Library
- M. Lichman. UCI machine learning repository, 2013.Google Scholar
- S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129--136, 1982. Google ScholarDigital Library
- K. E. M. and N. R. T. Finding intensional knowledge of distance-based outliers. In VLDB, pages 211--222, 1999. Google ScholarDigital Library
- G. Malkomes, M. Kusner, W. Chen, K. Weinberger, and B. Moseley. Fast distributed k-center clustering with outliers on massive data. In NIPS, pages 1063--1071, 2015. Google ScholarDigital Library
- R. M. McCutchen and S. Khuller. Streaming algorithms for k-center clustering with outliers and with anonymity. In APPROX, pages 165--178, 2008. Google ScholarDigital Library
- C. Ordonez. Integrating k-means clustering with a relational DBMS using SQL. TKDE, 18:188--201, 2006. Google ScholarDigital Library
- C. Ordonez and P. Cereghini. SQLEM: Fast clustering in SQL using the EM algorithm. In SIGMOD, pages 559--570, 2000. Google ScholarDigital Library
- C. Ordonez and E. Omiecinski. Efficient disk-based k-means clustering for relational databases. TKDE, 16:909--921, 2004. Google ScholarDigital Library
- L. Ott, L. Pang, F. T. Ramos, and S. Chawla. On integrated clustering and outlier detection. In NIPS, pages 1359--1367, 2014. Google ScholarDigital Library
- S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In ICDE, pages 315--326, 2003.Google ScholarCross Ref
- S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In SIGMOD, pages 427--438, 2000. Google ScholarDigital Library
- X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst., 14:1--37, 2007. Google ScholarDigital Library
Index Terms
- Local search methods for k-means with outliers
Recommendations
Relocating Local Outliers Produced by Partitioning Methods
ICCBD 2019: Proceedings of the 2nd International Conference on Computing and Big DataPartitioning methods is one of the most known and most used data clustering methods. The cluster center, known as centroid or medoid, represents each cluster that these methods produced. When the cluster center cannot describe all the data in the group, ...
Local Search Algorithm for the Spherical k-Means Problem with Outliers
Algorithmic Aspects in Information and ManagementAbstractWe study the spherical k-means problem with outliers, a variant of the classical k-means problem, in which data points are on the unit sphere and a small set of points called outliers (as a constraint, the number of outliers can not be greater ...
An approximation algorithm for the spherical k-means problem with outliers by local search
AbstractWe consider the spherical k-means problem with outliers, an extension of the k-means problem. In this clustering problem, all sample points are on the unit sphere. Given two integers k and z, we can ignore at most z points (outliers) and need to ...
Comments