research-article

Local search methods for k-means with outliers

Authors:
Shalmoli Gupta

University of Illinois

University of Illinois
View Profile

,
Ravi Kumar

Google

Google
View Profile

,
Kefu Lu

Washington University

Washington University
View Profile

,
Benjamin Moseley

Washington University

Washington University
View Profile

,
Sergei Vassilvitskii

Google

Google
View Profile

Proceedings of the VLDB Endowment Volume 10 Issue 7pp 757–768https://doi.org/10.14778/3067421.3067425

Published:01 March 2017Publication History

Proceedings of the VLDB Endowment

Abstract

We study the problem of k-means clustering in the presence of outliers. The goal is to cluster a set of data points to minimize the variance of the points assigned to the same cluster, with the freedom of ignoring a small set of data points that can be labeled as outliers. Clustering with outliers has received a lot of attention in the data processing community, but practical, efficient, and provably good algorithms remain unknown for the most popular k-means objective.

Our work proposes a simple local search-based algorithm for k-means clustering with outliers. We prove that this algorithm achieves constant-factor approximate solutions and can be combined with known sketching techniques to scale to large data sets. Using empirical evaluation on both synthetic and large-scale real-world data, we demonstrate that the algorithm dominates recently proposed heuristic approaches for the problem.

References

A. Aggarwal, A. Deshpande, and R. Kannan. Adaptive sampling for k-means clustering. In RANDOM, pages 15--28, 2009. Google ScholarDigital Library
C. Aggarwal and P. Yu. Outlier detection for high dimensional data. In SIGMOD, pages 37--46, 2001. Google ScholarDigital Library
C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications. Chapman and Hall/CRC, 2013. Google ScholarDigital Library
A. Arning, R. Agrawal, and P. Raghavan. A linear method for deviation detection in large databases. In KDD, pages 164--169, 1996. Google ScholarDigital Library
D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, pages 1027--1035, 2007. Google ScholarDigital Library
B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable k-means++. PVLDB, 5(7):622--633, 2012. Google ScholarDigital Library
S. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In KDD, pages 29--38, 2003. Google ScholarDigital Library
P. Berkhin. Survey of clustering data mining techniques. In J. Kogan, C. K. Nicholas, and M. Teboulle, editors, Grouping Multidimensional Data: Recent Advances in Clustering. Springer, 2006.Google ScholarCross Ref
C. Bohm, C. Faloutsos, and C. Plant. Outlier-robust clustering using independent components. In SIGMOD, pages 185--198, 2008. Google ScholarDigital Library
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD Record, 29(2):93--104, 2000. Google ScholarDigital Library
M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan. Algorithms for facility location problems with outliers. In SODA, pages 642--651, 2001. Google ScholarDigital Library
S. Chawla and A. Gionis. k-means-: A unified approach to clustering and outlier detection. In ICDM, pages 189--197, 2013.Google ScholarCross Ref
K. Chen. A constant factor approximation algorithm for k-median clustering with outliers. In SODA, pages 826--835, 2008. Google ScholarDigital Library
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In AAAI, pages 226--231, 1996. Google ScholarDigital Library
F. Farnstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explor. Newsl., 2:51--57, 2000. Google ScholarDigital Library
D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In SOCG, pages 11--18, 2007. Google ScholarDigital Library
S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams: Theory and practice. TKDE, 15(3):515--528, 2003. Google ScholarDigital Library
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In SIGMOD, pages 73--84, 1998. Google ScholarDigital Library
S. Guinepain and L. Gruenwald. Research issues in automatic database clustering. SIGMOD Record, 34(1):33--38, 2005. Google ScholarDigital Library
A. Gupta and K. Tangwongsan. Simpler analyses of local search algorithms for facility location. CoRR, abs/0809.2554, 2008.Google Scholar
S. Har-Peled and A. Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3--19, 2007.Google ScholarDigital Library
A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In KDD, pages 58--65, 1998. Google ScholarDigital Library
A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31:264--323, 1999. Google ScholarDigital Library
T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2--3):89--112, 2004. Google ScholarDigital Library
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In VLDB, pages 392--403, 1998. Google ScholarDigital Library
E. M. Knorr and R. T. Ng. A unified notion of outliers: Properties and computation. In KDD, pages 19--22, 1997. Google ScholarDigital Library
A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1+ϵ)-approximation algorithm for k-means clustering in any dimensions. In FOCS, pages 454--462, 2004. Google ScholarDigital Library
M. Lichman. UCI machine learning repository, 2013.Google Scholar
S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129--136, 1982. Google ScholarDigital Library
K. E. M. and N. R. T. Finding intensional knowledge of distance-based outliers. In VLDB, pages 211--222, 1999. Google ScholarDigital Library
G. Malkomes, M. Kusner, W. Chen, K. Weinberger, and B. Moseley. Fast distributed k-center clustering with outliers on massive data. In NIPS, pages 1063--1071, 2015. Google ScholarDigital Library
R. M. McCutchen and S. Khuller. Streaming algorithms for k-center clustering with outliers and with anonymity. In APPROX, pages 165--178, 2008. Google ScholarDigital Library
C. Ordonez. Integrating k-means clustering with a relational DBMS using SQL. TKDE, 18:188--201, 2006. Google ScholarDigital Library
C. Ordonez and P. Cereghini. SQLEM: Fast clustering in SQL using the EM algorithm. In SIGMOD, pages 559--570, 2000. Google ScholarDigital Library
C. Ordonez and E. Omiecinski. Efficient disk-based k-means clustering for relational databases. TKDE, 16:909--921, 2004. Google ScholarDigital Library
L. Ott, L. Pang, F. T. Ramos, and S. Chawla. On integrated clustering and outlier detection. In NIPS, pages 1359--1367, 2014. Google ScholarDigital Library
S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In ICDE, pages 315--326, 2003.Google ScholarCross Ref
S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In SIGMOD, pages 427--438, 2000. Google ScholarDigital Library
X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst., 14:1--37, 2007. Google ScholarDigital Library

Index Terms

Local search methods for k-means with outliers
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis

Index terms have been assigned to the content through auto-classification.

Recommendations

Relocating Local Outliers Produced by Partitioning Methods
ICCBD 2019: Proceedings of the 2nd International Conference on Computing and Big Data

Partitioning methods is one of the most known and most used data clustering methods. The cluster center, known as centroid or medoid, represents each cluster that these methods produced. When the cluster center cannot describe all the data in the group, ...
Read More
Local Search Algorithm for the Spherical k-Means Problem with Outliers
Algorithmic Aspects in Information and Management
Abstract
We study the spherical k-means problem with outliers, a variant of the classical k-means problem, in which data points are on the unit sphere and a small set of points called outliers (as a constraint, the number of outliers can not be greater ...
Read More
An approximation algorithm for the spherical k-means problem with outliers by local search
Abstract
We consider the spherical k-means problem with outliers, an extension of the k-means problem. In this clustering problem, all sample points are on the unit sphere. Given two integers k and z, we can ignore at most z points (outliers) and need to ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 10, Issue 7
March 2017
132 pages
ISSN:2150-8097
Editors:
Peter Boncz
CWI
,
Ken Salem
University of Waterloo
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 March 2017
Published in pvldb Volume 10, Issue 7
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 25
  Total Citations
  View Citations
- 535
  Total Downloads
- Downloads (Last 12 months)41
- Downloads (Last 6 weeks)10
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Local search methods for k-means with outliers

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Relocating Local Outliers Produced by Partitioning Methods

Local Search Algorithm for the Spherical k-Means Problem with Outliers

An approximation algorithm for the spherical k-means problem with outliers by local search

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Local search methods for k-means with outliers

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Relocating Local Outliers Produced by Partitioning Methods

Local Search Algorithm for the Spherical k-Means Problem with Outliers

An approximation algorithm for the spherical k-means problem with outliers by local search

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media