research-article

NG-DBSCAN: scalable density-based clustering for arbitrary data

Authors:
Alessandro Lulli

University of Pisa, Italy and ISTI, CNR, Pisa, Italy

University of Pisa, Italy and ISTI, CNR, Pisa, Italy
View Profile

,
Matteo Dell'Amico

Symantec Research Labs, France

Symantec Research Labs, France
View Profile

,
Pietro Michiardi

EURECOM, Campus SophiaTech, France

EURECOM, Campus SophiaTech, France
View Profile

,
Laura Ricci

University of Pisa, Italy and ISTI, CNR, Pisa, Italy

University of Pisa, Italy and ISTI, CNR, Pisa, Italy
View Profile

Proceedings of the VLDB Endowment Volume 10 Issue 3pp 157–168https://doi.org/10.14778/3021924.3021932

Published:01 November 2016Publication History

Proceedings of the VLDB Endowment

Abstract

We present NG-DBSCAN, an approximate density-based clustering algorithm that operates on arbitrary data and any symmetric distance measure. The distributed design of our algorithm makes it scalable to very large datasets; its approximate nature makes it fast, yet capable of producing high quality clustering results. We provide a detailed overview of the steps of NG-DBSCAN, together with their analysis. Our results, obtained through an extensive experimental campaign with real and synthetic data, substantiate our claims about NG-DBSCAN's performance and scalability.

References

Apache Giraph. http://giraph.apache.org/.Google Scholar
Apache Spark. https://spark.apache.org.Google Scholar
Apache Spark machine learning library. https://spark.apache.org/mllib/.Google Scholar
Clustering the News with Spark and MLLib. http://bigdatasciencebootcamp.com/posts/Part_3/clustering_news.html.Google Scholar
Word2vector package. https://code.google.com/p/word2vec/.Google Scholar
B.-R. Dai et al. Efficient map/reduce-based dbscan algorithm with optimized data partition. In Cloud Computing, IEEE 5th International Conference on, pages 59--66. IEEE, 2012. Google ScholarDigital Library
B. Desgraupes. Clustering indices. In University of Paris Ouest-Lab ModalX, volume 1, page 34, 2013.Google Scholar
W. Dong et al. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pages 577--586. ACM, 2011. Google ScholarDigital Library
M. Ester et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226--231, 1996. Google ScholarDigital Library
T. Falkowski et al. Dengraph: A density-based community detection algorithm. In Web Intelligence, IEEE/WIC/ACM International Conference on, pages 112--115. IEEE, 2007. Google ScholarDigital Library
M. Filippone. Dealing with non-metric dissimilarities in fuzzy central clustering algorithms. International Journal of Approximate Reasoning, 50(2):363--384, 2009. Google ScholarDigital Library
J. Gan and Y. Tao. Dbscan revisited: mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 519--530. ACM, 2015. Google ScholarDigital Library
J. E. Gonzalez et al. Graphx: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 599--613, 2014. Google ScholarDigital Library
Y. He et al. Mr-dbscan: an efficient parallel density-based clustering algorithm using mapreduce. In Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, pages 473--480. IEEE, 2011. Google ScholarDigital Library
M. A. Jaro. Probabilistic linkage of large public health data files. Statistics in medicine, 14(5-7):491--498, 1995.Google ScholarCross Ref
Y. Kim, K. Shim, M.-S. Kim, and J. S. Lee. Dbcure-mr: an efficient density-based clustering algorithm for large data using mapreduce. Information Systems, 42:15--35, 2014. Google ScholarDigital Library
H. Kwak et al. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, pages 591--600. ACM, 2010. Google ScholarDigital Library
Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu. Understanding of internal clustering validation measures. In ICDM, pages 911--916. IEEE, 2010. Google ScholarDigital Library
A. Lulli, E. Carlini, P. Dazzi, C. Lucchese, and L. Ricci. Fast connected components computation in large graphs by vertex pruning. IEEE Transactions on Parallel and Distributed systems, page 1, 2016.Google Scholar
A. Lulli, T. Debatty, M. Dell'Amico, P. Michiardi, and L. Ricci. Scalable k-nn based text clustering. In Big Data (Big Data), 2015 IEEE International Conference on, pages 958--963. IEEE, 2015. Google ScholarDigital Library
A. Lulli, L. Ricci, E. Carlini, P. Dazzi, and C. Lucchese. Cracker: Crumbling large graphs into connected components. In 2015 IEEE Symposium on Computers and Communication (ISCC), pages 574--581. IEEE, 2015. Google ScholarDigital Library
G. Malewicz et al. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 135--146. ACM, 2010. Google ScholarDigital Library
R. R. McCune et al. Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Computing Surveys (CSUR), 48(2):25, 2015. Google ScholarDigital Library
F. McSherry et al. Scalability! but at what cost? In 15th Workshop on Hot Topics in Operating Systems, 2015. Google ScholarDigital Library
M. M. A. Patwary et al. Pardicle: parallel approximate density-based clustering. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 560--571. IEEE Press, 2014. Google ScholarDigital Library
F. Pedregosa et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825--2830, 2011. Google ScholarDigital Library
L. M. Rocha, F. A. Cappabianco, and A. X. Falcão. Data clustering as an optimum-path forest problem with applications in image analysis. International Journal of Imaging Systems and Technology, 19(2):50--68, 2009. Google ScholarDigital Library
T. N. Tran et al. Knn-kernel density-based clustering for high-dimensional multivariate data. Computational Statistics & Data Analysis, 51(2):513--525, 2006. Google ScholarDigital Library
L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103--111, 1990. Google ScholarDigital Library
B. Welton et al. Mr. scan: Extreme scale density-based clustering using a tree-based network of gpgpu nodes. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 84. ACM, 2013. Google ScholarDigital Library
X. Xu, J. Jäger, and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. In High Performance Data Mining, pages 263--290. Springer, 1999. Google ScholarDigital Library
S. Zhou et al. A neighborhood-based clustering algorithm. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 361--371. 2005. Google ScholarDigital Library

Index Terms

NG-DBSCAN: scalable density-based clustering for arbitrary data
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information systems applications
    1. Data mining
      1. Clustering

Index terms have been assigned to the content through auto-classification.

Recommendations

AA-DBSCAN: an approximate adaptive DBSCAN for finding clusters with varying densities

Clustering is a typical data mining technique that partitions a dataset into multiple subsets of similar objects according to similarity metrics. In particular, density-based algorithms can find clusters of different shapes and sizes while remaining ...
Read More
Restricted Randomness DBSCAN : A faster DBSCAN Algorithm
IC3-2021: Proceedings of the 2021 Thirteenth International Conference on Contemporary Computing

Data Mining is the process of extracting useful and accurate information or patterns from large databases using different algorithms and methods of machine learning. To analyze the data, Clustering is one of the methods in which similar data is grouped ...
Read More
Rough-DBSCAN: A fast hybrid density based clustering method for large data sets

Density based clustering techniques like DBSCAN are attractive because it can find arbitrary shaped clusters along with noisy outliers. Its time requirement is O(n^2) where n is the size of the dataset, and because of this it is not a suitable one to ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 10, Issue 3
November 2016
216 pages
ISSN:2150-8097
Editors:
Peter Boncz
CWI
,
Ken Salem
University of Waterloo
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 November 2016
Published in pvldb Volume 10, Issue 3
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 23
  Total Citations
  View Citations
- 442
  Total Downloads
- Downloads (Last 12 months)36
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

NG-DBSCAN: scalable density-based clustering for arbitrary data

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

AA-DBSCAN: an approximate adaptive DBSCAN for finding clusters with varying densities

Restricted Randomness DBSCAN : A faster DBSCAN Algorithm

Rough-DBSCAN: A fast hybrid density based clustering method for large data sets

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

NG-DBSCAN: scalable density-based clustering for arbitrary data

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

AA-DBSCAN: an approximate adaptive DBSCAN for finding clusters with varying densities

Restricted Randomness DBSCAN : A faster DBSCAN Algorithm

Rough-DBSCAN: A fast hybrid density based clustering method for large data sets

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media