Abstract
We present NG-DBSCAN, an approximate density-based clustering algorithm that operates on arbitrary data and any symmetric distance measure. The distributed design of our algorithm makes it scalable to very large datasets; its approximate nature makes it fast, yet capable of producing high quality clustering results. We provide a detailed overview of the steps of NG-DBSCAN, together with their analysis. Our results, obtained through an extensive experimental campaign with real and synthetic data, substantiate our claims about NG-DBSCAN's performance and scalability.
- Apache Giraph. http://giraph.apache.org/.Google Scholar
- Apache Spark. https://spark.apache.org.Google Scholar
- Apache Spark machine learning library. https://spark.apache.org/mllib/.Google Scholar
- Clustering the News with Spark and MLLib. http://bigdatasciencebootcamp.com/posts/Part_3/clustering_news.html.Google Scholar
- Word2vector package. https://code.google.com/p/word2vec/.Google Scholar
- B.-R. Dai et al. Efficient map/reduce-based dbscan algorithm with optimized data partition. In Cloud Computing, IEEE 5th International Conference on, pages 59--66. IEEE, 2012. Google ScholarDigital Library
- B. Desgraupes. Clustering indices. In University of Paris Ouest-Lab ModalX, volume 1, page 34, 2013.Google Scholar
- W. Dong et al. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pages 577--586. ACM, 2011. Google ScholarDigital Library
- M. Ester et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226--231, 1996. Google ScholarDigital Library
- T. Falkowski et al. Dengraph: A density-based community detection algorithm. In Web Intelligence, IEEE/WIC/ACM International Conference on, pages 112--115. IEEE, 2007. Google ScholarDigital Library
- M. Filippone. Dealing with non-metric dissimilarities in fuzzy central clustering algorithms. International Journal of Approximate Reasoning, 50(2):363--384, 2009. Google ScholarDigital Library
- J. Gan and Y. Tao. Dbscan revisited: mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 519--530. ACM, 2015. Google ScholarDigital Library
- J. E. Gonzalez et al. Graphx: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 599--613, 2014. Google ScholarDigital Library
- Y. He et al. Mr-dbscan: an efficient parallel density-based clustering algorithm using mapreduce. In Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, pages 473--480. IEEE, 2011. Google ScholarDigital Library
- M. A. Jaro. Probabilistic linkage of large public health data files. Statistics in medicine, 14(5-7):491--498, 1995.Google ScholarCross Ref
- Y. Kim, K. Shim, M.-S. Kim, and J. S. Lee. Dbcure-mr: an efficient density-based clustering algorithm for large data using mapreduce. Information Systems, 42:15--35, 2014. Google ScholarDigital Library
- H. Kwak et al. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, pages 591--600. ACM, 2010. Google ScholarDigital Library
- Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu. Understanding of internal clustering validation measures. In ICDM, pages 911--916. IEEE, 2010. Google ScholarDigital Library
- A. Lulli, E. Carlini, P. Dazzi, C. Lucchese, and L. Ricci. Fast connected components computation in large graphs by vertex pruning. IEEE Transactions on Parallel and Distributed systems, page 1, 2016.Google Scholar
- A. Lulli, T. Debatty, M. Dell'Amico, P. Michiardi, and L. Ricci. Scalable k-nn based text clustering. In Big Data (Big Data), 2015 IEEE International Conference on, pages 958--963. IEEE, 2015. Google ScholarDigital Library
- A. Lulli, L. Ricci, E. Carlini, P. Dazzi, and C. Lucchese. Cracker: Crumbling large graphs into connected components. In 2015 IEEE Symposium on Computers and Communication (ISCC), pages 574--581. IEEE, 2015. Google ScholarDigital Library
- G. Malewicz et al. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 135--146. ACM, 2010. Google ScholarDigital Library
- R. R. McCune et al. Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Computing Surveys (CSUR), 48(2):25, 2015. Google ScholarDigital Library
- F. McSherry et al. Scalability! but at what cost? In 15th Workshop on Hot Topics in Operating Systems, 2015. Google ScholarDigital Library
- M. M. A. Patwary et al. Pardicle: parallel approximate density-based clustering. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 560--571. IEEE Press, 2014. Google ScholarDigital Library
- F. Pedregosa et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825--2830, 2011. Google ScholarDigital Library
- L. M. Rocha, F. A. Cappabianco, and A. X. Falcão. Data clustering as an optimum-path forest problem with applications in image analysis. International Journal of Imaging Systems and Technology, 19(2):50--68, 2009. Google ScholarDigital Library
- T. N. Tran et al. Knn-kernel density-based clustering for high-dimensional multivariate data. Computational Statistics & Data Analysis, 51(2):513--525, 2006. Google ScholarDigital Library
- L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103--111, 1990. Google ScholarDigital Library
- B. Welton et al. Mr. scan: Extreme scale density-based clustering using a tree-based network of gpgpu nodes. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 84. ACM, 2013. Google ScholarDigital Library
- X. Xu, J. Jäger, and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. In High Performance Data Mining, pages 263--290. Springer, 1999. Google ScholarDigital Library
- S. Zhou et al. A neighborhood-based clustering algorithm. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 361--371. 2005. Google ScholarDigital Library
Index Terms
- NG-DBSCAN: scalable density-based clustering for arbitrary data
Recommendations
AA-DBSCAN: an approximate adaptive DBSCAN for finding clusters with varying densities
Clustering is a typical data mining technique that partitions a dataset into multiple subsets of similar objects according to similarity metrics. In particular, density-based algorithms can find clusters of different shapes and sizes while remaining ...
Restricted Randomness DBSCAN : A faster DBSCAN Algorithm
IC3-2021: Proceedings of the 2021 Thirteenth International Conference on Contemporary ComputingData Mining is the process of extracting useful and accurate information or patterns from large databases using different algorithms and methods of machine learning. To analyze the data, Clustering is one of the methods in which similar data is grouped ...
Rough-DBSCAN: A fast hybrid density based clustering method for large data sets
Density based clustering techniques like DBSCAN are attractive because it can find arbitrary shaped clusters along with noisy outliers. Its time requirement is O(n^2) where n is the size of the dataset, and because of this it is not a suitable one to ...
Comments