skip to main content
research-article

NG-DBSCAN: scalable density-based clustering for arbitrary data

Published:01 November 2016Publication History
Skip Abstract Section

Abstract

We present NG-DBSCAN, an approximate density-based clustering algorithm that operates on arbitrary data and any symmetric distance measure. The distributed design of our algorithm makes it scalable to very large datasets; its approximate nature makes it fast, yet capable of producing high quality clustering results. We provide a detailed overview of the steps of NG-DBSCAN, together with their analysis. Our results, obtained through an extensive experimental campaign with real and synthetic data, substantiate our claims about NG-DBSCAN's performance and scalability.

References

  1. Apache Giraph. http://giraph.apache.org/.Google ScholarGoogle Scholar
  2. Apache Spark. https://spark.apache.org.Google ScholarGoogle Scholar
  3. Apache Spark machine learning library. https://spark.apache.org/mllib/.Google ScholarGoogle Scholar
  4. Clustering the News with Spark and MLLib. http://bigdatasciencebootcamp.com/posts/Part_3/clustering_news.html.Google ScholarGoogle Scholar
  5. Word2vector package. https://code.google.com/p/word2vec/.Google ScholarGoogle Scholar
  6. B.-R. Dai et al. Efficient map/reduce-based dbscan algorithm with optimized data partition. In Cloud Computing, IEEE 5th International Conference on, pages 59--66. IEEE, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. B. Desgraupes. Clustering indices. In University of Paris Ouest-Lab ModalX, volume 1, page 34, 2013.Google ScholarGoogle Scholar
  8. W. Dong et al. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pages 577--586. ACM, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Ester et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226--231, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. T. Falkowski et al. Dengraph: A density-based community detection algorithm. In Web Intelligence, IEEE/WIC/ACM International Conference on, pages 112--115. IEEE, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. M. Filippone. Dealing with non-metric dissimilarities in fuzzy central clustering algorithms. International Journal of Approximate Reasoning, 50(2):363--384, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Gan and Y. Tao. Dbscan revisited: mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 519--530. ACM, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. J. E. Gonzalez et al. Graphx: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 599--613, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Y. He et al. Mr-dbscan: an efficient parallel density-based clustering algorithm using mapreduce. In Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, pages 473--480. IEEE, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. M. A. Jaro. Probabilistic linkage of large public health data files. Statistics in medicine, 14(5-7):491--498, 1995.Google ScholarGoogle ScholarCross RefCross Ref
  16. Y. Kim, K. Shim, M.-S. Kim, and J. S. Lee. Dbcure-mr: an efficient density-based clustering algorithm for large data using mapreduce. Information Systems, 42:15--35, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. H. Kwak et al. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, pages 591--600. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu. Understanding of internal clustering validation measures. In ICDM, pages 911--916. IEEE, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. A. Lulli, E. Carlini, P. Dazzi, C. Lucchese, and L. Ricci. Fast connected components computation in large graphs by vertex pruning. IEEE Transactions on Parallel and Distributed systems, page 1, 2016.Google ScholarGoogle Scholar
  20. A. Lulli, T. Debatty, M. Dell'Amico, P. Michiardi, and L. Ricci. Scalable k-nn based text clustering. In Big Data (Big Data), 2015 IEEE International Conference on, pages 958--963. IEEE, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. A. Lulli, L. Ricci, E. Carlini, P. Dazzi, and C. Lucchese. Cracker: Crumbling large graphs into connected components. In 2015 IEEE Symposium on Computers and Communication (ISCC), pages 574--581. IEEE, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. G. Malewicz et al. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 135--146. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. R. R. McCune et al. Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Computing Surveys (CSUR), 48(2):25, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. F. McSherry et al. Scalability! but at what cost? In 15th Workshop on Hot Topics in Operating Systems, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. M. A. Patwary et al. Pardicle: parallel approximate density-based clustering. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 560--571. IEEE Press, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. F. Pedregosa et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825--2830, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. L. M. Rocha, F. A. Cappabianco, and A. X. Falcão. Data clustering as an optimum-path forest problem with applications in image analysis. International Journal of Imaging Systems and Technology, 19(2):50--68, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. T. N. Tran et al. Knn-kernel density-based clustering for high-dimensional multivariate data. Computational Statistics & Data Analysis, 51(2):513--525, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103--111, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. B. Welton et al. Mr. scan: Extreme scale density-based clustering using a tree-based network of gpgpu nodes. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 84. ACM, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. X. Xu, J. Jäger, and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. In High Performance Data Mining, pages 263--290. Springer, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. S. Zhou et al. A neighborhood-based clustering algorithm. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 361--371. 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. NG-DBSCAN: scalable density-based clustering for arbitrary data
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image Proceedings of the VLDB Endowment
        Proceedings of the VLDB Endowment  Volume 10, Issue 3
        November 2016
        216 pages
        ISSN:2150-8097
        Issue’s Table of Contents

        Publisher

        VLDB Endowment

        Publication History

        • Published: 1 November 2016
        Published in pvldb Volume 10, Issue 3

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader