Skip to main content

2018 | OriginalPaper | Buchkapitel

A Fast DBSCAN Algorithm with Spark Implementation

verfasst von : Dianwei Han, Ankit Agrawal, Wei-keng Liao, Alok Choudhary

Erschienen in: Big Data in Engineering Applications

Verlag: Springer Singapore

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

DBSCAN is a well-known clustering algorithm which is based on density and is able to identify arbitrary shaped clusters and eliminate noise data. Parallelization of DBSCAN is a challenging work because there is an inherent sequential data access order and based on MPI or OpenMP environments, there exist the issues of lack of fault-tolerance and there is no guarantee that workload is balanced. Moreover, programming with MPI requires data scientists to handle communication between nodes which is a big challenge. We present a new parallel DBSCAN algorithm using Spark. kd-tree technique is applied in our algorithm to reduce search time. More specifically, a novel merge approach is used so that no communication between executors is required while partial clusters are generated. Appropriate and efficient data structures are carefully used in our study: Using Queue to contain neighbors of the data point, and using Hashtable when checking the status of and processing the data points. Also other advanced data structures from Spark are applied to make our implementation more effective. We implement the algorithm in Java and evaluate its scalability by using different number of processing cores. Our experiments demonstrate that the algorithm we propose scales up very well. Using data sets containing up to 1 million high-dimensional points, we show that our proposed algorithm achieves speedups up to 6 using 8 cores (10 k), 10 using 32 cores (100 k), and 137 using 512 cores (1 m). Another experiment using 10 k data points is conducted and the result shows that the algorithm with MapReduce achieves speedups to 1.3 using 2 cores, 2.0 using 4 cores, and 3.2 using 8 cores.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Agrawal, R., & Srikant, R. (1994). Quest synthetic data generator, IBM Almaden Research Center. Agrawal, R., & Srikant, R. (1994). Quest synthetic data generator, IBM Almaden Research Center.
2.
Zurück zum Zitat Beckmann, N., et al. (1990). The r*-tree: An efficient and robust access method for points and rectangles. In: Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (Vol. 19, no. 2, pp. 323–331). Beckmann, N., et al. (1990). The r*-tree: An efficient and robust access method for points and rectangles. In: Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (Vol. 19, no. 2, pp. 323–331).
3.
Zurück zum Zitat Bentley, J. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9), 509–517.CrossRefMATH Bentley, J. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9), 509–517.CrossRefMATH
4.
Zurück zum Zitat Brecheisen, S., et al. (2006). Parallel density-based clustering of complex objects. Advances in Knowledge Discovery and Data Mining, pp. 179–188. Brecheisen, S., et al. (2006). Parallel density-based clustering of complex objects. Advances in Knowledge Discovery and Data Mining, pp. 179–188.
6.
Zurück zum Zitat Ester, M., et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (Vol. 1996, pp. 226–231). AAAI Press. Ester, M., et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (Vol. 1996, pp. 226–231). AAAI Press.
7.
Zurück zum Zitat Fu, Y., et al. (2011). Research on parallel DBSCAN algorithm design based on mapreduce. Advanced Materials Research 301, 1133–1138.CrossRef Fu, Y., et al. (2011). Research on parallel DBSCAN algorithm design based on mapreduce. Advanced Materials Research 301, 1133–1138.CrossRef
8.
Zurück zum Zitat Han, J., et al. (2011). Data mining: Concepts and Techniques. Morgan Kaufmann. Han, J., et al. (2011). Data mining: Concepts and Techniques. Morgan Kaufmann.
9.
Zurück zum Zitat He, Y., et al. (2014). MR-DBSCAN: A scalable mapreduce-based DBSCAN algorithm for heavily skewed data. Frontiers of Computer Science, 8(1), 83–99.MathSciNetCrossRef He, Y., et al. (2014). MR-DBSCAN: A scalable mapreduce-based DBSCAN algorithm for heavily skewed data. Frontiers of Computer Science, 8(1), 83–99.MathSciNetCrossRef
11.
Zurück zum Zitat Kang, S. J., et al. (2015). Performance comparison of OpenMP, MPI, and MapReduce in practical problems. Advances in Multimedia 2015.CrossRef Kang, S. J., et al. (2015). Performance comparison of OpenMP, MPI, and MapReduce in practical problems. Advances in Multimedia 2015.CrossRef
12.
Zurück zum Zitat Karau, H., et al. (2015). Learning Spark: Lightning-fast Data Analysis. O’Reilly Media. Karau, H., et al. (2015). Learning Spark: Lightning-fast Data Analysis. O’Reilly Media.
13.
Zurück zum Zitat MacQueen, J., et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 281–297). USA. MacQueen, J., et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 281–297). USA.
14.
Zurück zum Zitat Noticewala, M., & Vaghela, D. (2014). MR-IDBSCAN: Efficient parallel incremental DBSCAN algorithm using mapreduce. International Journal of Computer Applications 93(4), 13–17.CrossRef Noticewala, M., & Vaghela, D. (2014). MR-IDBSCAN: Efficient parallel incremental DBSCAN algorithm using mapreduce. International Journal of Computer Applications 93(4), 13–17.CrossRef
15.
Zurück zum Zitat Patwary, M. M. A., et al. (2012). A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, pp. 62:1–62:11. IEEE Computer Society Press. Patwary, M. M. A., et al. (2012). A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, pp. 62:1–62:11. IEEE Computer Society Press.
16.
Zurück zum Zitat Pisharath, J., et al. (2010). NU-MineBench 3.0. Technical Report CUCIS-2005-08-01, Northwestern University (Technical Report). Pisharath, J., et al. (2010). NU-MineBench 3.0. Technical Report CUCIS-2005-08-01, Northwestern University (Technical Report).
17.
Zurück zum Zitat Sakr, S., & Gaber, M. M. (2014). Large Scale and Big Data: Processing and Management. CRC Press. Sakr, S., & Gaber, M. M. (2014). Large Scale and Big Data: Processing and Management. CRC Press.
19.
Zurück zum Zitat Sheikholeslami, G., et al. (2000). WaveCluster: A wavelet based clustering approach for spatial data in very large databases. The VLDB Journal, 8(3), 289–304.CrossRef Sheikholeslami, G., et al. (2000). WaveCluster: A wavelet based clustering approach for spatial data in very large databases. The VLDB Journal, 8(3), 289–304.CrossRef
20.
Zurück zum Zitat Tan, P., et al. (2005). Introduction to Data Mining. Pearson. Tan, P., et al. (2005). Introduction to Data Mining. Pearson.
21.
Zurück zum Zitat White, T. (2011). Hadoop: The Definitive Guide. O’Reilly Media. White, T. (2011). Hadoop: The Definitive Guide. O’Reilly Media.
22.
Zurück zum Zitat Zaharia, M., et al. (2012) Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (pp. 2–2). USENIX Association. Zaharia, M., et al. (2012) Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (pp. 2–2). USENIX Association.
23.
Zurück zum Zitat Zaharia, M. (2014). An Architecture for Fast and General Data Processing on Large Clusters. Technical Report UCB/EECS-2014-12, University of California, Berkeley (Technical Report). Zaharia, M. (2014). An Architecture for Fast and General Data Processing on Large Clusters. Technical Report UCB/EECS-2014-12, University of California, Berkeley (Technical Report).
24.
Zurück zum Zitat Zhang, T., et al. (1996). BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Record (Vol. 25, Issue. 2, pp. 103–114). ACM.CrossRef Zhang, T., et al. (1996). BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Record (Vol. 25, Issue. 2, pp. 103–114). ACM.CrossRef
25.
Zurück zum Zitat Zhou, et al. (2000). Approaches for scaling DBSCAN algorithm to large spatial databases. Journal of Computer Science and Technology, 15(6), 509–526.CrossRefMATH Zhou, et al. (2000). Approaches for scaling DBSCAN algorithm to large spatial databases. Journal of Computer Science and Technology, 15(6), 509–526.CrossRefMATH
Metadaten
Titel
A Fast DBSCAN Algorithm with Spark Implementation
verfasst von
Dianwei Han
Ankit Agrawal
Wei-keng Liao
Alok Choudhary
Copyright-Jahr
2018
Verlag
Springer Singapore
DOI
https://doi.org/10.1007/978-981-10-8476-8_9

Premium Partner