nach oben

Erschienen in:

2018 | OriginalPaper | Buchkapitel

A Fast DBSCAN Algorithm with Spark Implementation

verfasst von : Dianwei Han, Ankit Agrawal, Wei-keng Liao, Alok Choudhary

Erschienen in: Big Data in Engineering Applications

Verlag: Springer Singapore

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

DBSCAN is a well-known clustering algorithm which is based on density and is able to identify arbitrary shaped clusters and eliminate noise data. Parallelization of DBSCAN is a challenging work because there is an inherent sequential data access order and based on MPI or OpenMP environments, there exist the issues of lack of fault-tolerance and there is no guarantee that workload is balanced. Moreover, programming with MPI requires data scientists to handle communication between nodes which is a big challenge. We present a new parallel DBSCAN algorithm using Spark. kd-tree technique is applied in our algorithm to reduce search time. More specifically, a novel merge approach is used so that no communication between executors is required while partial clusters are generated. Appropriate and efficient data structures are carefully used in our study: Using Queue to contain neighbors of the data point, and using Hashtable when checking the status of and processing the data points. Also other advanced data structures from Spark are applied to make our implementation more effective. We implement the algorithm in Java and evaluate its scalability by using different number of processing cores. Our experiments demonstrate that the algorithm we propose scales up very well. Using data sets containing up to 1 million high-dimensional points, we show that our proposed algorithm achieves speedups up to 6 using 8 cores (10 k), 10 using 32 cores (100 k), and 137 using 512 cores (1 m). Another experiment using 10 k data points is conducted and the result shows that the algorithm with MapReduce achieves speedups to 1.3 using 2 cores, 2.0 using 4 cores, and 3.2 using 8 cores.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Vorheriges Kapitel Facing Up to Nomophobia: A Systematic Review of Mobile Phone Apps that Reduce Smartphone Usage

Nächstes Kapitel Understanding How Big Data Leads to Social Networking Vulnerability

Agrawal, R., & Srikant, R. (1994). Quest synthetic data generator, IBM Almaden Research Center.

Beckmann, N., et al. (1990). The r*-tree: An efficient and robust access method for points and rectangles. In: Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (Vol. 19, no. 2, pp. 323–331).

Bentley, J. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9), 509–517.CrossRefMATH

Brecheisen, S., et al. (2006). Parallel density-based clustering of complex objects. Advances in Knowledge Discovery and Data Mining, pp. 179–188.

DOE Office of Science (2015, September 17). Edison Configuration (Online). https://www.nersc.gov/users/computational-systems/edison/configuration/.

Ester, M., et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (Vol. 1996, pp. 226–231). AAAI Press.

Fu, Y., et al. (2011). Research on parallel DBSCAN algorithm design based on mapreduce. Advanced Materials Research 301, 1133–1138.CrossRef

Han, J., et al. (2011). Data mining: Concepts and Techniques. Morgan Kaufmann.

He, Y., et al. (2014). MR-DBSCAN: A scalable mapreduce-based DBSCAN algorithm for heavily skewed data. Frontiers of Computer Science, 8(1), 83–99.MathSciNetCrossRef

10.

Kakde, H. M. (2005, August 25). Range Searching using Kd Tree (Online). http://www.cs.utah.edu/lifeifei/cs6931/kdtree.pdf.

11.

Kang, S. J., et al. (2015). Performance comparison of OpenMP, MPI, and MapReduce in practical problems. Advances in Multimedia 2015.CrossRef

12.

Karau, H., et al. (2015). Learning Spark: Lightning-fast Data Analysis. O’Reilly Media.

13.

MacQueen, J., et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 281–297). USA.

14.

Noticewala, M., & Vaghela, D. (2014). MR-IDBSCAN: Efficient parallel incremental DBSCAN algorithm using mapreduce. International Journal of Computer Applications 93(4), 13–17.CrossRef

15.

Patwary, M. M. A., et al. (2012). A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2012, pp. 62:1–62:11. IEEE Computer Society Press.

16.

Pisharath, J., et al. (2010). NU-MineBench 3.0. Technical Report CUCIS-2005-08-01, Northwestern University (Technical Report).

17.

Sakr, S., & Gaber, M. M. (2014). Large Scale and Big Data: Processing and Management. CRC Press.

18.

Spark, A. (2015). Spark Programming Guide (Online). http://spark.apache.org/docs/latest/programming-guide.html.

19.

Sheikholeslami, G., et al. (2000). WaveCluster: A wavelet based clustering approach for spatial data in very large databases. The VLDB Journal, 8(3), 289–304.CrossRef

20.

Tan, P., et al. (2005). Introduction to Data Mining. Pearson.

21.

White, T. (2011). Hadoop: The Definitive Guide. O’Reilly Media.

22.

Zaharia, M., et al. (2012) Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (pp. 2–2). USENIX Association.

23.

Zaharia, M. (2014). An Architecture for Fast and General Data Processing on Large Clusters. Technical Report UCB/EECS-2014-12, University of California, Berkeley (Technical Report).

24.

Zhang, T., et al. (1996). BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Record (Vol. 25, Issue. 2, pp. 103–114). ACM.CrossRef

25.

Zhou, et al. (2000). Approaches for scaling DBSCAN algorithm to large spatial databases. Journal of Computer Science and Technology, 15(6), 509–526.CrossRefMATH

Titel: A Fast DBSCAN Algorithm with Spark Implementation
verfasst von: Dianwei Han
Ankit Agrawal
Wei-keng Liao
Alok Choudhary
Verlag: Springer Singapore
Buch: Big Data in Engineering Applications
Print ISBN: 978-981-10-8475-1

Electronic ISBN: 978-981-10-8476-8

Copyright-Jahr: 2018
DOI: https://doi.org/10.1007/978-981-10-8476-8_9

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Premium Partner