Skip to main content
Erschienen in: The Journal of Supercomputing 6/2021

26.11.2020

K-DBSCAN: An improved DBSCAN algorithm for big data

verfasst von: Nahid Gholizadeh, Hamid Saadatfar, Nooshin Hanafi

Erschienen in: The Journal of Supercomputing | Ausgabe 6/2021

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Big data storage and processing are among the most important challenges now. Among data mining algorithms, DBSCAN is a common clustering method. One of the most important drawbacks of this algorithm is its low execution speed. This study aims to accelerate the DBSCAN execution speed so that the algorithm can respond to big datasets in an acceptable period of time. To overcome the problem, an initial grouping was applied to the data in this article through the K-means++ algorithm. DBSCAN was then employed to perform clustering in each group separately. As a result, the computational burden of DBSCAN execution reduced and the clustering execution speed increased significantly. Finally, border clusters were merged if necessary. According to the results of executing the proposed algorithm, it managed to greatly reduce the DBSCAN execution time (98% in the best-case scenario) with no significant changes in the qualitative evaluation criteria for clustering.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Literatur
1.
Zurück zum Zitat Storey V, Song I (2017) Big data technologies and management: What conceptual modeling can do. Data KnowlEng 108:50–67CrossRef Storey V, Song I (2017) Big data technologies and management: What conceptual modeling can do. Data KnowlEng 108:50–67CrossRef
2.
Zurück zum Zitat Ianni M, Masciari E, Mazzeo G, Zaniolo C (2018) Efficient big data clustering. In: 22nd International Database Engineering & Applications Symposium, pp 103–109. ACM Ianni M, Masciari E, Mazzeo G, Zaniolo C (2018) Efficient big data clustering. In: 22nd International Database Engineering & Applications Symposium, pp 103–109. ACM
3.
Zurück zum Zitat Arora P, Deepali D, Varshney S (2016) Analysis of K-Means and K-Medoids algorithm for big data. ProcediaCompuSci 78:507–512 Arora P, Deepali D, Varshney S (2016) Analysis of K-Means and K-Medoids algorithm for big data. ProcediaCompuSci 78:507–512
4.
Zurück zum Zitat Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM ComputSurv 31(3):264–323 Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM ComputSurv 31(3):264–323
5.
Zurück zum Zitat Zhu J, Zeng M, Huang J, Liao S, Cai C, Zheng L (2020) Vehicle re-identification using quadruple directional deep learning features. IEEE Trans IntellTranspSyst 21(1):410–420 Zhu J, Zeng M, Huang J, Liao S, Cai C, Zheng L (2020) Vehicle re-identification using quadruple directional deep learning features. IEEE Trans IntellTranspSyst 21(1):410–420
6.
Zurück zum Zitat Liu S, Liu M, Li P, Zhao J, Zhu Z, Wang X (2017) SAR image denoising via sparse representation in Shearlet domain based on continuous cycle spinning. IEEE Trans Geosci Remote Sens 55(5):2985–2992CrossRef Liu S, Liu M, Li P, Zhao J, Zhu Z, Wang X (2017) SAR image denoising via sparse representation in Shearlet domain based on continuous cycle spinning. IEEE Trans Geosci Remote Sens 55(5):2985–2992CrossRef
7.
Zurück zum Zitat Pei S, Shen T, Wang X, Gu C, Ning Z, Ye X, Xiong N (2020) 3DACN: 3D augmented convolutional network for time series data. InfSci 513:17–29 Pei S, Shen T, Wang X, Gu C, Ning Z, Ye X, Xiong N (2020) 3DACN: 3D augmented convolutional network for time series data. InfSci 513:17–29
8.
Zurück zum Zitat Qiao S, Li T, Li H, Peng J, Chen H (2012) A new blockmodeling based hierarchical clustering algorithm for web social networks. EngApplArtifIntell 25(3):640–647 Qiao S, Li T, Li H, Peng J, Chen H (2012) A new blockmodeling based hierarchical clustering algorithm for web social networks. EngApplArtifIntell 25(3):640–647
9.
Zurück zum Zitat Che D, Safran M, Peng Z (2013) From big data to big data mining: challenges, issues, and opportunities. In: International Conference on Database Systems for Advanced Applications, pp 1–15. Springer Che D, Safran M, Peng Z (2013) From big data to big data mining: challenges, issues, and opportunities. In: International Conference on Database Systems for Advanced Applications, pp 1–15. Springer
10.
Zurück zum Zitat Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: 2nd International Conference on Knowledge Discovery and Data Mining, pp 226–231. Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: 2nd International Conference on Knowledge Discovery and Data Mining, pp 226–231.
11.
Zurück zum Zitat David A, Sergei V (2007) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp 1027–1035. ACM David A, Sergei V (2007) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp 1027–1035. ACM
12.
Zurück zum Zitat Katal A, Wazid M, and Goudar R (2013) Big data: Issues, challenges, tools and good practices. In: 2013 6th International Conference on Contemporary Computing, pp 404–409. IEEE Katal A, Wazid M, and Goudar R (2013) Big data: Issues, challenges, tools and good practices. In: 2013 6th International Conference on Contemporary Computing, pp 404–409. IEEE
13.
Zurück zum Zitat Shahrivari S (2014) Beyond batch processing: Towards real-time and streaming big data. Computers 3(4):117–129CrossRef Shahrivari S (2014) Beyond batch processing: Towards real-time and streaming big data. Computers 3(4):117–129CrossRef
14.
Zurück zum Zitat Chen M, Mao S, Liu Y (2014) Big data: A survey. Mob NetwAppl 19(2):171–209CrossRef Chen M, Mao S, Liu Y (2014) Big data: A survey. Mob NetwAppl 19(2):171–209CrossRef
15.
Zurück zum Zitat Shirkhorshidi A, Aghabozorgi S, Wah T, Herawan T (2014) Big data clustering: a review. In: International Conference on Computational Science and its Applications, pp 707–720. Springer Shirkhorshidi A, Aghabozorgi S, Wah T, Herawan T (2014) Big data clustering: a review. In: International Conference on Computational Science and its Applications, pp 707–720. Springer
16.
Zurück zum Zitat LIU B (2006) A fast density-based clustering algorithm for large databases. In: 2006 International Conference on Machine Learning and Cybernetics, pp 996–1000. IEEE LIU B (2006) A fast density-based clustering algorithm for large databases. In: 2006 International Conference on Machine Learning and Cybernetics, pp 996–1000. IEEE
17.
Zurück zum Zitat Wu Y, Guo J, ZHANG X (2007) A linear DBSCAN algorithm based on LSH. In: 2007 International Conference on Machine Learning and Cybernetics, Vol 5, pp 2608–2614. IEEE Wu Y, Guo J, ZHANG X (2007) A linear DBSCAN algorithm based on LSH. In: 2007 International Conference on Machine Learning and Cybernetics, Vol 5, pp 2608–2614. IEEE
18.
Zurück zum Zitat Dogan Y, Birant D, Kut A (2013) SOM++: Integration of self-organizing map and K-Means++ algorithms. In: International Conference on Machine Learning and Data Mining in Pattern Recognition, pp 246–259. Springer Dogan Y, Birant D, Kut A (2013) SOM++: Integration of self-organizing map and K-Means++ algorithms. In: International Conference on Machine Learning and Data Mining in Pattern Recognition, pp 246–259. Springer
19.
Zurück zum Zitat Bakr A, Ghanem N, Ismail M (2015) Efficient incremental density-based algorithm for clustering large datasets. Alex Eng J 54(4):1147–1154CrossRef Bakr A, Ghanem N, Ismail M (2015) Efficient incremental density-based algorithm for clustering large datasets. Alex Eng J 54(4):1147–1154CrossRef
20.
Zurück zum Zitat Xu T, Chiang H, Liu G, Tan C (2015) Hierarchical K-means method for clustering large-scale advanced metering infrastructure data. IEEE Trans Power Delivery 32(2):609–616CrossRef Xu T, Chiang H, Liu G, Tan C (2015) Hierarchical K-means method for clustering large-scale advanced metering infrastructure data. IEEE Trans Power Delivery 32(2):609–616CrossRef
21.
Zurück zum Zitat Ismkhan H (2018) I-k-means++: An iterative clustering algorithm based on an enhanced version of the k-means. PattRecogn 79:402–413 Ismkhan H (2018) I-k-means++: An iterative clustering algorithm based on an enhanced version of the k-means. PattRecogn 79:402–413
22.
Zurück zum Zitat Brown D, Japa A, Shi Y (2019) A fast density-grid based clustering method Daniel Brown. In: 2019 IEEE 9th Annual Computing and Communication Workshop and Conference, pp 0048–0054. IEEE Brown D, Japa A, Shi Y (2019) A fast density-grid based clustering method Daniel Brown. In: 2019 IEEE 9th Annual Computing and Communication Workshop and Conference, pp 0048–0054. IEEE
23.
Zurück zum Zitat Mathur V, Mehta J, Singh S (2019) "HCA-DBSCAN: HyperCube accelerated density based spatial clustering for applications with noise," in 33rd Conference on Neural Information Processing Systems (arXiv preprint). Mathur V, Mehta J, Singh S (2019) "HCA-DBSCAN: HyperCube accelerated density based spatial clustering for applications with noise," in 33rd Conference on Neural Information Processing Systems (arXiv preprint).
24.
Zurück zum Zitat Luchi D, Rodrigues A, Varejao F (2019) Sampling approaches for applying DBSCAN to large datasets. Pattern RecognLett 117:90–96CrossRef Luchi D, Rodrigues A, Varejao F (2019) Sampling approaches for applying DBSCAN to large datasets. Pattern RecognLett 117:90–96CrossRef
25.
Zurück zum Zitat Chen Y, Zhou L, Pei S, Yu Z, Chen Y, Liu X, Du J, Xiong N (2019) KNN-BLOCK DBSCAN Fast Clustering for Large-Scale Data. IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp 1–15. Chen Y, Zhou L, Pei S, Yu Z, Chen Y, Liu X, Du J, Xiong N (2019) KNN-BLOCK DBSCAN Fast Clustering for Large-Scale Data. IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp 1–15.
26.
Zurück zum Zitat Chen Y, Zhou L, Bouguila N, Wang C, Chen Y, Du J (2020) BLOCK-DBSCAN Fast clustering for large scale data. PattRecogn 109:107627 Chen Y, Zhou L, Bouguila N, Wang C, Chen Y, Du J (2020) BLOCK-DBSCAN Fast clustering for large scale data. PattRecogn 109:107627
27.
Zurück zum Zitat Cui X, Zhu P, Yang X, Li K, Ji C (2014) Optimized big data K-means clustering using MapReduce. J Supercompu 70(3):1249–1259CrossRef Cui X, Zhu P, Yang X, Li K, Ji C (2014) Optimized big data K-means clustering using MapReduce. J Supercompu 70(3):1249–1259CrossRef
28.
Zurück zum Zitat Sinha A, Jana P (2016) A novel K-means based clustering algorithm for big data. In: Conference on Advances in Computing, Communications and Informatics, pp 1875–1879. IEEE Sinha A, Jana P (2016) A novel K-means based clustering algorithm for big data. In: Conference on Advances in Computing, Communications and Informatics, pp 1875–1879. IEEE
29.
Zurück zum Zitat Song H, Lee J (2018) RP-DBSCAN: A superfast parallel DBSCAN algorithm based on random partitioning. In: 2018 International Conference on Management of Data, pp 1173–1187. Song H, Lee J (2018) RP-DBSCAN: A superfast parallel DBSCAN algorithm based on random partitioning. In: 2018 International Conference on Management of Data, pp 1173–1187.
30.
Zurück zum Zitat Li S (2020) An Improved DBSCAN Algorithm Based on the Neighbor Similarity and Fast Nearest Neighbor Query. IEEE Access 8:47468–47476CrossRef Li S (2020) An Improved DBSCAN Algorithm Based on the Neighbor Similarity and Fast Nearest Neighbor Query. IEEE Access 8:47468–47476CrossRef
31.
Zurück zum Zitat José-García A, Gómez-Flores W (2016) Automatic clustering using nature-inspired metaheuristics: A survey. Appl Soft Compu 41:192–213CrossRef José-García A, Gómez-Flores W (2016) Automatic clustering using nature-inspired metaheuristics: A survey. Appl Soft Compu 41:192–213CrossRef
32.
Zurück zum Zitat Davies D, Bouldin D (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1(2):224–227CrossRef Davies D, Bouldin D (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1(2):224–227CrossRef
33.
Zurück zum Zitat Rousseeuw P (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J ComputAppl Math 20:53–65CrossRef Rousseeuw P (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J ComputAppl Math 20:53–65CrossRef
Metadaten
Titel
K-DBSCAN: An improved DBSCAN algorithm for big data
verfasst von
Nahid Gholizadeh
Hamid Saadatfar
Nooshin Hanafi
Publikationsdatum
26.11.2020
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 6/2021
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-020-03524-3

Weitere Artikel der Ausgabe 6/2021

The Journal of Supercomputing 6/2021 Zur Ausgabe