2018 | OriginalPaper | Chapter

RT-DBSCAN: Real-Time Parallel Clustering of Spatio-Temporal Data Using Spark-Streaming

Authors: Yikai Gong, Richard O. Sinnott, Paul Rimba

Published in: Computational Science – ICCS 2018

Publisher: Springer International Publishing

Abstract

Clustering algorithms are essential for many big data applications involving point-based data, e.g. user-generated social media data from platforms such as Twitter. One of the most common clustering approaches is DBSCAN. However, DBSCAN has numerous limitations. The algorithm is based on traversing the whole dataset and identifying the neighbours around each point. This approach is not suitable when data is created and streamed in real time; a more dynamic approach is required. This paper presents a new approach, RT-DBSCAN, that supports real-time clustering of data based on continuous cluster checkpointing. This approach overcomes many of the issues of existing clustering algorithms such as DBSCAN. The platform is realised using Apache Spark running over large-scale Cloud resources and container-based technologies to support scaling. We benchmark the work using streamed social media content (Twitter) and show the advantages in performance and flexibility of RT-DBSCAN over other clustering approaches.
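
To illustrate why classic DBSCAN fits poorly with streamed data, the following is a minimal sketch of the standard batch algorithm in Python. It is not the RT-DBSCAN method of the paper. The function names, the `eps` and `min_pts` parameters and the sample points are assumptions chosen for illustration. The sketch shows the two properties the abstract mentions: every point in the dataset is visited, and a neighbourhood query is issued around each point.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i).

    This scans the whole dataset, which is the cost that makes
    batch DBSCAN awkward for continuously arriving data.
    """
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Classic batch DBSCAN; returns one cluster label per point."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster_id
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # core point: keep expanding
        cluster_id += 1
    return labels


# Hypothetical sample: two dense groups and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(sample, eps=1.5, min_pts=3))
```

Because a newly arrived point can change the neighbourhood of existing points, rerunning this batch procedure on every arrival repeats the full traversal, which is the limitation that motivates a streaming approach.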

Metadata

Title: RT-DBSCAN: Real-Time Parallel Clustering of Spatio-Temporal Data Using Spark-Streaming
Authors: Yikai Gong, Richard O. Sinnott, Paul Rimba
Copyright Year: 2018
DOI: https://doi.org/10.1007/978-3-319-93698-7_40