Skip to main content
Top

2016 | OriginalPaper | Chapter

A Framework for Clustering and Classification of Big Data Using Spark

Authors : Xristos Mallios, Vasilis Vassalos, Tassos Venetis, Akrivi Vlachou

Published in: On the Move to Meaningful Internet Systems: OTM 2016 Conferences

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Nowadays, massive data sets are generated in many modern applications ranging from economics to bioinformatics, and from social networks to scientific databases. Typically, such data need to be processed by machine learning algorithms, which entails high processing cost and usually requires the execution of iterative algorithms. Spark has been recently proposed as a framework that supports iterative algorithms over massive data efficiently. In this paper, we design a framework for clustering and classification of big data suitable for Spark. Our framework supports different restrictions on the data exchange model that are applicable in different settings. We integrate k-means and ID3 algorithms in our framework, leading to interesting variants of our algorithms that apply to the different restrictions on the data exchange model. We implemented our algorithms over the open-source computing framework Spark and evaluated our approach in a cluster of 37-nodes, thus demonstrating the scalability of our techniques. Our experimental results show that we outperform the algorithm provided by Spark for k-means up to 31 %, while the centralized k-means is at least one order of magnitude worse.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Balcan, M.-F.F., Ehrlich, S., Liang, Y.: Distributed k-means and k-median clustering on general topologies. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, k. (eds.) Advances in Neural Information Processing Systems 26, pp. 1995–2003. Curran Associates Inc. (2013) Balcan, M.-F.F., Ehrlich, S., Liang, Y.: Distributed k-means and k-median clustering on general topologies. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, k. (eds.) Advances in Neural Information Processing Systems 26, pp. 1995–2003. Curran Associates Inc. (2013)
2.
go back to reference Bradley, P.S., Mangasarian, O.L., Street, W.N.: Clustering via concave minimization. In: Advances in Neural Information Processing Systems, pp. 368–374 (1997) Bradley, P.S., Mangasarian, O.L., Street, W.N.: Clustering via concave minimization. In: Advances in Neural Information Processing Systems, pp. 368–374 (1997)
3.
go back to reference Cheung, Y.-M.: k-means: a new generalized k-means clustering algorithm. Pattern Recogn. Lett. 24(15), 2883–2893 (2003)CrossRefMATH Cheung, Y.-M.: k-means: a new generalized k-means clustering algorithm. Pattern Recogn. Lett. 24(15), 2883–2893 (2003)CrossRefMATH
4.
go back to reference Datta, S., Giannella, C., Kargupta, H.: K-means clustering over a large, dynamic network. In: SDM, pp. 153–164 (2006) Datta, S., Giannella, C., Kargupta, H.: K-means clustering over a large, dynamic network. In: SDM, pp. 153–164 (2006)
5.
go back to reference Ferreira Cordeiro, R.L., Traina Jr., C., Machado Traina, A.J., López, J., Kang, U., Faloutsos, C.: Clustering very large multi-dimensional datasets with mapreduce. In: SIGKDD, pp. 690–698 (2011) Ferreira Cordeiro, R.L., Traina Jr., C., Machado Traina, A.J., López, J., Kang, U., Faloutsos, C.: Clustering very large multi-dimensional datasets with mapreduce. In: SIGKDD, pp. 690–698 (2011)
6.
go back to reference Fisher, D.H., McKusick, K.B.: An empirical comparison of id3 and back-propagation. In: IJCAI, pp. 788–793 (1989) Fisher, D.H., McKusick, K.B.: An empirical comparison of id3 and back-propagation. In: IJCAI, pp. 788–793 (1989)
7.
go back to reference Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: theory and practice. IEEE Trans. Knowl. Data Eng. 15(3), 515–528 (2003)CrossRef Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: theory and practice. IEEE Trans. Knowl. Data Eng. 15(3), 515–528 (2003)CrossRef
8.
go back to reference Guha, S., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams. In: Foundations of Computer Science, pp. 359–366 (2000) Guha, S., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams. In: Foundations of Computer Science, pp. 359–366 (2000)
9.
go back to reference Jasso-Luna, O., Sosa-Sosa, V., Lopez-Arevalo, I.: An approach to building a distributed id3 classifier. In: DCAI, pp. 385–394 (2009) Jasso-Luna, O., Sosa-Sosa, V., Lopez-Arevalo, I.: An approach to building a distributed id3 classifier. In: DCAI, pp. 385–394 (2009)
10.
go back to reference Jin, R., Goswami, A., Agrawal, G.: Fast and exact out-of-core and distributed k-means clustering. Knowl. Inf. Syst. 10(1), 17–40 (2006)CrossRef Jin, R., Goswami, A., Agrawal, G.: Fast and exact out-of-core and distributed k-means clustering. Knowl. Inf. Syst. 10(1), 17–40 (2006)CrossRef
11.
go back to reference Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: analysis and implementation. Pattern Anal. Mach. Intell. 24(7), 881–892 (2002)CrossRefMATH Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: analysis and implementation. Pattern Anal. Mach. Intell. 24(7), 881–892 (2002)CrossRefMATH
12.
go back to reference MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. Berkeley Symposium on Mathematical Statistics and Probability 1, 281–297 (1967)MathSciNetMATH MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. Berkeley Symposium on Mathematical Statistics and Probability 1, 281–297 (1967)MathSciNetMATH
13.
go back to reference Poteras, C.M., Mihaescu, M.C., Mocanu, M.: An optimized version of the k-means clustering algorithm. In: Computer Science and Information Systems, pp. 695–699 (2014) Poteras, C.M., Mihaescu, M.C., Mocanu, M.: An optimized version of the k-means clustering algorithm. In: Computer Science and Information Systems, pp. 695–699 (2014)
14.
go back to reference Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986) Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
15.
go back to reference Shi, J., Qiu, Y., Minhas, U.F., Jiao, L., Wang, C., Reinwald, B., Özcan, F.: Clash of the titans: Mapreduce vs. spark for large scale data analytics. PVLDB 8(13), 2110–2121 (2015) Shi, J., Qiu, Y., Minhas, U.F., Jiao, L., Wang, C., Reinwald, B., Özcan, F.: Clash of the titans: Mapreduce vs. spark for large scale data analytics. PVLDB 8(13), 2110–2121 (2015)
16.
go back to reference Vaidya, J., Clifton, C.: Privacy-preserving k-means clustering over vertically partitioned data. In: SIGKDD, pp. 206–215 (2003) Vaidya, J., Clifton, C.: Privacy-preserving k-means clustering over vertically partitioned data. In: SIGKDD, pp. 206–215 (2003)
17.
go back to reference Xiao, M.-J., Huang, L.-S., long Luo, Y., Shen, H.: Privacy preserving id3 algorithm over horizontally partitioned data. In: PDCAT, pp. 239–243 (2005) Xiao, M.-J., Huang, L.-S., long Luo, Y., Shen, H.: Privacy preserving id3 algorithm over horizontally partitioned data. In: PDCAT, pp. 239–243 (2005)
18.
go back to reference Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: Cluster computing with working sets. In: USENIX Conference on Hot Topics in Cloud Computing, p. 10 (2010) Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: Cluster computing with working sets. In: USENIX Conference on Hot Topics in Cloud Computing, p. 10 (2010)
19.
go back to reference Zhang, J., Wu, G., Hu, X., Li, S., Hao, S.: A parallel k-means clustering algorithm with mpi. In: Parallel Architectures, Algorithms and Programming, pp. 60–64 (2011) Zhang, J., Wu, G., Hu, X., Li, S., Hao, S.: A parallel k-means clustering algorithm with mpi. In: Parallel Architectures, Algorithms and Programming, pp. 60–64 (2011)
Metadata
Title
A Framework for Clustering and Classification of Big Data Using Spark
Authors
Xristos Mallios
Vasilis Vassalos
Tassos Venetis
Akrivi Vlachou
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-48472-3_20

Premium Partner