Published in: Computing 10/2021

15.06.2021 | Regular Paper

Screening hardware and volume factors in distributed machine learning algorithms on spark

A design of experiments (DoE) based approach

Authors: Jairson B. Rodrigues, Germano C. Vasconcelos, Paulo R. M. Maciel

Abstract

This paper presents an approach to investigate distributed machine learning workloads on Spark. The work analyzes hardware and data volume factors with respect to time and cost performance when applying machine learning (ML) techniques in big data scenarios. The method is based on the Design of Experiments (DoE) approach and applies a randomized two-level fractional factorial design with replications to screen the most relevant factors. A Web corpus was built from 16 million webpages from Portuguese-speaking countries. The application was a binary text classification task that distinguishes Brazilian Portuguese from other variations. Five machine learning algorithms were examined: Logistic Regression, Random Forest, Support Vector Machines, Naïve Bayes, and Multilayer Perceptron. The data was processed on real clusters of up to 28 nodes, each composed of 12 or 32 cores, 1 or 7 SSD disks, and 3x or 6x RAM per core, totaling a maximum computational power of 896 cores and 5.25 TB of RAM. Linear models were applied to identify, analyze, and rank the influence of factors. A total of 240 experiments were carefully organized to maximize the detection of non-confounded effects up to the second order while minimizing experimental effort. Our results include linear models to estimate time and cost performance, statistical inferences about effects, and a visualization tool based on parallel coordinates to aid decision making about cluster configuration.
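
To make the screening design mentioned in the abstract more concrete, the following is a minimal sketch of a randomized, replicated two-level half-fraction design. Everything specific in it is an assumption for illustration: the factor names, the choice of five factors, the generator (last factor equal to the product of the others) and the three replications are not stated in the abstract. One decomposition that happens to match the reported total is 16 design points x 3 replications x 5 algorithms = 240, but the abstract does not confirm it.

```python
import itertools
import random

# Hypothetical factor names; the abstract does not list the factors explicitly.
FACTORS = ["nodes", "cores", "disks", "ram", "volume"]


def fractional_factorial(k=5):
    """Return the 2^(k-1) half fraction of a two-level design.

    The last factor is set to the product of the first k-1 factors,
    which for k = 5 gives the defining relation I = ABCDE.
    """
    runs = []
    for base in itertools.product((-1, 1), repeat=k - 1):
        last = 1
        for level in base:
            last *= level
        runs.append(base + (last,))
    return runs


def randomized_plan(runs, replications=3, seed=0):
    """Replicate every design point and shuffle the execution order."""
    plan = [run for run in runs for _ in range(replications)]
    random.Random(seed).shuffle(plan)
    return plan


design = fractional_factorial(len(FACTORS))
plan = randomized_plan(design, replications=3)
print(len(design), "design points,", len(plan), "runs per algorithm")
```

With five factors this half fraction has resolution V, so main effects and two-factor interactions are not aliased with one another, which is one way to obtain unconfounded effects up to the second order with half the runs of a full factorial.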

Footnotes
1
An experimental unit refers to the complete execution of the ML technique from start to finish, covering the training, test, and evaluation tasks.
 
2
Created only to process the workload and destroyed after execution to minimize interference.
 
3
If \(\lambda \approx -1 \rightarrow y^* = 1/y\); \(\lambda \approx -0.5 \rightarrow y^* = 1/\sqrt{y}\); \(\lambda \approx 0 \rightarrow y^* = \log(y)\); \(\lambda \approx 0.5 \rightarrow y^* = \sqrt{y}\); and \(\lambda \approx 1 \rightarrow\) no transformation.
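
The rule in this footnote can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function names are hypothetical, and snapping the estimated \(\lambda\) to the nearest of the five listed values is an assumption about how "\(\approx\)" is resolved.

```python
import math

# Convenient lambda values listed in the footnote.
CANDIDATES = (-1.0, -0.5, 0.0, 0.5, 1.0)


def pick_lambda(lam):
    """Snap an estimated Box-Cox lambda to the nearest listed value (assumed rule)."""
    return min(CANDIDATES, key=lambda c: abs(c - lam))


def transform(y, lam):
    """Apply the simplified transformation to a strictly positive response y."""
    if y <= 0:
        raise ValueError("the transformation requires y > 0")
    snapped = pick_lambda(lam)
    if snapped == -1.0:
        return 1.0 / y
    if snapped == -0.5:
        return 1.0 / math.sqrt(y)
    if snapped == 0.0:
        return math.log(y)
    if snapped == 0.5:
        return math.sqrt(y)
    return y  # lambda close to 1: no transformation
```

For example, `transform(4.0, 0.45)` takes the square-root branch, and `transform(4.0, -0.9)` takes the reciprocal branch.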
 
4
Based on the R language and the Plotly library [62] for web-based data visualization.
 
References
2.
Pospelova M (2015) Real time autotuning for mapreduce on hadoop/yarn. Ph.D. thesis, Carleton University Ottawa
3.
Piatetsky-Shapiro G (1991) Knowledge discovery in real databases: a report on the IJCAI-89 workshop. AI Magazine 11(5):68
4.
Cox M, Ellsworth D (1997) Managing big data for scientific visualization. ACM Siggraph 97:146–162
5.
Luvizan S, Meirelles F, Diniz EH (2014) Big Data: publication evolution and research opportunities. In: Anais da 11a Conferência Internacional sobre Sistemas de Informação e Gestão de Tecnologia. São Paulo, SP
7.
Ghemawat S, Gobioff H, Leung ST (2003) The Google file system. In: Proceedings of the nineteenth ACM symposium on operating systems principles, pp. 29–43
8.
Dean J, Ghemawat S (2004) In: Proceedings of the 6th conference on symposium on operating systems design & implementation, volume 6 (USENIX Association), pp. 10–10
10.
Li K, Deolalikar V, Pradhan N (2015) Big data gathering and mining pipelines for CRM using open-source. In: IEEE International conference on big data (Big Data) (IEEE), pp. 2936–2938
11.
Dharsandiya AN, Patel MR (2016) A review on Frequent Itemset Mining algorithms in social network data. In: International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET) (IEEE), pp. 1046–1048
12.
Maillo J, Ramírez S, Triguero I, Herrera F (2017) kNN-IS: an iterative Spark-based design of the k-nearest neighbors classifier for big data. Knowl-Based Syst 117:3
14.
Baldacci L, Golfarelli M (2018) A cost model for Spark SQL. IEEE Trans Knowl Data Eng 31(5):819
15.
Munir RF, Abelló A (2019) Automatically configuring parallelism for hybrid layouts. European conference on advances in databases and information systems. Springer, Cham, pp 120–125
16.
Borthakur D (2007) The hadoop distributed file system: architecture and design. Hadoop Project Website 11(2007):21
17.
Iqbal MH, Soomro TR (2015) Big data analysis: apache storm perspective. Int J Comput Trends Technol 19(1):9–14
20.
Marsland S (2014) Machine learning: an algorithmic perspective. CRC Press, Boca Raton
21.
Fisher RA, Wishart J (1945) The arrangement of field experiments and the statistical reduction of the results. 10 (HM Stationery Office)
23.
25.
Herodotou H, Lim H, Luo G, Borisov N, Dong L, Cetin FB, Babu S (2011) Starfish: a self-tuning system for big data analytics. CIDR 11:261–272
27.
Ardagna D, Bernardi S, Gianniti E, Aliabadi SK, Perez-Palacin D, Requeno JI (2016) Modeling performance of hadoop applications: a journey from queueing networks to stochastic well formed nets. International conference on algorithms and architectures for parallel processing. Springer, Cham, pp 599–613
28.
Sidhanta S, Golab W, Mukhopadhyay S (2016) Optex: a deadline-aware cost optimization model for spark. In: 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (IEEE), pp. 193–202
29.
Venkataraman S, Yang Z, Franklin M, Recht B, Stoica I (2016) Ernest: efficient performance prediction for large-scale advanced analytics. In: NSDI'16 Proceedings of the 13th USENIX conference on networked systems design and implementation, pp. 363–378
32.
Wineberg M, Christensen S (2004) An introduction to statistics for EC experimental analysis. Tutorial at the IEEE congress on evolutionary computation
33.
Rathod M, Suthar D, Patel H, Shelat P, Parejiya P (2019) Microemulsion based nasal spray: a systemic approach for non-CNS drug, its optimization, characterization and statistical modelling using QbD principles. J Drug Deliv Sci Technol 49:286
34.
Kuo CC, Liu HA, Chang CM (2020) Optimization of vacuum casting process parameters to enhance tensile strength of components using design of experiments approach. Int J Adv Manuf Technol 106(9):3775–3785
35.
Amin MM, Kiani A (2020) Multi-disciplinary analysis of a strip stabilizer using body-fluid-structure interaction simulation and design of experiments (DOE). J Appl Fluid Mech 13(1):261
36.
Packianather M, Drake P, Rowlands H (2000) Optimizing the parameters of multilayered feedforward neural networks through Taguchi design of experiments. Qual Reliab Eng Int 16(6):461
37.
Staelin C (2003) Parameter selection for support vector machines, Hewlett-Packard Company, Tech. Rep. HPL-2002-354R1 1
38.
Bates S, Sienz J, Toropov V (2004) Formulation of the optimal Latin hypercube design of experiments using a permutation genetic algorithm. In: 45th AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics & materials conference, p. 2011
39.
Ridge E (2007) Design of experiments for the tuning of optimisation algorithms (Citeseer)
40.
Balestrassi PP, Popova E, Paiva Ad, Lima JM (2009) Design of experiments on neural network's training for nonlinear time series forecasting. Neurocomputing 72(4–6):1160
41.
Pais MS, Peretta IS, Yamanaka K, Pinto ER (2014) Factorial design analysis applied to the performance of parallel evolutionary algorithms. J Brazil Comput Soc 20(1):6
43.
Laney D (2001) 3D data management: controlling data volume, velocity and variety. META Group Res Note 6(70):1
45.
Huai Y, Lee R, Zhang S, Xia CH, Zhang X (2011) In: Proceedings of the 2nd ACM symposium on cloud computing (ACM), p. 4
48.
Shi J, Qiu Y, Minhas UF, Jiao L, Wang C, Reinwald B (2015) Clash of the titans: Mapreduce vs. spark for large scale data analytics. Proc VLDB Endowment 8(13):2110–2121
49.
Landset S, Khoshgoftaar TM, Richter AN, Hasanin T (2015) A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J Big Data 2(1):24
50.
51.
Jain R (1990) The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. Wiley & Sons, Hoboken
52.
Montgomery D (2017) Design and analysis of experiments. John Wiley & Sons, Hoboken
53.
Montgomery DC (2017) Design and analysis of experiments. John Wiley & Sons, Hoboken
54.
Hosmer DW Jr, Lemeshow S, Sturdivant RX (2013) Applied logistic regression, vol 398. John Wiley & Sons, Hoboken
55.
Genuer R, Poggi JM, Tuleau-Malot C, Villa-Vialaneix N (2017) Random forests for big data. Big Data Res 9:28
56.
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–97
57.
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev 65(6):386
58.
McCallum A, Nigam K et al (1998) A comparison of event models for naive bayes text classification. In: AAAI-98 workshop on learning for text categorization 752:41–48
60.
Wenzek G, Lachaux MA, Conneau A, Chaudhary V, Guzman F, Joulin A, Grave E (2019) Ccnet: Extracting high quality monolingual datasets from web crawl data, arXiv preprint arXiv:1911.00359
61.
Box GE, Cox DR (1964) An analysis of transformations. J Royal Stat Soc: Series B (Methodological) 26(2):211
62.
Sievert C (2020) Interactive web-based data visualization with R, plotly, and shiny. CRC Press, Boca Raton
64.
Inselberg A (2009) Parallel coordinates: visual multidimensional geometry and its applications, vol 20. Springer, Berlin
Metadata
Title
Screening hardware and volume factors in distributed machine learning algorithms on spark
A design of experiments (DoE) based approach
Authors
Jairson B. Rodrigues
Germano C. Vasconcelos
Paulo R. M. Maciel
Publication date
15.06.2021
Publisher
Springer Vienna
Published in
Computing / Issue 10/2021
Print ISSN: 0010-485X
Electronic ISSN: 1436-5057
DOI
https://doi.org/10.1007/s00607-021-00965-3
