Published in: Distributed and Parallel Databases 3/2019

8 June 2018

Scalable machine learning computing a data summarization matrix with a parallel array DBMS

Authors: Carlos Ordonez, Yiqun Zhang, S. Lennart Johnsson


Abstract

Big data analytics requires scalable (beyond RAM limits) and highly parallel (exploiting many CPU cores) processing of machine learning models, which in general involve heavy matrix manipulation. Array DBMSs are a promising platform for manipulating large matrices. With that motivation, we present a high-performance system that exploits a parallel array DBMS to evaluate a general, but compact, matrix summarization that benefits many machine learning models. We focus on two representative models: linear regression (supervised) and PCA (unsupervised). Our approach combines data summarization inside the parallel DBMS with further model computation in a mathematical language (e.g. R). We introduce a two-phase algorithm that first computes a general data summary in parallel and then evaluates matrix equations on reduced intermediate matrices in main memory on one node. We present theoretical results characterizing speedup and time/space complexity. From a parallel data system perspective, we consider scale-up and scale-out in a shared-nothing architecture. In contrast to most big data analytic systems, our system is based on array operators programmed in C++ that work directly on the Unix file system, instead of Java or Scala running on HDFS mounted on top of Unix, resulting in much faster processing. Experiments compare our system with Spark (parallel) and R (single machine), showing improvements of orders of magnitude in running time. We present parallel benchmarks varying the number of threads and processing nodes. Our two-phase approach should motivate analysts to exploit a parallel array DBMS for matrix summarization.
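The two-phase idea can be illustrated with a minimal sketch for linear regression. The abstract does not state the form of the summarization matrix, so the sketch assumes a summary of the form Γ = Σ z_i z_iᵀ with z_i = (1, x_i, y_i); that form, the function names, and the toy data are all assumptions made for illustration. Plain Python lists stand in for the per-node partitions of a parallel DBMS, and a small Gaussian elimination stands in for the model computation that the authors delegate to a mathematical language such as R.

```python
from functools import reduce


def partial_summary(rows):
    """Phase 1, per partition: accumulate the sum of z z^T, z = [1, x_1..x_d, y].

    Assumed summary form; each row is a tuple (x_1, ..., x_d, y).
    """
    k = len(rows[0]) + 1
    g = [[0.0] * k for _ in range(k)]
    for row in rows:
        z = [1.0, *row]
        for a in range(k):
            for b in range(k):
                g[a][b] += z[a] * z[b]
    return g


def merge(g1, g2):
    """Combine two partial summaries by element-wise addition."""
    return [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(g1, g2)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def linear_regression(gamma):
    """Phase 2, one node: solve the normal equations using only the summary."""
    d1 = len(gamma) - 1                      # intercept plus d features
    a = [row[:d1] for row in gamma[:d1]]     # block for [1, x] [1, x]^T
    b = [row[d1] for row in gamma[:d1]]      # block for [1, x] y
    return solve(a, b)


# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2, split over 3 "nodes".
partitions = [
    [(0.0, 1.0, 4.0), (1.0, 0.0, 3.0)],
    [(1.0, 1.0, 6.0), (2.0, 1.0, 8.0)],
    [(2.0, 3.0, 14.0)],
]
gamma = reduce(merge, (partial_summary(p) for p in partitions))
beta = linear_regression(gamma)
print(beta)
```

In this sketch the second phase touches only a (d + 2) × (d + 2) matrix, whatever the number of rows, which is why it can run in main memory on one node. On the toy data the recovered coefficients should be close to the intercept 1 and slopes 2 and 3 used to generate it.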


Metadata
Title
Scalable machine learning computing a data summarization matrix with a parallel array DBMS
Authors
Carlos Ordonez
Yiqun Zhang
S. Lennart Johnsson
Publication date
8 June 2018
Publisher
Springer US
Published in
Distributed and Parallel Databases / Issue 3/2019
Print ISSN: 0926-8782
Electronic ISSN: 1573-7578
DOI
https://doi.org/10.1007/s10619-018-7229-1
