Skip to main content
Erschienen in: The VLDB Journal 4/2019

18.03.2019 | Regular Paper

Fast and scalable method for distributed Boolean tensor factorization

verfasst von: Namyong Park, Sejoon Oh, U Kang

Erschienen in: The VLDB Journal | Ausgabe 4/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

How can we analyze tensors that are composed of 0’s and 1’s? How can we efficiently analyze such Boolean tensors with millions or even billions of entries? Boolean tensors often represent relationship, membership, or occurrences of events such as subject–relation–object tuples in knowledge base data (e.g., ‘Seoul’-‘is the capital of’-‘South Korea’). Boolean tensor factorization (BTF) is a useful tool for analyzing binary tensors to discover latent factors from them. Furthermore, BTF is known to produce more interpretable and sparser results than normal factorization methods. Although several BTF algorithms exist, they do not scale up for large-scale Boolean tensors. In this paper, we propose DBTF, a distributed method for Boolean CP (DBTF-CP) and Tucker (DBTF-TK) factorizations running on the Apache Spark framework. By distributed data generation with minimal network transfer, exploiting the characteristics of Boolean operations, and with careful partitioning, DBTF successfully tackles the high computational costs and minimizes the intermediate data. Experimental results show that DBTF-CP decomposes up to 163–323 \(\times \) larger tensors than existing methods in 82–180 \(\times \) less time, and DBTF-TK decomposes up to 83–163 \(\times \) larger tensors than existing methods in 86–129 \(\times \) less time. Furthermore, both DBTF-CP and DBTF-TK exhibit near-linear scalability in terms of tensor dimensionality, density, rank, and machines.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Literatur
2.
Zurück zum Zitat Erdős, D., Miettinen, P.: Walk ’n’ merge: a scalable algorithm for Boolean tensor factorization. In: ICDM, pp. 1037–1042 (2013) Erdős, D., Miettinen, P.: Walk ’n’ merge: a scalable algorithm for Boolean tensor factorization. In: ICDM, pp. 1037–1042 (2013)
3.
Zurück zum Zitat Miettinen, P.: Boolean tensor factorizations. In: ICDM (2011) Miettinen, P.: Boolean tensor factorizations. In: ICDM (2011)
4.
Zurück zum Zitat Erdős, D., Miettinen, P.: Discovering facts with Boolean tensor tucker decomposition. In: CIKM, pp. 1569–1572 (2013) Erdős, D., Miettinen, P.: Discovering facts with Boolean tensor tucker decomposition. In: CIKM, pp. 1569–1572 (2013)
5.
6.
Zurück zum Zitat Belohlávek, R., Glodeanu, C.V., Vychodil, V.: Optimal factorization of three-way binary data using triadic concepts. Order 30(2), 437–454 (2013)MathSciNetCrossRefMATH Belohlávek, R., Glodeanu, C.V., Vychodil, V.: Optimal factorization of three-way binary data using triadic concepts. Order 30(2), 437–454 (2013)MathSciNetCrossRefMATH
7.
Zurück zum Zitat Leenen, I., Van Mechelen, I., De Boeck, P., Rosenberg, S.: INDCLAS: a three-way hierarchical classes model. Psychometrika 64(1), 9–24 (1999)CrossRefMATH Leenen, I., Van Mechelen, I., De Boeck, P., Rosenberg, S.: INDCLAS: a three-way hierarchical classes model. Psychometrika 64(1), 9–24 (1999)CrossRefMATH
8.
Zurück zum Zitat Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI, pp. 15–28 (2012) Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI, pp. 15–28 (2012)
9.
Zurück zum Zitat Park, N., Oh, S., Kang, U.: Fast and scalable distributed Boolean tensor factorization. In: ICDE, pp. 1071–1082 (2017) Park, N., Oh, S., Kang, U.: Fast and scalable distributed Boolean tensor factorization. In: ICDE, pp. 1071–1082 (2017)
10.
Zurück zum Zitat Cerf, L., Besson, J., Robardet, C., Boulicaut, J.: Closed patterns meet n-ary relations. TKDD 3(1), 3 (2009)CrossRef Cerf, L., Besson, J., Robardet, C., Boulicaut, J.: Closed patterns meet n-ary relations. TKDD 3(1), 3 (2009)CrossRef
11.
Zurück zum Zitat Ji, L., Tan, K., Tung, A.K.H.: Mining frequent closed cubes in 3D datasets. In: VLDB, pp. 811–822 (2006) Ji, L., Tan, K., Tung, A.K.H.: Mining frequent closed cubes in 3D datasets. In: VLDB, pp. 811–822 (2006)
12.
Zurück zum Zitat Kang, U., Papalexakis, E.E., Harpale, A., Faloutsos, C.: Gigatensor: scaling tensor analysis up by 100 times—algorithms and discoveries. In: KDD, pp. 316–324 (2012) Kang, U., Papalexakis, E.E., Harpale, A., Faloutsos, C.: Gigatensor: scaling tensor analysis up by 100 times—algorithms and discoveries. In: KDD, pp. 316–324 (2012)
13.
Zurück zum Zitat Jeon, B., Jeon, I., Sael, L., Kang, U.: Scout: scalable coupled matrix-tensor factorization—algorithm and discoveries. In: ICDE, pp. 811–822 (2016) Jeon, B., Jeon, I., Sael, L., Kang, U.: Scout: scalable coupled matrix-tensor factorization—algorithm and discoveries. In: ICDE, pp. 811–822 (2016)
14.
Zurück zum Zitat Jeon, I., Papalexakis, E.E., Faloutsos, C., Sael, L., Kang, U.: Mining billion-scale tensors: algorithms and discoveries. VLDB J. 25(4), 519–544 (2016)CrossRef Jeon, I., Papalexakis, E.E., Faloutsos, C., Sael, L., Kang, U.: Mining billion-scale tensors: algorithms and discoveries. VLDB J. 25(4), 519–544 (2016)CrossRef
15.
Zurück zum Zitat Sael, L., Jeon, I., Kang, U.: Scalable tensor mining. Big Data Res. 2(2), 82–86 (2015). (visions on Big Data) CrossRef Sael, L., Jeon, I., Kang, U.: Scalable tensor mining. Big Data Res. 2(2), 82–86 (2015). (visions on Big Data) CrossRef
16.
Zurück zum Zitat Park, N., Jeon, B., Lee, J., Kang, U.: Bigtensor: mining billion-scale tensor made easy. In: CIKM, pp. 2457–2460 (2016) Park, N., Jeon, B., Lee, J., Kang, U.: Bigtensor: mining billion-scale tensor made easy. In: CIKM, pp. 2457–2460 (2016)
17.
Zurück zum Zitat Beutel, A., Talukdar, P.P., Kumar, A., Faloutsos, C., Papalexakis, E.E., Xing, E.P.: Flexifact: scalable flexible factorization of coupled tensors on hadoop. In: SDM, pp. 109–117 (2014) Beutel, A., Talukdar, P.P., Kumar, A., Faloutsos, C., Papalexakis, E.E., Xing, E.P.: Flexifact: scalable flexible factorization of coupled tensors on hadoop. In: SDM, pp. 109–117 (2014)
18.
Zurück zum Zitat Papalexakis, E.E., Faloutsos, C., Sidiropoulos, N.D.: Parcube: sparse parallelizable tensor decompositions. In: ECML PKDD, pp. 521–536 (2012) Papalexakis, E.E., Faloutsos, C., Sidiropoulos, N.D.: Parcube: sparse parallelizable tensor decompositions. In: ECML PKDD, pp. 521–536 (2012)
19.
Zurück zum Zitat Li, J., Choi, J., Perros, I., Sun, J., Vuduc, R.: Model-driven sparse CP decomposition for higher-order tensors. In: IPDPS (2017) Li, J., Choi, J., Perros, I., Sun, J., Vuduc, R.: Model-driven sparse CP decomposition for higher-order tensors. In: IPDPS (2017)
20.
Zurück zum Zitat Smith, S., Park, J., Karypis, G.: An exploration of optimization algorithms for high performance tensor completion. In: SC (2016) Smith, S., Park, J., Karypis, G.: An exploration of optimization algorithms for high performance tensor completion. In: SC (2016)
21.
Zurück zum Zitat Karlsson, L., Kressner, D., Uschmajew, A.: Parallel algorithms for tensor completion in the CP format. Parallel Comput. 57, 222–234 (2016)MathSciNetCrossRef Karlsson, L., Kressner, D., Uschmajew, A.: Parallel algorithms for tensor completion in the CP format. Parallel Comput. 57, 222–234 (2016)MathSciNetCrossRef
22.
Zurück zum Zitat Shin, K., Sael, L., Kang, U.: Fully scalable methods for distributed tensor factorization. TKDE 29(1), 100–113 (2017) Shin, K., Sael, L., Kang, U.: Fully scalable methods for distributed tensor factorization. TKDE 29(1), 100–113 (2017)
23.
Zurück zum Zitat Lathauwer, L.D., Moor, B.D., Vandewalle, J.: On the best rank-1 and rank-(\(R_{1},R_{2},\ldots, R_{N}\)) approximation of higher-order tensors. SIMAX 21(4), 1324–1342 (2000)CrossRefMATH Lathauwer, L.D., Moor, B.D., Vandewalle, J.: On the best rank-1 and rank-(\(R_{1},R_{2},\ldots, R_{N}\)) approximation of higher-order tensors. SIMAX 21(4), 1324–1342 (2000)CrossRefMATH
24.
Zurück zum Zitat Kolda, T.G., Sun, J.: Scalable tensor decompositions for multi-aspect data mining. In: ICDM, pp. 363–372 (2008) Kolda, T.G., Sun, J.: Scalable tensor decompositions for multi-aspect data mining. In: ICDM, pp. 363–372 (2008)
25.
Zurück zum Zitat Oh, J., Shin, K., Papalexakis, E.E., Faloutsos, C., Yu, H.: S-hot: scalable high-order tucker decomposition. In: WSDM (2017) Oh, J., Shin, K., Papalexakis, E.E., Faloutsos, C., Yu, H.: S-hot: scalable high-order tucker decomposition. In: WSDM (2017)
26.
Zurück zum Zitat Smith, S., Karypis, G.: Accelerating the tucker decomposition with compressed sparse tensors. In: Europar (2017) Smith, S., Karypis, G.: Accelerating the tucker decomposition with compressed sparse tensors. In: Europar (2017)
27.
Zurück zum Zitat Kaya, O., Uçar, B.: High performance parallel algorithms for the tucker decomposition of sparse tensors. In: ICPP (2016) Kaya, O., Uçar, B.: High performance parallel algorithms for the tucker decomposition of sparse tensors. In: ICPP (2016)
28.
Zurück zum Zitat Oh, S., Park, N., Sael, L., Kang, U.: Scalable tucker factorization for sparse tensors—algorithms and discoveries. In: ICDE (2018) Oh, S., Park, N., Sael, L., Kang, U.: Scalable tucker factorization for sparse tensors—algorithms and discoveries. In: ICDE (2018)
29.
Zurück zum Zitat Chakaravarthy, V.T., Choi, J.W., Joseph, D.J., Liu, X., Murali, P., Sabharwal, Y., Sreedhar, D.: On optimizing distributed tucker decomposition for dense tensors (2017). CoRR arXiv:1707.05594 Chakaravarthy, V.T., Choi, J.W., Joseph, D.J., Liu, X., Murali, P., Sabharwal, Y., Sreedhar, D.: On optimizing distributed tucker decomposition for dense tensors (2017). CoRR arXiv:​1707.​05594
30.
Zurück zum Zitat Choi, J.H., Vishwanathan, S.: Dfacto: distributed factorization of tensors. In: NIPS (2014) Choi, J.H., Vishwanathan, S.: Dfacto: distributed factorization of tensors. In: NIPS (2014)
31.
Zurück zum Zitat Kaya, O., Uçar, B.: Scalable sparse tensor decompositions in distributed memory systems. In: SC, pp. 1–11 (2015) Kaya, O., Uçar, B.: Scalable sparse tensor decompositions in distributed memory systems. In: SC, pp. 1–11 (2015)
32.
Zurück zum Zitat Austin, W., Ballard, G., Kolda, T.G.: Parallel tensor compression for large-scale scientific data. In: IPDPS (2016) Austin, W., Ballard, G., Kolda, T.G.: Parallel tensor compression for large-scale scientific data. In: IPDPS (2016)
33.
Zurück zum Zitat Smith, S., Karypis, G.: A medium-grained algorithm for distributed sparse tensor factorization. In: IPDPS (2016) Smith, S., Karypis, G.: A medium-grained algorithm for distributed sparse tensor factorization. In: IPDPS (2016)
34.
Zurück zum Zitat Acer, S., Torun, T., Aykanat, C.: Improving medium-grain partitioning for scalable sparse tensor decomposition. IEEE Trans. Parallel Distrib. Syst. 29, 2814–2825 (2018)CrossRef Acer, S., Torun, T., Aykanat, C.: Improving medium-grain partitioning for scalable sparse tensor decomposition. IEEE Trans. Parallel Distrib. Syst. 29, 2814–2825 (2018)CrossRef
35.
Zurück zum Zitat Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. In: OSDI, pp. 137–150 (2004) Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. In: OSDI, pp. 137–150 (2004)
37.
Zurück zum Zitat Kang, U., Tsourakakis, C.E., Faloutsos, C.: PEGASUS: a peta-scale graph mining system. In: ICDM, pp. 229–238 (2009) Kang, U., Tsourakakis, C.E., Faloutsos, C.: PEGASUS: a peta-scale graph mining system. In: ICDM, pp. 229–238 (2009)
38.
Zurück zum Zitat Park, H.-M., Park, N., Myaeng, S.-H., Kang, U.: Partition aware connected component computation in distributed systems. In: ICDM (2016) Park, H.-M., Park, N., Myaeng, S.-H., Kang, U.: Partition aware connected component computation in distributed systems. In: ICDM (2016)
39.
Zurück zum Zitat Park, H.-M., Myaeng, S.-H., Kang, U.: Pte: enumerating trillion triangles on distributed systems. In: KDD, pp. 1115–1124 (2016) Park, H.-M., Myaeng, S.-H., Kang, U.: Pte: enumerating trillion triangles on distributed systems. In: KDD, pp. 1115–1124 (2016)
40.
Zurück zum Zitat Kang, U., Tong, H., Sun, J., Lin, C., Faloutsos, C.: GBASE: a scalable and general graph management system. In: KDD Kang, U., Tong, H., Sun, J., Lin, C., Faloutsos, C.: GBASE: a scalable and general graph management system. In: KDD
41.
Zurück zum Zitat Kalavri, V., Vlassov, V.: Mapreduce: limitations, optimizations and open issues. In: TrustCom, pp. 1031–1038 (2013) Kalavri, V., Vlassov, V.: Mapreduce: limitations, optimizations and open issues. In: TrustCom, pp. 1031–1038 (2013)
42.
Zurück zum Zitat Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: HotCloud (2010) Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: HotCloud (2010)
43.
Zurück zum Zitat Lulli, A., Ricci, L., Carlini, E., Dazzi, P., Lucchese, C.: Cracker: crumbling large graphs into connected components. In: ISCC, pp. 574–581 (2015) Lulli, A., Ricci, L., Carlini, E., Dazzi, P., Lucchese, C.: Cracker: crumbling large graphs into connected components. In: ISCC, pp. 574–581 (2015)
44.
Zurück zum Zitat Wiewiórka, M.S., Messina, A., Pacholewska, A., Maffioletti, S., Gawrysiak, P., Okoniewski, M.J.: Sparkseq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics 30(18), 2652–2653 (2014)CrossRef Wiewiórka, M.S., Messina, A., Pacholewska, A., Maffioletti, S., Gawrysiak, P., Okoniewski, M.J.: Sparkseq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics 30(18), 2652–2653 (2014)CrossRef
45.
Zurück zum Zitat Gu, R., Tang, Y., Wang, Z., Wang, S., Yin, X., Yuan, C., Huang, Y.: Efficient large scale distributed matrix computation with spark. In: IEEE BigData, pp. 2327–2336 (2015) Gu, R., Tang, Y., Wang, Z., Wang, S., Yin, X., Yuan, C., Huang, Y.: Efficient large scale distributed matrix computation with spark. In: IEEE BigData, pp. 2327–2336 (2015)
46.
Zurück zum Zitat Zadeh, R.B., Meng, X., Ulanov, A., Yavuz, B., Pu, L., Venkataraman, S., Sparks, E.R., Staple, A., Zaharia, M.: Matrix computations and optimization in apache spark. In: KDD, pp. 31–38 (2016) Zadeh, R.B., Meng, X., Ulanov, A., Yavuz, B., Pu, L., Venkataraman, S., Sparks, E.R., Staple, A., Zaharia, M.: Matrix computations and optimization in apache spark. In: KDD, pp. 31–38 (2016)
47.
Zurück zum Zitat Kim, H., Park, J., Jang, J., Yoon, S.: Deepspark: spark-based deep learning supporting asynchronous updates and caffe compatibility (2016). CoRR arXiv:1602.08191 Kim, H., Park, J., Jang, J., Yoon, S.: Deepspark: spark-based deep learning supporting asynchronous updates and caffe compatibility (2016). CoRR arXiv:​1602.​08191
48.
Zurück zum Zitat Miettinen, P., Mielikäinen, T., Gionis, A., Das, G., Mannila, H.: The discrete basis problem. TKDE 20(10), 1348–1362 (2008) Miettinen, P., Mielikäinen, T., Gionis, A., Das, G., Mannila, H.: The discrete basis problem. TKDE 20(10), 1348–1362 (2008)
Metadaten
Titel
Fast and scalable method for distributed Boolean tensor factorization
verfasst von
Namyong Park
Sejoon Oh
U Kang
Publikationsdatum
18.03.2019
Verlag
Springer Berlin Heidelberg
Erschienen in
The VLDB Journal / Ausgabe 4/2019
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-019-00538-z

Weitere Artikel der Ausgabe 4/2019

The VLDB Journal 4/2019 Zur Ausgabe

Premium Partner