Skip to main content
Top

2015 | OriginalPaper | Chapter

GPU Acceleration of Set Similarity Joins

Authors : Mateus S. H. Cruz, Yusuke Kozawa, Toshiyuki Amagasa, Hiroyuki Kitagawa

Published in: Database and Expert Systems Applications

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

We propose a scheme of efficient set similarity joins on Graphics Processing Units (GPUs). Due to the rapid growth and diversification of data, there is an increasing demand for fast execution of set similarity joins in applications that vary from data integration to plagiarism detection. To tackle this problem, our solution takes advantage of the massive parallel processing offered by GPUs. Additionally, we employ MinHash to estimate the similarity between two sets in terms of Jaccard similarity. By exploiting the high parallelism of GPUs and the space efficiency provided by MinHash, we can achieve high performance without renouncing accuracy. Experimental results show that our proposed method is more than two orders of magnitude faster than the serial version of CPU implementation, and 25 times faster than the parallel version of CPU implementation, while generating highly precise query results.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Böhm, C., Noll, R., Plant, C., Zherdin, A.: Indexsupported similarity join on graphics processors. BTW 144, 57–66 (2009) Böhm, C., Noll, R., Plant, C., Zherdin, A.: Indexsupported similarity join on graphics processors. BTW 144, 57–66 (2009)
2.
go back to reference Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. J. Comput. Syst. Sci. 60(3), 630–659 (2000)MathSciNetCrossRefMATH Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. J. Comput. Syst. Sci. 60(3), 630–659 (2000)MathSciNetCrossRefMATH
3.
go back to reference Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE, p. 5 (2006) Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE, p. 5 (2006)
4.
go back to reference Deng, D., Li, G., Hao, S., Wang, J., Feng, J.: Massjoin: a Mapreduce-based method for scalable string similarity joins. In: ICDE, pp. 340–351 (2014) Deng, D., Li, G., Hao, S., Wang, J., Feng, J.: Massjoin: a Mapreduce-based method for scalable string similarity joins. In: ICDE, pp. 340–351 (2014)
5.
go back to reference Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001) Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)
6.
go back to reference Greathouse, J.L., Daga, M.: Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. In: SC, pp. 769–780 (2014) Greathouse, J.L., Daga, M.: Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. In: SC, pp. 769–780 (2014)
7.
go back to reference He, B., Lu, M., Yang, K., Fang, R., Govindaraju, N.K., Luo, Q., Sander, P.V.: Relational query coprocessing on graphics processors. TODS 34(4), 21:1–21:39 (2009)CrossRef He, B., Lu, M., Yang, K., Fang, R., Govindaraju, N.K., Luo, Q., Sander, P.V.: Relational query coprocessing on graphics processors. TODS 34(4), 21:1–21:39 (2009)CrossRef
8.
go back to reference He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N., Luo, Q., Sander, P.: Relational joins on graphics processors. In: SIGMOD, pp. 511–524 (2008) He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N., Luo, Q., Sander, P.: Relational joins on graphics processors. In: SIGMOD, pp. 511–524 (2008)
9.
go back to reference Hoberock, J., Bell, N.: Thrust: A Productivity-Oriented Library for CUDA (2012) Hoberock, J., Bell, N.: Thrust: A Productivity-Oriented Library for CUDA (2012)
10.
go back to reference Jiang, Y., Li, G., Feng, J., Li, W.S.: String similarity joins: an experimental evaluation. pvldb 7(8), 625–636 (2014)MATH Jiang, Y., Li, G., Feng, J., Li, W.S.: String similarity joins: an experimental evaluation. pvldb 7(8), 625–636 (2014)MATH
11.
go back to reference Li, P., Knig, A.C.: b-bit Minwise Hashing (2009). CoRR. abs/0910.3349 Li, P., Knig, A.C.: b-bit Minwise Hashing (2009). CoRR. abs/0910.3349
12.
go back to reference Li, P., Shrivastava, A., König, A.C.: GPU-based minwise hashing. In: WWW, pp. 565–566 (2012) Li, P., Shrivastava, A., König, A.C.: GPU-based minwise hashing. In: WWW, pp. 565–566 (2012)
13.
go back to reference Li, P., Owen, A.B., Zhang, C.H.: One Permutation Hashing for Efficient Search and Learning (2012). CoRR. abs/1208.1259 Li, P., Owen, A.B., Zhang, C.H.: One Permutation Hashing for Efficient Search and Learning (2012). CoRR. abs/1208.1259
14.
go back to reference Lieberman, M.D., Sankaranarayanan, J., Samet, H.: A fast similarity join algorithm using graphics processing units. In: ICDE, pp. 1111–1120 (2008) Lieberman, M.D., Sankaranarayanan, J., Samet, H.: A fast similarity join algorithm using graphics processing units. In: ICDE, pp. 1111–1120 (2008)
15.
go back to reference Metwally, A., Faloutsos, C.: V-Smart-Join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. PVLDB 5(8), 704–715 (2012) Metwally, A., Faloutsos, C.: V-Smart-Join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. PVLDB 5(8), 704–715 (2012)
16.
go back to reference NVIDIA Corporation: NVIDIA CUDA Compute Unified Device Architecture Programming Guide (2007) NVIDIA Corporation: NVIDIA CUDA Compute Unified Device Architecture Programming Guide (2007)
17.
go back to reference OpenMP Architecture Review Board: OpenMP Application Program Interface Version 4.0 (2013) OpenMP Architecture Review Board: OpenMP Application Program Interface Version 4.0 (2013)
18.
go back to reference Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Krger, J., Lefohn, A., Purcell, T.J.: A survey of general-purpose computation on graphics hardware. Computer Graph. Forum 26(1), 80–113 (2007)CrossRefMATH Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Krger, J., Lefohn, A., Purcell, T.J.: A survey of general-purpose computation on graphics hardware. Computer Graph. Forum 26(1), 80–113 (2007)CrossRefMATH
19.
go back to reference Rares, V., Carey, M.J., Chen, L.: Efficient parallel set-similarity joins using Mapreduce. In: SIGMOD, pp. 495–506 (2010) Rares, V., Carey, M.J., Chen, L.: Efficient parallel set-similarity joins using Mapreduce. In: SIGMOD, pp. 495–506 (2010)
20.
go back to reference Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: SIGMOD, pp. 743–754 (2004) Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: SIGMOD, pp. 743–754 (2004)
21.
go back to reference Sengupta, S., Harris, M., Zhang, Y., Owens, J.D.: Scan primitives for GPU computing. In: GH, pp. 97–106 (2007) Sengupta, S., Harris, M., Zhang, Y., Owens, J.D.: Scan primitives for GPU computing. In: GH, pp. 97–106 (2007)
22.
go back to reference Wang, J., Li, G., Feng, J.: Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In: SIGMOD, pp. 85–96 (2012) Wang, J., Li, G., Feng, J.: Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In: SIGMOD, pp. 85–96 (2012)
23.
go back to reference Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: WWW, pp. 131–140 (2008) Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: WWW, pp. 131–140 (2008)
Metadata
Title
GPU Acceleration of Set Similarity Joins
Authors
Mateus S. H. Cruz
Yusuke Kozawa
Toshiyuki Amagasa
Hiroyuki Kitagawa
Copyright Year
2015
DOI
https://doi.org/10.1007/978-3-319-22849-5_26

Premium Partner