Skip to main content
Top

2021 | OriginalPaper | Chapter

Streaming Set Similarity Joins

Authors : Lucas Pacífico, Leonardo Andrade Ribeiro

Published in: Enterprise Information Systems

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

We consider the problem of efficiently answering set similarity joins over streams. This problem is challenging both in terms of CPU cost, because similarity matching is computationally much more expensive than equality comparisons, and memory requirements, due to the unbounded nature of streams. This article presents SSTR, a novel similarity join algorithm for streams of sets. We adopt the concept of temporal similarity and exploit its properties to improve efficiency and reduce memory usage. Furthermore, we propose a sampling-based technique for ordering set elements that increases the pruning power of SSTR and, thus, reduce even further the number of similarity comparisons and memory consumption. We provide an extensive experimental study on several synthetic as well as real-world datasets. Our results show that the techniques we proposed significantly reduce memory consumption, improve scalability, and lead to substantial performance gains over the baseline approach.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
In our implementation, we avoid repeated calculations of candidate-specific thresholds and overlap bounds by storing them in the map M.
 
2
dblp.uni-trier.de/xml.
 
Literature
1.
go back to reference Abadi, D.J., et al.: The design of the borealis stream processing engine. In: Proceedings of the Conference on Innovative Data Systems Research, pp. 277–289 (2005) Abadi, D.J., et al.: The design of the borealis stream processing engine. In: Proceedings of the Conference on Innovative Data Systems Research, pp. 277–289 (2005)
2.
go back to reference Amagata, D., Hara, T., Xiao, C.: Dynamic Set kNN Self-Join. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 818–829 (2019) Amagata, D., Hara, T., Xiao, C.: Dynamic Set kNN Self-Join. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 818–829 (2019)
3.
go back to reference Anastasiu, D.C., Karypis, G.: L2AP: fast cosine similarity search with prefix L-2 norm bounds. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 784–795 (2014) Anastasiu, D.C., Karypis, G.: L2AP: fast cosine similarity search with prefix L-2 norm bounds. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 784–795 (2014)
4.
go back to reference Baayen, R.H.: Word Frequency Distributions, Text, Speech and Language Technology, vol. 18. Kluwer Academic Publishers (2001) Baayen, R.H.: Word Frequency Distributions, Text, Speech and Language Technology, vol. 18. Kluwer Academic Publishers (2001)
5.
go back to reference Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings of the ACM Symposium on Principles of Database Systems, pp. 1–16 (2002) Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings of the ACM Symposium on Principles of Database Systems, pp. 1–16 (2002)
7.
go back to reference Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the International World Wide Web Conferences, pp. 131–140. ACM (2007) Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the International World Wide Web Conferences, pp. 131–140. ACM (2007)
8.
go back to reference Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations (extended abstract). In: Proceedings of the ACM SIGACT Symposium on Theory of Computing, pp. 327–336. ACM (1998) Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations (extended abstract). In: Proceedings of the ACM SIGACT Symposium on Theory of Computing, pp. 327–336. ACM (1998)
9.
go back to reference Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache Flink\(^{\rm TM}\): stream and batch processing in a single engine. IEEE Data Eng. Bull. 38(4), 28–38 (2015) Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache Flink\(^{\rm TM}\): stream and batch processing in a single engine. IEEE Data Eng. Bull. 38(4), 28–38 (2015)
10.
go back to reference do Carmo Oliveira, D.J., Borges, F.F., Ribeiro, L.A., Cuzzocrea, A.: Set similarity joins with complex expressions on distributed platforms. In: Proceedings of the Symposium on Advances in Databases and Information Systems, pp. 216–230 (2018) do Carmo Oliveira, D.J., Borges, F.F., Ribeiro, L.A., Cuzzocrea, A.: Set similarity joins with complex expressions on distributed platforms. In: Proceedings of the Symposium on Advances in Databases and Information Systems, pp. 216–230 (2018)
11.
go back to reference Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the IEEE International Conference on Data Engineering, p. 5. IEEE Computer Society (2006) Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the IEEE International Conference on Data Engineering, p. 5. IEEE Computer Society (2006)
12.
go back to reference Christiani, T., Pagh, R., Sivertsen, J.: Scalable and robust set similarity join. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 1240–1243. IEEE Computer Society (2018) Christiani, T., Pagh, R., Sivertsen, J.: Scalable and robust set similarity join. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 1240–1243. IEEE Computer Society (2018)
13.
go back to reference Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)CrossRef Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)CrossRef
14.
go back to reference Deng, F., Rafiei, D.: Approximately detecting duplicates for streaming data using stable bloom filters. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 25–36 (2006) Deng, F., Rafiei, D.: Approximately detecting duplicates for streaming data using stable bloom filters. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 25–36 (2006)
15.
go back to reference Dutta, S., Narang, A., Bera, S.K.: Streaming quotient filter: a near optimal approximate duplicate detection approach for data streams. Proc. VLDB Endow. 6(8), 589–600 (2013)CrossRef Dutta, S., Narang, A., Bera, S.K.: Streaming quotient filter: a near optimal approximate duplicate detection approach for data streams. Proc. VLDB Endow. 6(8), 589–600 (2013)CrossRef
16.
go back to reference Kraus, N., Carmel, D., Keidar, I.: Fishing in the stream: similarity search over endless data. In: bigdata, pp. 964–969 (2017) Kraus, N., Carmel, D., Keidar, I.: Fishing in the stream: similarity search over endless data. In: bigdata, pp. 964–969 (2017)
17.
go back to reference Lian, X., Chen, L.: Efficient similarity join over multiple stream time series. IEEE Trans. Knowl. Data Eng. 21(11), 1544–1558 (2009)CrossRef Lian, X., Chen, L.: Efficient similarity join over multiple stream time series. IEEE Trans. Knowl. Data Eng. 21(11), 1544–1558 (2009)CrossRef
18.
go back to reference Lian, X., Chen, L.: Set similarity join on probabilistic data. Proc. VLDB Endow. 3(1), 650–659 (2010)CrossRef Lian, X., Chen, L.: Set similarity join on probabilistic data. Proc. VLDB Endow. 3(1), 650–659 (2010)CrossRef
19.
go back to reference Lian, X., Chen, L.: Similarity join processing on uncertain data streams. IEEE Trans. Knowl. Data Eng. 23(11), 1718–1734 (2011)CrossRef Lian, X., Chen, L.: Similarity join processing on uncertain data streams. IEEE Trans. Knowl. Data Eng. 23(11), 1718–1734 (2011)CrossRef
20.
go back to reference Mann, W., Augsten, N., Bouros, P.: An empirical evaluation of set similarity join techniques. PVLDB 9(9), 636–647 (2016) Mann, W., Augsten, N., Bouros, P.: An empirical evaluation of set similarity join techniques. PVLDB 9(9), 636–647 (2016)
21.
go back to reference Metwally, A., Agrawal, D., El Abbadi, A.: Duplicate detection in click streams. In: Proceedings of the International World Wide Web Conferences, pp. 12–21 (2005) Metwally, A., Agrawal, D., El Abbadi, A.: Duplicate detection in click streams. In: Proceedings of the International World Wide Web Conferences, pp. 12–21 (2005)
22.
go back to reference Morales, G.D.F., Gionis, A.: Streaming similarity self-join. Proc. VLDB Endow. 9(10), 792–803 (2016)CrossRef Morales, G.D.F., Gionis, A.: Streaming similarity self-join. Proc. VLDB Endow. 9(10), 792–803 (2016)CrossRef
23.
go back to reference Pacífico, L., Ribeiro, L.A.: SSTR: set similarity join over stream data. In: International Conference on Enterprise Information Systems, pp. 52–60. SCITEPRESS (2020) Pacífico, L., Ribeiro, L.A.: SSTR: set similarity join over stream data. In: International Conference on Enterprise Information Systems, pp. 52–60. SCITEPRESS (2020)
24.
go back to reference Quirino, R.D., Ribeiro-Júnior, S., Ribeiro, L.A., Martins, W.S.: fgssjoin: A GPU-based algorithm for set similarity joins. In: International Conference on Enterprise Information Systems, pp. 152–161. SCITEPRESS (2017) Quirino, R.D., Ribeiro-Júnior, S., Ribeiro, L.A., Martins, W.S.: fgssjoin: A GPU-based algorithm for set similarity joins. In: International Conference on Enterprise Information Systems, pp. 152–161. SCITEPRESS (2017)
25.
go back to reference Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B.: SJClust: towards a framework for integrating similarity join algorithms and clustering. In: International Conference on Enterprise Information Systems, pp. 75–80. SCITEPRESS (2016) Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B.: SJClust: towards a framework for integrating similarity join algorithms and clustering. In: International Conference on Enterprise Information Systems, pp. 75–80. SCITEPRESS (2016)
26.
go back to reference Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B.: SjClust: a framework for incorporating clustering into set similarity join algorithms. LNCS Trans. Large Scale Data Knowl. Center. Syst. 38, 89–118 (2018) Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B.: SjClust: a framework for incorporating clustering into set similarity join algorithms. LNCS Trans. Large Scale Data Knowl. Center. Syst. 38, 89–118 (2018)
27.
go back to reference Ribeiro, L.A., Härder, T.: Generalizing prefix filtering to improve set similarity joins. Inf. Syst. 36(1), 62–78 (2011)CrossRef Ribeiro, L.A., Härder, T.: Generalizing prefix filtering to improve set similarity joins. Inf. Syst. 36(1), 62–78 (2011)CrossRef
28.
go back to reference Ribeiro, L.A., Schneider, N.C., de Souza Inácio, A., Wagner, H.M., von Wangenheim, A.: Bridging database applications and declarative similarity matching. J. Inf. Data Manage. 7(3), 217–232 (2016) Ribeiro, L.A., Schneider, N.C., de Souza Inácio, A., Wagner, H.M., von Wangenheim, A.: Bridging database applications and declarative similarity matching. J. Inf. Data Manage. 7(3), 217–232 (2016)
29.
go back to reference Ribeiro-Júnior, S., Quirino, R.D., Ribeiro, L.A., Martins, W.S.: Fast parallel set similarity joins on many-core architectures. J. Inf. Data Manage. 8(3), 255–270 (2017) Ribeiro-Júnior, S., Quirino, R.D., Ribeiro, L.A., Martins, W.S.: Fast parallel set similarity joins on many-core architectures. J. Inf. Data Manage. 8(3), 255–270 (2017)
30.
go back to reference Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 743–754 (2004) Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 743–754 (2004)
31.
go back to reference Shen, Z., Cheema, M.A., Lin, X., Zhang, W., Wang, H.: A generic framework for top-k pairs and top-k objects queries over sliding windows. IEEE Trans. Knowl. Data Eng. 26(6), 1349–1366 (2014)CrossRef Shen, Z., Cheema, M.A., Lin, X., Zhang, W., Wang, H.: A generic framework for top-k pairs and top-k objects queries over sliding windows. IEEE Trans. Knowl. Data Eng. 26(6), 1349–1366 (2014)CrossRef
32.
go back to reference Sidney, C.F., Mendes, D.S., Ribeiro, L.A., Härder, T.: Performance prediction for set similarity joins. In: Proceedings of the ACM Symposium on Applied Computing, pp. 967–972 (2015) Sidney, C.F., Mendes, D.S., Ribeiro, L.A., Härder, T.: Performance prediction for set similarity joins. In: Proceedings of the ACM Symposium on Applied Computing, pp. 967–972 (2015)
33.
go back to reference Stonebraker, M., Çetintemel, U., Zdonik, S.B.: The 8 requirements of real-time stream processing. SIGMOD Rec. 34(4), 42–47 (2005)CrossRef Stonebraker, M., Çetintemel, U., Zdonik, S.B.: The 8 requirements of real-time stream processing. SIGMOD Rec. 34(4), 42–47 (2005)CrossRef
34.
go back to reference Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 495–506. ACM (2010) Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 495–506. ACM (2010)
36.
go back to reference Wang, X., Qin, L., Lin, X., Zhang, Y., Chang, L.: Leveraging set relations in exact set similarity join. Proc. VLDB Endow. 10(9), 925–936 (2017)CrossRef Wang, X., Qin, L., Lin, X., Zhang, Y., Chang, L.: Leveraging set relations in exact set similarity join. Proc. VLDB Endow. 10(9), 925–936 (2017)CrossRef
37.
go back to reference Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 36(3), 15:1–15:41 (2011) Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 36(3), 15:1–15:41 (2011)
Metadata
Title
Streaming Set Similarity Joins
Authors
Lucas Pacífico
Leonardo Andrade Ribeiro
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-75418-1_2

Premium Partner