Top

Published in:

2021 | OriginalPaper | Chapter

Streaming Set Similarity Joins

Authors : Lucas Pacífico, Leonardo Andrade Ribeiro

Published in: Enterprise Information Systems

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

We consider the problem of efficiently answering set similarity joins over streams. This problem is challenging both in terms of CPU cost, because similarity matching is computationally much more expensive than equality comparisons, and memory requirements, due to the unbounded nature of streams. This article presents SSTR, a novel similarity join algorithm for streams of sets. We adopt the concept of temporal similarity and exploit its properties to improve efficiency and reduce memory usage. Furthermore, we propose a sampling-based technique for ordering set elements that increases the pruning power of SSTR and, thus, reduce even further the number of similarity comparisons and memory consumption. We provide an extensive experimental study on several synthetic as well as real-world datasets. Our results show that the techniques we proposed significantly reduce memory consumption, improve scalability, and lead to substantial performance gains over the baseline approach.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

previous chapter Anonimisation, Impacts and Challenges into Big Data: A Case Studies

next chapter Flexible OPC UA Data Load Optimizations on the Edge of Production

In our implementation, we avoid repeated calculations of candidate-specific thresholds and overlap bounds by storing them in the map M.

dblp.uni-trier.de/xml.

https://en.wikipedia.org/wiki/Wikipedia:Database_download.

https://developer.twitter.com/en/products/tweets.

Abadi, D.J., et al.: The design of the borealis stream processing engine. In: Proceedings of the Conference on Innovative Data Systems Research, pp. 277–289 (2005)

Amagata, D., Hara, T., Xiao, C.: Dynamic Set kNN Self-Join. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 818–829 (2019)

Anastasiu, D.C., Karypis, G.: L2AP: fast cosine similarity search with prefix L-2 norm bounds. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 784–795 (2014)

Baayen, R.H.: Word Frequency Distributions, Text, Speech and Language Technology, vol. 18. Kluwer Academic Publishers (2001)

Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings of the ACM Symposium on Principles of Database Systems, pp. 1–16 (2002)

Baumgartner, J.: Reddit May 2019 submissions. Harv. Dataverse (2019). https://doi.org/10.7910/DVN/JVI8CTCrossRef

Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the International World Wide Web Conferences, pp. 131–140. ACM (2007)

Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations (extended abstract). In: Proceedings of the ACM SIGACT Symposium on Theory of Computing, pp. 327–336. ACM (1998)

Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache Flink\(^{\rm TM}\): stream and batch processing in a single engine. IEEE Data Eng. Bull. 38(4), 28–38 (2015)

10.

do Carmo Oliveira, D.J., Borges, F.F., Ribeiro, L.A., Cuzzocrea, A.: Set similarity joins with complex expressions on distributed platforms. In: Proceedings of the Symposium on Advances in Databases and Information Systems, pp. 216–230 (2018)

11.

Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the IEEE International Conference on Data Engineering, p. 5. IEEE Computer Society (2006)

12.

Christiani, T., Pagh, R., Sivertsen, J.: Scalable and robust set similarity join. In: Proceedings of the IEEE International Conference on Data Engineering, pp. 1240–1243. IEEE Computer Society (2018)

13.

Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)CrossRef

14.

Deng, F., Rafiei, D.: Approximately detecting duplicates for streaming data using stable bloom filters. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 25–36 (2006)

15.

Dutta, S., Narang, A., Bera, S.K.: Streaming quotient filter: a near optimal approximate duplicate detection approach for data streams. Proc. VLDB Endow. 6(8), 589–600 (2013)CrossRef

16.

Kraus, N., Carmel, D., Keidar, I.: Fishing in the stream: similarity search over endless data. In: bigdata, pp. 964–969 (2017)

17.

Lian, X., Chen, L.: Efficient similarity join over multiple stream time series. IEEE Trans. Knowl. Data Eng. 21(11), 1544–1558 (2009)CrossRef

18.

Lian, X., Chen, L.: Set similarity join on probabilistic data. Proc. VLDB Endow. 3(1), 650–659 (2010)CrossRef

19.

Lian, X., Chen, L.: Similarity join processing on uncertain data streams. IEEE Trans. Knowl. Data Eng. 23(11), 1718–1734 (2011)CrossRef

20.

Mann, W., Augsten, N., Bouros, P.: An empirical evaluation of set similarity join techniques. PVLDB 9(9), 636–647 (2016)

21.

Metwally, A., Agrawal, D., El Abbadi, A.: Duplicate detection in click streams. In: Proceedings of the International World Wide Web Conferences, pp. 12–21 (2005)

22.

Morales, G.D.F., Gionis, A.: Streaming similarity self-join. Proc. VLDB Endow. 9(10), 792–803 (2016)CrossRef

23.

Pacífico, L., Ribeiro, L.A.: SSTR: set similarity join over stream data. In: International Conference on Enterprise Information Systems, pp. 52–60. SCITEPRESS (2020)

24.

Quirino, R.D., Ribeiro-Júnior, S., Ribeiro, L.A., Martins, W.S.: fgssjoin: A GPU-based algorithm for set similarity joins. In: International Conference on Enterprise Information Systems, pp. 152–161. SCITEPRESS (2017)

25.

Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B.: SJClust: towards a framework for integrating similarity join algorithms and clustering. In: International Conference on Enterprise Information Systems, pp. 75–80. SCITEPRESS (2016)

26.

Ribeiro, L.A., Cuzzocrea, A., Bezerra, K.A.A., do Nascimento, B.H.B.: SjClust: a framework for incorporating clustering into set similarity join algorithms. LNCS Trans. Large Scale Data Knowl. Center. Syst. 38, 89–118 (2018)

27.

Ribeiro, L.A., Härder, T.: Generalizing prefix filtering to improve set similarity joins. Inf. Syst. 36(1), 62–78 (2011)CrossRef

28.

Ribeiro, L.A., Schneider, N.C., de Souza Inácio, A., Wagner, H.M., von Wangenheim, A.: Bridging database applications and declarative similarity matching. J. Inf. Data Manage. 7(3), 217–232 (2016)

29.

Ribeiro-Júnior, S., Quirino, R.D., Ribeiro, L.A., Martins, W.S.: Fast parallel set similarity joins on many-core architectures. J. Inf. Data Manage. 8(3), 255–270 (2017)

30.

Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 743–754 (2004)

31.

Shen, Z., Cheema, M.A., Lin, X., Zhang, W., Wang, H.: A generic framework for top-k pairs and top-k objects queries over sliding windows. IEEE Trans. Knowl. Data Eng. 26(6), 1349–1366 (2014)CrossRef

32.

Sidney, C.F., Mendes, D.S., Ribeiro, L.A., Härder, T.: Performance prediction for set similarity joins. In: Proceedings of the ACM Symposium on Applied Computing, pp. 967–972 (2015)

33.

Stonebraker, M., Çetintemel, U., Zdonik, S.B.: The 8 requirements of real-time stream processing. SIGMOD Rec. 34(4), 42–47 (2005)CrossRef

34.

Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 495–506. ACM (2010)

35.

Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)MathSciNetCrossRef

36.

Wang, X., Qin, L., Lin, X., Zhang, Y., Chang, L.: Leveraging set relations in exact set similarity join. Proc. VLDB Endow. 10(9), 925–936 (2017)CrossRef

37.

Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 36(3), 15:1–15:41 (2011)

Title: Streaming Set Similarity Joins
Authors: Lucas Pacífico
Leonardo Andrade Ribeiro
Publisher: Springer International Publishing
Book: Enterprise Information Systems
Print ISBN: 978-3-030-75417-4

Electronic ISBN: 978-3-030-75418-1

Copyright Year: 2021
DOI: https://doi.org/10.1007/978-3-030-75418-1_2

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Premium Partner