09.04.2024 | Regular Paper

Enhancing Multi-Attribute Similarity Join using Reduced and Adaptive Index Trees

Authors: Vítor Bezerra Silva, Dimas Cassimiro Nascimento

Published in: Knowledge and Information Systems

Abstract

Multi-Attribute Similarity Join is an important task for a variety of applications. Because the data volumes involved are large, several techniques and approaches have been proposed to avoid superfluous comparisons between entities. One of these techniques is the Index Tree. In this work, we propose an adaptive version (Adaptive Index Tree) of the state-of-the-art Index Tree for multi-attribute data. Our method selects the best filter configuration to construct the Adaptive Index Tree. We also propose a reduced version of the Index Trees, aiming to improve the trade-off between efficacy and efficiency for the Similarity Join task. Finally, we propose Filter and Feature selectors designed for the Similarity Join task. To evaluate the impact of the proposed approaches, we ran an experimental analysis on five real-world datasets. Based on the experiments, we conclude that our reduced approaches produce superior results compared with the state-of-the-art approach, especially on datasets that have a significant number of attributes and/or large attribute sizes.
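
To make the task concrete, the following is a minimal sketch of a multi-attribute similarity join in which a cheap filter discards candidate pairs before the exact similarity is computed. It is not the Index Tree or any method from the paper: a simple length filter over token sets stands in for the filtering step, Jaccard similarity is an assumed measure, and all names (`tokens`, `jaccard`, `passes_length_filter`, `similarity_join`), the sample records, and the thresholds are hypothetical.

```python
from itertools import combinations


def tokens(value):
    """Split an attribute value into a set of lowercase word tokens."""
    return set(value.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def passes_length_filter(a, b, threshold):
    """Cheap necessary condition, since Jaccard(a, b) <= min(|a|, |b|) / max(|a|, |b|)."""
    small, large = sorted((len(a), len(b)))
    if large == 0:
        return True
    return small >= threshold * large


def similarity_join(records, thresholds):
    """Return index pairs (i, j) whose every attribute meets its threshold."""
    tokenized = [[tokens(v) for v in rec] for rec in records]
    result = []
    for i, j in combinations(range(len(records)), 2):
        pairs = list(zip(tokenized[i], tokenized[j], thresholds))
        # Filter step: drop the pair if any attribute fails the cheap check.
        if not all(passes_length_filter(a, b, t) for a, b, t in pairs):
            continue
        # Verification step: compute the exact similarity per attribute.
        if all(jaccard(a, b) >= t for a, b, t in pairs):
            result.append((i, j))
    return result


# Hypothetical two-attribute records (title, venue) and per-attribute thresholds.
records = [
    ("efficient similarity join", "acm sigmod"),
    ("efficient similarity joins", "acm sigmod 2015"),
    ("principal component analysis", "computer vision"),
]
matches = similarity_join(records, thresholds=(0.4, 0.6))
```

This sketch still enumerates every pair of records, so the filter only saves the cost of the exact comparison. An index structure, as the abstract describes, aims to avoid generating most of those candidate pairs in the first place.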

Metadata
Title
Enhancing Multi-Attribute Similarity Join using Reduced and Adaptive Index Trees
Authors
Vítor Bezerra Silva
Dimas Cassimiro Nascimento
Publication date
09.04.2024
Publisher
Springer London
Published in
Knowledge and Information Systems
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-024-02089-4
