Skip to main content
Erschienen in: Knowledge and Information Systems 1/2020

26.03.2019 | Regular Paper

Exploiting block co-occurrence to control block sizes for entity resolution

verfasst von: Dimas Cassimiro Nascimento, Carlos Eduardo Santos Pires, Demetrio Gomes Mestre

Erschienen in: Knowledge and Information Systems | Ausgabe 1/2020

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The problem of identifying duplicated entities in a dataset has gained increasing importance during the last decades. Due to the large size of the datasets, this problem can be very costly to be solved due to its intrinsic quadratic complexity. Both researchers and practitioners have developed a variety of techniques aiming to speed up a solution to this problem. One of these techniques is called blocking, an indexing technique that splits the dataset into a set of blocks, such that each block contains entities that share a common property evaluated by a blocking key function. In order to improve the efficacy of the blocking technique, multiple blocking keys may be used, and thus, a set of blocking results is generated. In this paper, we investigate how to control the size of the blocks generated by the use of multiple blocking keys and maintain reasonable quality results, which is measured by the quality of the produced blocks. By controlling the size of the blocks, we can reduce the overall cost of solving an entity resolution problem and facilitate the execution of a variety of tasks (e.g., real-time and privacy-preserving entity resolution). For doing so, we propose many heuristics which exploit the co-occurrence of entities among the generated blocks for pruning, splitting and merging blocks. The experimental results we carry out using four datasets confirm the adequacy of the proposed heuristics for generating block sizes within a predefined range threshold as well as maintaining reasonable blocking quality results.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
1
Portions of the entity schema can also be exploited [26].
 
2
For single blocking key indexing, we employ the following notation: \(\{bk_1\{e_1,e_2,\ldots \}, bk_2\{e_3,e_4,\ldots \},bk_3\{e_5,e_6,\ldots \},\ldots \}\), such that each \(bk_i\) represents an unique blocking key value generated by the employed blocking key function.
 
3
For indexing using multiple blocking key functions, we employ the following notation: \(\{\{bk_{i,1}\{e_1,e_2,\ldots \}, bk_{i,2}\{e_3,e_4,\ldots \}, bk_{i,3}\{e_5,e_6,\ldots \},\ldots \}, \{bk_{j,1}\{e_1,e_2,\ldots \},bk_{j,2}\{e_3,e_4,\ldots \}, bk_{j,3}\{e_5,e_6, \ldots \},\ldots \},\ldots \}\), s.t. each \(bk_{l,m}\) is the m-th blocking key value generated by the l-th blocking key function.
 
4
These schemes have produced encouraging experimental results in [26].
 
5
The merge operations may be performed using blocks from different blocking results (although this possibility is not shown in Fig. 1).
 
6
We assume that the changes in the blocks performed by the algorithm are reflected in the input blocking collection \(\mathcal {B}\). We use the same assumption for Algorithms 3–7.
 
7
This special case is not shown in the MaxIntersectionMerge algorithm in order to simplify its representation.
 
8
For simplification, we consider that only the blocks \(R\{d_7, d_8, d_{12}\}\), \(Spain\{d_2, d_3, d_8, d_9, d_{10}\}\), \(C\{d_2, d_3, d_4\}\) and \(L\{d_{10}, d_{14}\}\) are pruned by the lBCE algorithm, but all blocks in \(\mathcal {B}_{D_1}\) are considered in order to calculate the co-occurrence score between entities.
 
14
\(\lambda \) values greater than 0.4 have produced low-quality results in the conducted experiments and thus are not reported in this paper.
 
15
The “Default” combination (in Table 8) is the only algorithm that is not influenced by the \(\lambda \) parameter.
 
16
These results are highlighted in bold in Tables 91011 and 12.
 
17
These results are highlighted in bold in Tables 9, 10, 11 and 12.
 
Literatur
1.
Zurück zum Zitat Batini C, Scannapieco M (2016) Data quality dimensions. Springer, Cham, pp 21–51MATH Batini C, Scannapieco M (2016) Data quality dimensions. Springer, Cham, pp 21–51MATH
2.
Zurück zum Zitat Batini C, Cappiello C, Francalanci C, Maurino A (2009) Methodologies for data quality assessment and improvement. ACM Comput Surv (CSUR) 41(3):16CrossRef Batini C, Cappiello C, Francalanci C, Maurino A (2009) Methodologies for data quality assessment and improvement. ACM Comput Surv (CSUR) 41(3):16CrossRef
3.
Zurück zum Zitat Bilenko M, Kamath B, Mooney RJ (2006) Adaptive blocking: learning to scale up record linkage. In: Sixth international conference on data mining, ICDM’06. IEEE, pp 87–96 Bilenko M, Kamath B, Mooney RJ (2006) Adaptive blocking: learning to scale up record linkage. In: Sixth international conference on data mining, ICDM’06. IEEE, pp 87–96
4.
Zurück zum Zitat Christen P (2012) Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer, New YorkCrossRef Christen P (2012) Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer, New YorkCrossRef
5.
Zurück zum Zitat Christen P (2012) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 24(9):1537–1555CrossRef Christen P (2012) A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans Knowl Data Eng 24(9):1537–1555CrossRef
6.
Zurück zum Zitat Cohen WW, Richman J (2002) Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 475–480 Cohen WW, Richman J (2002) Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 475–480
7.
Zurück zum Zitat Costa G, Manco G, Ortale R (2010) An incremental clustering scheme for data de-duplication. Data Min Knowl Discov 20(1):152–187MathSciNetCrossRef Costa G, Manco G, Ortale R (2010) An incremental clustering scheme for data de-duplication. Data Min Knowl Discov 20(1):152–187MathSciNetCrossRef
8.
Zurück zum Zitat Covell M, Baluja S (2009) Lsh banding for large-scale retrieval with memory and recall constraints. In: IEEE international conference on acoustics, speech and signal processing, ICASSP 2009. IEEE, pp 1865–1868 Covell M, Baluja S (2009) Lsh banding for large-scale retrieval with memory and recall constraints. In: IEEE international conference on acoustics, speech and signal processing, ICASSP 2009. IEEE, pp 1865–1868
9.
Zurück zum Zitat De Vries T, Ke H, Chawla S, Christen P (2009) Robust record linkage blocking using suffix arrays. In: Proceedings of the 18th ACM conference on Information and knowledge management. ACM, pp 305–314 De Vries T, Ke H, Chawla S, Christen P (2009) Robust record linkage blocking using suffix arrays. In: Proceedings of the 18th ACM conference on Information and knowledge management. ACM, pp 305–314
10.
Zurück zum Zitat do Nascimento DC, Pires CES, Mestre DG (2018) Heuristic-based approaches for speeding up incremental record linkage. J Syst Softw 137:335–354CrossRef do Nascimento DC, Pires CES, Mestre DG (2018) Heuristic-based approaches for speeding up incremental record linkage. J Syst Softw 137:335–354CrossRef
11.
Zurück zum Zitat Ebraheem M, Thirumuruganathan S, Joty S, Ouzzani M, Tang N (2018) Distributed representations of tuples for entity resolution. Proc VLDB Endow 11(11):1454–1467CrossRef Ebraheem M, Thirumuruganathan S, Joty S, Ouzzani M, Tang N (2018) Distributed representations of tuples for entity resolution. Proc VLDB Endow 11(11):1454–1467CrossRef
12.
Zurück zum Zitat Fisher J, Christen P, Wang Q, Rahm E (2015) A clustering-based framework to control block sizes for entity resolution. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 279–288 Fisher J, Christen P, Wang Q, Rahm E (2015) A clustering-based framework to control block sizes for entity resolution. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 279–288
13.
Zurück zum Zitat Ganganath N, Cheng CT, Chi KT (2014) Data clustering with cluster size constraints using a modified \(k\)-means algorithm. In: 2014 International conference on cyber-enabled distributed computing and knowledge discovery (CyberC). IEEE, pp 158–161 Ganganath N, Cheng CT, Chi KT (2014) Data clustering with cluster size constraints using a modified \(k\)-means algorithm. In: 2014 International conference on cyber-enabled distributed computing and knowledge discovery (CyberC). IEEE, pp 158–161
14.
Zurück zum Zitat Giraud-Carrier C, Goodliffe J, Jones BM, Cueva S (2015) Effective record linkage for mining campaign contribution data. Knowl Inf Syst 45(2):389–416CrossRef Giraud-Carrier C, Goodliffe J, Jones BM, Cueva S (2015) Effective record linkage for mining campaign contribution data. Knowl Inf Syst 45(2):389–416CrossRef
15.
Zurück zum Zitat Gomes Mestre D, Pires CES (2013) Improving load balancing for mapreduce-based entity matching. In: 2013 IEEE symposium on computers and communications (ISCC). IEEE, pp 000618–000624 Gomes Mestre D, Pires CES (2013) Improving load balancing for mapreduce-based entity matching. In: 2013 IEEE symposium on computers and communications (ISCC). IEEE, pp 000618–000624
16.
Zurück zum Zitat Gruenheid A, Dong XL, Srivastava D (2014) Incremental record linkage. Proc VLDB Endow 7(9):697–708CrossRef Gruenheid A, Dong XL, Srivastava D (2014) Incremental record linkage. Proc VLDB Endow 7(9):697–708CrossRef
17.
Zurück zum Zitat Hassanzadeh O, Chiang F, Lee HC, Miller RJ (2009) Framework for evaluating clustering algorithms in duplicate detection. Proc VLDB Endow 2(1):1282–1293CrossRef Hassanzadeh O, Chiang F, Lee HC, Miller RJ (2009) Framework for evaluating clustering algorithms in duplicate detection. Proc VLDB Endow 2(1):1282–1293CrossRef
18.
Zurück zum Zitat Kolb L, Thor A, Rahm E (2012) Multi-pass sorted neighborhood blocking with mapreduce. Comput Sci Res Dev 27(1):45–63CrossRef Kolb L, Thor A, Rahm E (2012) Multi-pass sorted neighborhood blocking with mapreduce. Comput Sci Res Dev 27(1):45–63CrossRef
19.
Zurück zum Zitat Koudas N, Sarawagi S, Srivastava D (2006) Record linkage: similarity measures and algorithms. In: Proceedings of the 2006 ACM SIGMOD international conference on Management of data. ACM, pp 802–803 Koudas N, Sarawagi S, Srivastava D (2006) Record linkage: similarity measures and algorithms. In: Proceedings of the 2006 ACM SIGMOD international conference on Management of data. ACM, pp 802–803
20.
Zurück zum Zitat Malinen MI, Fränti P (2014) Balanced \(k\)-means for clustering. In: Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR). Springer, pp 32–41 Malinen MI, Fränti P (2014) Balanced \(k\)-means for clustering. In: Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR). Springer, pp 32–41
21.
Zurück zum Zitat Mann W, Augsten N, Bouros P (2016) An empirical evaluation of set similarity join techniques. Proc VLDB Endow 9(9):636–647CrossRef Mann W, Augsten N, Bouros P (2016) An empirical evaluation of set similarity join techniques. Proc VLDB Endow 9(9):636–647CrossRef
22.
Zurück zum Zitat Mestre DG, Pires CE, Nascimento DC (2015) Adaptive sorted neighborhood blocking for entity matching with mapreduce. In: Proceedings of the 30th annual ACM symposium on applied computing. ACM, pp 981–987 Mestre DG, Pires CE, Nascimento DC (2015) Adaptive sorted neighborhood blocking for entity matching with mapreduce. In: Proceedings of the 30th annual ACM symposium on applied computing. ACM, pp 981–987
23.
Zurück zum Zitat Mestre DG, Pires CES, Nascimento DC (2017) Towards the efficient parallelization of multi-pass adaptive blocking for entity matching. J Parallel Distrib Comput 101:27–40CrossRef Mestre DG, Pires CES, Nascimento DC (2017) Towards the efficient parallelization of multi-pass adaptive blocking for entity matching. J Parallel Distrib Comput 101:27–40CrossRef
24.
Zurück zum Zitat Michelson M, Knoblock CA (2006) Learning blocking schemes for record linkage. In: AAAI, pp 440–445 Michelson M, Knoblock CA (2006) Learning blocking schemes for record linkage. In: AAAI, pp 440–445
25.
Zurück zum Zitat Nascimento DC, Pires CE, Mestre D (2015) Data quality monitoring of cloud databases based on data quality SLAs. In: Trovati M, Hill R, Anjum A, Zhu S, Liu L (eds) Big-data analytics and cloud computing. Springer, Cham, pp 3–20CrossRef Nascimento DC, Pires CE, Mestre D (2015) Data quality monitoring of cloud databases based on data quality SLAs. In: Trovati M, Hill R, Anjum A, Zhu S, Liu L (eds) Big-data analytics and cloud computing. Springer, Cham, pp 3–20CrossRef
26.
Zurück zum Zitat Papadakis G, Koutrika G, Palpanas T, Nejdl W (2014) Meta-blocking: taking entity resolutionto the next level. IEEE Trans Knowl Data Eng 26(8):1946–1960CrossRef Papadakis G, Koutrika G, Palpanas T, Nejdl W (2014) Meta-blocking: taking entity resolutionto the next level. IEEE Trans Knowl Data Eng 26(8):1946–1960CrossRef
27.
Zurück zum Zitat Papadakis G, Papastefanatos G, Koutrika G (2014) Supervised meta-blocking. Proc VLDB Endow 7(14):1929–1940CrossRef Papadakis G, Papastefanatos G, Koutrika G (2014) Supervised meta-blocking. Proc VLDB Endow 7(14):1929–1940CrossRef
28.
Zurück zum Zitat Papenbrock T, Heise A, Naumann F (2015) Progressive duplicate detection. IEEE Trans Knowl Data Eng 27(5):1316–1329CrossRef Papenbrock T, Heise A, Naumann F (2015) Progressive duplicate detection. IEEE Trans Knowl Data Eng 27(5):1316–1329CrossRef
29.
Zurück zum Zitat Ramadan B, Christen P, Liang H, Gayler RW (2015) Dynamic sorted neighborhood indexing for real-time entity resolution. J Data Inf Qual 6(4):15 Ramadan B, Christen P, Liang H, Gayler RW (2015) Dynamic sorted neighborhood indexing for real-time entity resolution. J Data Inf Qual 6(4):15
30.
Zurück zum Zitat Ranbaduge T, Vatsalan D, Christen P (2015) Clustering-based scalable indexing for multi-party privacy-preserving record linkage. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 549–561 Ranbaduge T, Vatsalan D, Christen P (2015) Clustering-based scalable indexing for multi-party privacy-preserving record linkage. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 549–561
31.
Zurück zum Zitat Ranbaduge T, Vatsalan D, Christen P, Verykios V (2016) Hashing-based distributed multi-party blocking for privacy-preserving record linkage. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 415–427 Ranbaduge T, Vatsalan D, Christen P, Verykios V (2016) Hashing-based distributed multi-party blocking for privacy-preserving record linkage. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 415–427
32.
Zurück zum Zitat Rebollo-Monedero D, Solé M, Nin J, Forné J (2013) A modification of the \(k\)-means method for quasi-unsupervised learning. Knowl Based Syst 37:176–185CrossRef Rebollo-Monedero D, Solé M, Nin J, Forné J (2013) A modification of the \(k\)-means method for quasi-unsupervised learning. Knowl Based Syst 37:176–185CrossRef
33.
Zurück zum Zitat Vatsalan D, Christen P, Verykios VS (2013) A taxonomy of privacy-preserving record linkage techniques. Inf Syst 38(6):946–969CrossRef Vatsalan D, Christen P, Verykios VS (2013) A taxonomy of privacy-preserving record linkage techniques. Inf Syst 38(6):946–969CrossRef
34.
Zurück zum Zitat Vatsalan D, Christen P (2013) Sorted nearest neighborhood clustering for efficient private blocking. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 341–352 Vatsalan D, Christen P (2013) Sorted nearest neighborhood clustering for efficient private blocking. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 341–352
35.
Zurück zum Zitat Verykios VS, Karakasidis A, Mitrogiannis VK (2009) Privacy preserving record linkage approaches. Int J Data Min Model Manag 1(2):206–221MATH Verykios VS, Karakasidis A, Mitrogiannis VK (2009) Privacy preserving record linkage approaches. Int J Data Min Model Manag 1(2):206–221MATH
36.
Zurück zum Zitat Whang SE, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H (2009) Entity resolution with iterative blocking. In: Proceedings of the 2009 ACM SIGMOD international conference on management of data. ACM, pp 219–232 Whang SE, Menestrina D, Koutrika G, Theobald M, Garcia-Molina H (2009) Entity resolution with iterative blocking. In: Proceedings of the 2009 ACM SIGMOD international conference on management of data. ACM, pp 219–232
37.
Zurück zum Zitat Whang SE, Marmaros D, Garcia-Molina H (2013) Pay-as-you-go entity resolution. IEEE Trans Knowl Data Eng 25(5):1111–1124CrossRef Whang SE, Marmaros D, Garcia-Molina H (2013) Pay-as-you-go entity resolution. IEEE Trans Knowl Data Eng 25(5):1111–1124CrossRef
38.
Zurück zum Zitat Yan S, Lee D, Kan MY, Giles LC (2007) Adaptive sorted neighborhood methods for efficient record linkage. In: Proceedings of the 7th ACM/IEEE-CS joint conference on digital libraries. ACM, pp 185–194 Yan S, Lee D, Kan MY, Giles LC (2007) Adaptive sorted neighborhood methods for efficient record linkage. In: Proceedings of the 7th ACM/IEEE-CS joint conference on digital libraries. ACM, pp 185–194
39.
Zurück zum Zitat Zhu S, Wang D, Li T (2010) Data clustering with size constraints. Knowl Based Syst 23(8):883–889CrossRef Zhu S, Wang D, Li T (2010) Data clustering with size constraints. Knowl Based Syst 23(8):883–889CrossRef
Metadaten
Titel
Exploiting block co-occurrence to control block sizes for entity resolution
verfasst von
Dimas Cassimiro Nascimento
Carlos Eduardo Santos Pires
Demetrio Gomes Mestre
Publikationsdatum
26.03.2019
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 1/2020
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-019-01347-0

Weitere Artikel der Ausgabe 1/2020

Knowledge and Information Systems 1/2020 Zur Ausgabe