Published in: New Generation Computing 1/2017

10.01.2017 | Special Feature

Improving Record Linkage Accuracy with Hierarchical Feature Level Information and Parsed Data

Authors: Yun Zhou, Minlue Wang, Valeriia Haberland, John Howroyd, Sebastian Danicic, J. Mark Bishop

Abstract

Probabilistic record linkage is a well-established topic in the literature. Fellegi–Sunter probabilistic record linkage and its enhanced versions are commonly used methods that calculate match and non-match weights for each pair of records. Bayesian network classifiers, namely the naive Bayes classifier and the tree augmented naive Bayes classifier (TAN), have also been used successfully for this task. Recently, an extended version of TAN (called ETAN) has been developed and shown to be superior in classification accuracy to conventional TAN. However, no previous work has applied ETAN to record linkage or investigated the benefits of using the hierarchical feature level information and parsed fields that exist naturally in the datasets. In this work, we extend the naive Bayes classifier with such hierarchical feature level information. We then illustrate the benefits of our method over previously proposed methods on four datasets in terms of linkage performance (\(F_1\) score). We also show that the results can be further improved by additionally parsing the fields of these datasets.
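
As a rough illustration of the two quantities the abstract relies on, the following minimal Python sketch computes textbook Fellegi–Sunter agreement and disagreement weights for binary field comparisons, and the \(F_1\) score from counts of true positives, false positives, and false negatives. It is not the authors' implementation: the function names, the field names, and the m/u probabilities in `PROBS` are hypothetical values chosen only for the example.

```python
import math


def field_weights(m: float, u: float) -> tuple[float, float]:
    """Return (agreement, disagreement) log2 weights for one field.

    m = P(field agrees | records are a true match)
    u = P(field agrees | records are a true non-match)
    """
    agree = math.log2(m / u)
    disagree = math.log2((1.0 - m) / (1.0 - u))
    return agree, disagree


def pair_score(agreements: dict[str, bool],
               probs: dict[str, tuple[float, float]]) -> float:
    """Sum the per-field weights for one record pair (binary comparisons)."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = probs[field]
        agree_w, disagree_w = field_weights(m, u)
        total += agree_w if agrees else disagree_w
    return total


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 if there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical m/u probabilities, for illustration only.
PROBS = {"name": (0.9, 0.1), "address": (0.8, 0.2)}

if __name__ == "__main__":
    # A pair whose names agree but whose addresses disagree.
    print(pair_score({"name": True, "address": False}, PROBS))
    print(f1_score(tp=8, fp=2, fn=2))
```

A pair is typically declared a match when its summed score exceeds a chosen threshold; the threshold itself is a tuning decision that this sketch leaves out.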

Footnotes
1
Note that in the conventional PRL-FS method [8], two fields are either matched or unmatched. Thus, the index k of \(m_{k,i}\) can be omitted in this case.
 
2
These datasets can be found at http://yzhou.github.io/.
 
3
Because the phone number is unique for each restaurant, it can be used on its own to identify duplicates without resorting to probabilistic record linkage techniques. This field is therefore not used in our experiments.
 
4
In each dataset, we introduce only one hierarchical restriction, between the name and address fields.
 
References
1.
Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Proceedings of the 28th international conference on Very Large Data Bases, VLDB Endowment, pp. 586–597 (2002)
2.
de Campos, C.P., Cuccu, M., Corani, G., Zaffalon, M.: Extended tree augmented naive classifier. In: van der Gaag, L.C., Feelders, A.J. (eds.) Probabilistic Graphical Models, pp. 176–189. Springer, Berlin (2014)
3.
de Campos, C.P., Corani, G., Scanagatta, M., Cuccu, M., Zaffalon, M.: Learning extended tree augmented naive structures. Int. J. Approx. Reason. 68, 153–163 (2016)
4.
Christen, P., Belacic, D.: Automated probabilistic address standardisation and verification. In: Australasian Data Mining Conference (AusDM05), pp. 53–67 (2005)
5.
Churches, T., Christen, P., Lim, K., Zhu, J.X.: Preparation of name and address data for record linkage using hidden Markov models. BMC Med. Inf. Decis. Making 2(1), 1 (2002)
6.
Dunn, H.L.: Record linkage*. Am. J. Public Health Nations Health 36(12), 1412–1416 (1946)
7.
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
8.
Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
9.
Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach. Learn. 29(2–3), 131–163 (1997)
10.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
11.
Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20(3), 197–243 (1995)
12.
Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)
13.
Köpcke, H., Rahm, E.: Frameworks for entity matching: a comparison. Data Knowl. Eng. 69(2), 197–210 (2010)
14.
Leitão, L., Calado, P., Weis, M.: Structure-based inference of XML similarity for fuzzy duplicate detection. In: Proceedings of the sixteenth ACM conference on information and knowledge management, ACM, New York, NY, USA, CIKM ’07, pp. 293–302 (2007)
15.
Leitão, L., Calado, P., Herschel, M.: Efficient and effective duplicate detection in hierarchical data. IEEE Trans. Knowl. Data Eng. 25(5), 1028–1041 (2013)
16.
Li, X., Guttmann, A., Cipiere, S., Maigne, L., Demongeot, J., Boire, J.Y., Ouchchane, L.: Implementation of an extended Fellegi–Sunter probabilistic record linkage method using the Jaro–Winkler string comparator. In: 2014 IEEE-EMBS international conference on biomedical and health informatics (BHI), IEEE, pp. 375–379 (2014)
17.
Rahm, E., Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)
18.
Ravikumar, P., Cohen, W.W.: A hierarchical graphical model for record linkage. In: Proceedings of the 20th conference on uncertainty in artificial intelligence, AUAI Press, pp. 454–461 (2004)
19.
Tromp, M., Ravelli, A.C., Bonsel, G.J., Hasman, A., Reitsma, J.B.: Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. J. Clin. Epidemiol. 64(5), 565–572 (2011)
20.
Winkler, W.E.: String comparator metrics and enhanced decision rules in the Fellegi–Sunter model of record linkage. In: Proceedings of the section on survey research, pp. 354–359 (1990)
21.
Winkler, W.E.: The state of record linkage and current research problems. In: Statistical research division, US Census Bureau, Citeseer (1999)
22.
Zhou, Y., Fenton, N., Neil, M.: Bayesian network approach to multinomial parameter learning using data and expert judgments. Int. J. Approx. Reason. 55(5), 1252–1268 (2014)
23.
Zhou, Y., Fenton, N., Hospedales, T., Neil, M.: Probabilistic graphical models parameter learning with transferred prior and constraints. In: Proceedings of the 31st conference on uncertainty in artificial intelligence, AUAI Press, pp. 972–981 (2015a)
24.
Zhou, Y., Howroyd, J., Danicic, S., Bishop, J.: Extending naive Bayes classifier with hierarchy feature level information for record linkage. In: Suzuki, J., Ueno, M. (eds.) Advanced Methodologies for Bayesian Networks, Lecture Notes in Computer Science, vol. 9505, pp. 93–104. Springer, Berlin. doi:10.1007/978-3-319-28379-1_7 (2015b)
Metadata
Title
Improving Record Linkage Accuracy with Hierarchical Feature Level Information and Parsed Data
Authors
Yun Zhou
Minlue Wang
Valeriia Haberland
John Howroyd
Sebastian Danicic
J. Mark Bishop
Publication date
10.01.2017
Publisher
Ohmsha
Published in
New Generation Computing / Issue 1/2017
Print ISSN: 0288-3635
Electronic ISSN: 1882-7055
DOI
https://doi.org/10.1007/s00354-016-0008-5