2015 | OriginalPaper | Chapter

Semi-supervised Instance Matching Using Boosted Classifiers

Authors: Mayank Kejriwal, Daniel P. Miranker

Published in: The Semantic Web. Latest Advances and New Domains

Publisher: Springer International Publishing

Abstract

Instance matching concerns identifying pairs of instances that refer to the same underlying entity. Current state-of-the-art instance matchers use machine learning methods. Supervised learning systems achieve good performance by training on significant amounts of manually labeled samples. To alleviate the labeling effort, this paper presents a minimally supervised instance matching approach that is able to deliver competitive performance using only 2 % training data and little parameter tuning. As a first step, the classifier is trained in an ensemble setting using boosting. Iterative semi-supervised learning is used to improve the performance of the boosted classifier even further, by re-training it on the most confident samples labeled in the current iteration. Empirical evaluations on a suite of six publicly available benchmarks show that the proposed system outcompetes optimization-based minimally supervised approaches in 1–7 iterations. The system’s average F-Measure is shown to be within 2.5 % of that of recent supervised systems that require more training samples for effective performance.

Footnotes
1
Block purging eliminates clusters larger than a threshold value, with the premise that such clusters are the result of (non-discriminative) stop-word tokens [20].
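As a rough illustration of the block purging described in this footnote, the following is a minimal sketch that assumes simple token-based blocking; the function names, the sample records, and the threshold value are hypothetical and are not taken from the paper or from [20].

```python
from collections import defaultdict


def build_token_blocks(records):
    """Map each lower-cased token to the set of record ids that contain it."""
    blocks = defaultdict(set)
    for record_id, text in records.items():
        for token in text.lower().split():
            blocks[token].add(record_id)
    return dict(blocks)


def purge_blocks(blocks, max_block_size):
    """Keep only blocks whose size does not exceed the threshold.

    Oversized blocks are assumed to come from non-discriminative
    (stop-word-like) tokens, so they are dropped entirely.
    """
    return {
        token: members
        for token, members in blocks.items()
        if len(members) <= max_block_size
    }


# Hypothetical toy records; "the" occurs in every record.
records = {
    "a1": "the canon powershot camera",
    "a2": "the nikon coolpix camera",
    "b1": "the canon powershot",
    "b2": "the sony headphones",
}

blocks = build_token_blocks(records)
purged = purge_blocks(blocks, max_block_size=2)
```

Here the block for "the" holds all four records and is purged, while the block for "canon" (two records) survives. In practice the threshold would need to be chosen per dataset.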
 
3
Note that \(2^7=128\,\%\). To prevent this extra source (28 %) of noise, the seventh iteration of Algorithm 1 sets \(factor\) to 100/64=1.5625. More generally, Algorithm 1 can be implemented to take \(x\) as a parameter, and to enforce \(factor^{num-1}x \le 100\,\%\).
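The capping rule in this footnote can be sketched as follows. This is only an illustration under stated assumptions: the starting value \(x = 2\,\%\) and the default \(factor = 2\) are inferred from the arithmetic in the footnote, the function name is hypothetical, and Algorithm 1 itself is not reproduced here.

```python
def retraining_schedule(x=2.0, factor=2.0, num_iterations=7):
    """Return the target percentage for each iteration, never above 100 %.

    Assumption: the percentage starts at x and is multiplied by `factor`
    in every iteration; the multiplier is reduced whenever the product
    would exceed 100 %.
    """
    schedule = [x]
    for _ in range(num_iterations - 1):
        current = schedule[-1]
        # Use the smaller multiplier so that current * step <= 100.
        step = min(factor, 100.0 / current)
        schedule.append(current * step)
    return schedule


schedule = retraining_schedule()
```

With these defaults the last multiplier becomes 100/64 = 1.5625, so the seventh value is 100 % rather than 128 %.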
 
4
http://oaei.ontologymatching.org/2010/im/index.html. We did not use the 2014 IAEI benchmarks because, at the time of writing, their ground-truths were unavailable, and they were not evaluated by competing instance matching baselines.
 
7
The general F-Measure formula is parametrized by a quantity, \(\beta \). In the case of the \(F_1-Measure\), \(\beta =1\).
 
8
For example, the maximum achievable classification F-Measure on Amazon-GoogleProducts is \(2*83.54*100/(83.54+100)=91.03\,\%\), since maximum achievable recall is the candidate set recall, 83.54 %.
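The bound in this footnote follows from the standard \(F_\beta\) definition, \(F_\beta = (1+\beta^2)PR/(\beta^2 P + R)\), evaluated at perfect precision and at the candidate set recall. A minimal sketch that reproduces the arithmetic (the function name is illustrative, not from the paper):

```python
def f_beta(precision, recall, beta=1.0):
    """General F-Measure; beta = 1 gives the F1-Measure.

    Precision and recall are expected on the same scale
    (here: percentages).
    """
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Best case on Amazon-GoogleProducts: 100 % precision, while recall is
# limited by the candidate set recall of 83.54 %.
max_fm = f_beta(precision=100.0, recall=83.54)
```

Rounded to two decimals, `max_fm` matches the 91.03 % stated in the footnote.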
 
9
The exception was Restaurants, where the random forest achieved the best FM, 100 %.
 
10
The reference for this claim is Fig. 3 (on page 6) of the original paper [11].
 
11
Supplemental experimental results are noted on the project website (footnote 2).
 
12
This corresponds to \(2^5=32\,\%\) of the ground-truth (assuming no re-training noise).
 
Literature
1.
Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 39–48. ACM (2003)
2.
Chapelle, O., Schölkopf, B., Zien, A., et al.: Semi-supervised Learning, vol. 2. MIT Press, Cambridge (2006)
3.
Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, New York (2012)
4.
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24(9), 1537–1555 (2012)
5.
Christen, P., Churches, T., Hegland, M.: Febrl – a parallel open source data linkage system. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 638–647. Springer, Heidelberg (2004)
6.
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
7.
Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Vitányi, P.M.B. (ed.) EuroCOLT 1995. LNCS, vol. 904, pp. 23–37. Springer, Heidelberg (1995)
8.
Isele, R., Jentzsch, A., Bizer, C.: Efficient multidimensional blocking for link discovery without losing recall. In: WebDB (2011)
9.
Kejriwal, M., Miranker, D.P.: An unsupervised algorithm for learning blocking schemes. In: 2013 IEEE 13th International Conference on Data Mining (ICDM), pp. 340–349. IEEE (2013)
10.
Kejriwal, M., Miranker, D.P.: A two-step blocking scheme learner for scalable link discovery. In: Ontology Matching, p. 49 (2014)
11.
Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow. 3(1–2), 484–493 (2010)
12.
Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002)
13.
Ma, Y., Tran, T., Bicer, V.: Typifier: inferring the type semantics of structured data. In: 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 206–217. IEEE (2013)
14.
Ngomo, A.-C.N.: A time-efficient hybrid approach to link discovery. In: Ontology Matching, p. 1 (2011)
15.
Ngomo, A.-C.N., Lehmann, J., Auer, S., Höffner, K.: RAVEN – active learning of link specifications. In: Proceedings of the Sixth International Workshop on Ontology Matching, pp. 25–37. Citeseer (2011)
16.
Ngomo, A.-C.N., Lyko, K.: EAGLE: efficient active learning of link specifications using genetic programming. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) ESWC 2012. LNCS, vol. 7295, pp. 149–163. Springer, Heidelberg (2012)
17.
Ngomo, A.-C.N., Lyko, K.: Unsupervised learning of link specifications: deterministic vs. non-deterministic. In: OM, pp. 25–36 (2013)
18.
Nikolov, A., d’Aquin, M., Motta, E.: Unsupervised learning of link discovery configuration. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) ESWC 2012. LNCS, vol. 7295, pp. 119–133. Springer, Heidelberg (2012)
19.
Nikolov, A., Uren, V., Motta, E., De Roeck, A.: Handling instance coreferencing in the KnoFuss architecture (2008)
20.
Papadakis, G., Ioannou, E., Palpanas, T., Nejdl, W., et al.: A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Trans. Knowl. Data Eng. 25, 2665–2682 (2013)
21.
Rätsch, G., Onoda, T., Müller, K.-R.: Soft margins for AdaBoost. Mach. Learn. 42(3), 287–320 (2001)
22.
Rong, S., Niu, X., Xiang, E.W., Wang, H., Yang, Q., Yu, Y.: A machine learning approach for instance matching based on similarity metrics. In: Cudré-Mauroux, P., Heflin, J., Sirin, E., Tudorache, T., Euzenat, J., Hauswirth, M., Parreira, J.X., Hendler, J., Schreiber, G., Bernstein, A., Blomqvist, E. (eds.) ISWC 2012, Part I. LNCS, vol. 7649, pp. 460–475. Springer, Heidelberg (2012)
23.
Ruck, D.W., Rogers, S.K., Kabrisky, M., Oxley, M.E., Suter, B.W.: The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Trans. Neural Netw. 1(4), 296–298 (1990)
24.
Scharffe, F., Ferrara, A., Nikolov, A., et al.: Data linking for the semantic web. Int. J. Semant. Web Inf. Syst. 7(3), 46–76 (2011)
25.
Scharffe, F., Liu, Y., Zhou, C.: RDF-AI: an architecture for RDF datasets matching, fusion and interlink. In: Proceedings of the IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), Pasadena (2009)
26.
Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of the linked data best practices in different topical domains. In: Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C., Vrandečić, D., Groth, P., Noy, N., Janowicz, K., Goble, C. (eds.) ISWC 2014, Part I. LNCS, vol. 8796, pp. 245–260. Springer, Heidelberg (2014)
27.
Soru, T., Ngomo, A.-C.N.: A comparison of supervised learning classifiers for link discovery. In: Proceedings of the 10th International Conference on Semantic Systems, pp. 41–44. ACM (2014)
Metadata
Title
Semi-supervised Instance Matching Using Boosted Classifiers
Authors
Mayank Kejriwal
Daniel P. Miranker
Copyright Year
2015
DOI
https://doi.org/10.1007/978-3-319-18818-8_24