Skip to main content
Erschienen in: The VLDB Journal 5/2013

01.10.2013 | Special Issue Paper

Large-scale linked data integration using probabilistic reasoning and crowdsourcing

verfasst von: Gianluca Demartini, Djellel Eddine Difallah, Philippe Cudré-Mauroux

Erschienen in: The VLDB Journal | Ausgabe 5/2013

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

We tackle the problems of semiautomatically matching linked data sets and of linking large collections of Web pages to linked data. Our system, ZenCrowd, (1) uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and (2) identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud. First, we use structured inverted indices to quickly find potential candidate results from entities that have been indexed in our system. Our system then analyzes the candidate matches and refines them whenever deemed necessary using computationally more expensive queries on a graph database. Finally, we resort to human computation by dynamically generating crowdsourcing tasks in case the algorithmic components fail to come up with convincing results. We integrate all results from the inverted indices, from the graph database and from the crowd using a probabilistic framework in order to make sensible decisions about candidate matches and to identify unreliable human workers. In the following, we give an overview of the architecture of our system and describe in detail our novel three-stage blocking technique and our probabilistic decision framework. We also report on a series of experimental results on a standard data set, showing that our system can achieve a 95 % average accuracy on instance matching (as compared to the initial 88 % average accuracy of the purely automatic baseline) while drastically limiting the amount of work performed by the crowd. The experimental evaluation of our system on the entity linking task shows an average relative improvement of 14 % over our best automatic approach.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
14
In our experiments, we use 100 ground truth matchings that are discarded later when evaluating the proposed matching approaches.
 
15
We can already see the benefit of having better matchings across data sets for that matter.
 
20
The testset we have created together with the matching results from the crowd is available for download at the page: http://​exascale.​info/​ZenCrowd.
 
21
This is the average accuracy over all entity types reported in Table 3.
 
22
The improvement is statistically significant (t test \(p<0.05\)).
 
23
The test collection we created is available for download at: http://​exascale.​info/​zencrowd/​.
 
29
Our approach is hence similar to Blanco et al. [7], though we do not use BM25F as a ranking function.
 
30
Confidence scores have all been normalized to \([0,1]\) by manually defining a transformation function.
 
Literatur
1.
Zurück zum Zitat Alonso, O., Baeza-Yates, R.A.: Design and implementation of relevance assessments using crowdsourcing. In: ECIR, pp. 153–164 (2011). Alonso, O., Baeza-Yates, R.A.: Design and implementation of relevance assessments using crowdsourcing. In: ECIR, pp. 153–164 (2011).
2.
Zurück zum Zitat Bailey, P., de Vries, A.P., Craswell, N., Soboroff, I.: Overview of the TREC 2007 enterprise track. In: TREC (2007) Bailey, P., de Vries, A.P., Craswell, N., Soboroff, I.: Overview of the TREC 2007 enterprise track. In: TREC (2007)
3.
Zurück zum Zitat Balog, K., Serdyukov, P., de Vries, A.P.: Overview of the TREC 2010 entity track. In: TREC (2010) Balog, K., Serdyukov, P., de Vries, A.P.: Overview of the TREC 2010 entity track. In: TREC (2010)
4.
Zurück zum Zitat Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. In: IJCAI, pp. 2670–2676 (2007) Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. In: IJCAI, pp. 2670–2676 (2007)
5.
Zurück zum Zitat Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’03, pp. 39–48. ACM, New York (2003). doi:10.1145/956750.956759 Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’03, pp. 39–48. ACM, New York (2003). doi:10.​1145/​956750.​956759
6.
Zurück zum Zitat Blanco, R., Halpin, H., Herzig, D., Mika, P., Pound, J., Thompson, H.S., Tran, D.T.: Repeatable and reliable search system evaluation using crowdsourcing. In: SIGIR, pp. 923–932 (2011) Blanco, R., Halpin, H., Herzig, D., Mika, P., Pound, J., Thompson, H.S., Tran, D.T.: Repeatable and reliable search system evaluation using crowdsourcing. In: SIGIR, pp. 923–932 (2011)
7.
Zurück zum Zitat Blanco, R., Mika, P., Vigna, S.: Effective and efficient entity search in RDF data. In: International Semantic Web Conference (ISWC), pp. 83–97 (2011) Blanco, R., Mika, P., Vigna, S.: Effective and efficient entity search in RDF data. In: International Semantic Web Conference (ISWC), pp. 83–97 (2011)
8.
Zurück zum Zitat Bouquet, P., Stoermer, H., Niederee, C., Mana, A.: Entity name system: the backbone of an open and scalable web of data. In: Proceedings of the IEEE International Conference on Semantic Computing (ICSC), pp. 554–561 (2008) Bouquet, P., Stoermer, H., Niederee, C., Mana, A.: Entity name system: the backbone of an open and scalable web of data. In: Proceedings of the IEEE International Conference on Semantic Computing (ICSC), pp. 554–561 (2008)
9.
Zurück zum Zitat Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL, vol. 6 (2006) Bunescu, R., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL, vol. 6 (2006)
10.
Zurück zum Zitat Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: EACL (2006) Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: EACL (2006)
12.
13.
Zurück zum Zitat Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of EMNLP-CoNLL, vol. 2007, pp. 708–716 (2007) Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of EMNLP-CoNLL, vol. 2007, pp. 708–716 (2007)
14.
Zurück zum Zitat Cudré-Mauroux, P., Aberer, K., Feher, A.: Probabilistic message passing in peer data management systems. In: International Conference on Data Engineering (ICDE) (2006) Cudré-Mauroux, P., Aberer, K., Feher, A.: Probabilistic message passing in peer data management systems. In: International Conference on Data Engineering (ICDE) (2006)
15.
Zurück zum Zitat Cudré-Mauroux, P., Haghani, P., Jost, M., Aberer, K., De Meer, H.: idMesh: graph-based disambiguation of linked data. In: WWW ’09, pp. 591–600. ACM, New York (2009). doi:10.1145/1526709.1526789 Cudré-Mauroux, P., Haghani, P., Jost, M., Aberer, K., De Meer, H.: idMesh: graph-based disambiguation of linked data. In: WWW ’09, pp. 591–600. ACM, New York (2009). doi:10.​1145/​1526709.​1526789
16.
Zurück zum Zitat Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the ACL (2002) Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the ACL (2002)
17.
Zurück zum Zitat Demartini, G., Difallah, D.E., Cudré-Mauroux, P.: Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In: Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pp. 469–478. ACM, New York (2012). doi:10.1145/2187836.2187900 Demartini, G., Difallah, D.E., Cudré-Mauroux, P.: Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In: Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pp. 469–478. ACM, New York (2012). doi:10.​1145/​2187836.​2187900
18.
Zurück zum Zitat Demartini, G., Iofciu, T., de Vries, A.P.: Overview of the INEX 2009 entity ranking track. In: INEX, pp. 254–264 (2009) Demartini, G., Iofciu, T., de Vries, A.P.: Overview of the INEX 2009 entity ranking track. In: INEX, pp. 254–264 (2009)
19.
Zurück zum Zitat Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39 (1977) Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39 (1977)
20.
Zurück zum Zitat Difallah, D.E., Demartini, G., Cudré-Mauroux, P.: Pick-A-Crowd: Tell me what you like, and I’ll tell you what to do. In: WWW’13. ACM, New York (2013) Difallah, D.E., Demartini, G., Cudré-Mauroux, P.: Pick-A-Crowd: Tell me what you like, and I’ll tell you what to do. In: WWW’13. ACM, New York (2013)
21.
Zurück zum Zitat Dong, X., Halevy, A., Madhavan, J.: Reference reconciliation in complex information spaces. In: SIGMOD, pp. 85–96. ACM, New York (2005) Dong, X., Halevy, A., Madhavan, J.: Reference reconciliation in complex information spaces. In: SIGMOD, pp. 85–96. ACM, New York (2005)
22.
Zurück zum Zitat Feng, A., Franklin, M.J., Kossmann, D., Kraska, T., Madden, S., Ramesh, S., Wang, A., Xin, R.: CrowdDB: Query Processing with the VLDB Crowd. PVLDB 4(11), 1387–1390 (2011) Feng, A., Franklin, M.J., Kossmann, D., Kraska, T., Madden, S., Ramesh, S., Wang, A., Xin, R.: CrowdDB: Query Processing with the VLDB Crowd. PVLDB 4(11), 1387–1390 (2011)
23.
Zurück zum Zitat Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating named entities in Twitter data with crowdsourcing. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, pp. 80–88 (2010) Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating named entities in Twitter data with crowdsourcing. In: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, pp. 80–88 (2010)
24.
Zurück zum Zitat Getoor, L., Machanavajjhala, A.: Entity resolution: tutorial. In: VLDB (2012) Getoor, L., Machanavajjhala, A.: Entity resolution: tutorial. In: VLDB (2012)
25.
Zurück zum Zitat Haas, K., Mika, P., Tarjan, P., Blanco, R.: Enhanced results for web search. In: SIGIR, pp. 725–734 (2011) Haas, K., Mika, P., Tarjan, P., Blanco, R.: Enhanced results for web search. In: SIGIR, pp. 725–734 (2011)
26.
Zurück zum Zitat Han, X., Zhao, J.: Named entity disambiguation by leveraging wikipedia semantic knowledge. In: Proceeding of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, pp. 215–224. ACM, New York (2009). doi:10.1145/1645953.1645983 Han, X., Zhao, J.: Named entity disambiguation by leveraging wikipedia semantic knowledge. In: Proceeding of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, pp. 215–224. ACM, New York (2009). doi:10.​1145/​1645953.​1645983
27.
Zurück zum Zitat Jaro, M.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa. Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)CrossRef Jaro, M.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa. Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)CrossRef
28.
Zurück zum Zitat Kazai, G.: In search of quality in crowdsourcing for search engine evaluation. In: ECIR, pp. 165–176 (2011) Kazai, G.: In search of quality in crowdsourcing for search engine evaluation. In: ECIR, pp. 165–176 (2011)
29.
Zurück zum Zitat Kazai, G., Kamps, J., Koolen, M., Milic-Frayling, N.: Crowdsourcing for book search evaluation: impact of hit design on comparative system ranking. In: SIGIR, pp. 205–214 (2011) Kazai, G., Kamps, J., Koolen, M., Milic-Frayling, N.: Crowdsourcing for book search evaluation: impact of hit design on comparative system ranking. In: SIGIR, pp. 205–214 (2011)
30.
Zurück zum Zitat Klein, D., Manning, C.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430 (2003) Klein, D., Manning, C.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430 (2003)
31.
Zurück zum Zitat Kohlschütter, C., Fankhauser, P., Nejdl, W.: Boilerplate detection using shallow text features. In: WSDM, pp. 441–450 (2010) Kohlschütter, C., Fankhauser, P., Nejdl, W.: Boilerplate detection using shallow text features. In: WSDM, pp. 441–450 (2010)
32.
Zurück zum Zitat Kschischang, F., Frey, B., Loeliger, H.A.: Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory 47(2) (2001) Kschischang, F., Frey, B., Loeliger, H.A.: Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory 47(2) (2001)
33.
Zurück zum Zitat Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet Physics Doklady, vol. 10, pp. 707–710 (1966) Levenshtein, V.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet Physics Doklady, vol. 10, pp. 707–710 (1966)
36.
Zurück zum Zitat Marcus, A., Wu, E., Karger, D.R., Madden, S., Miller, R.C.: Human-powered sorts and joins. PVLDB 5(1), 13–24 (2011) Marcus, A., Wu, E., Karger, D.R., Madden, S., Miller, R.C.: Human-powered sorts and joins. PVLDB 5(1), 13–24 (2011)
37.
Zurück zum Zitat Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems (I-Semantics) (2011) Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems (I-Semantics) (2011)
38.
Zurück zum Zitat Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge. In: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM ’07, pp. 233–242. ACM, New York (2007). doi: 10.1145/1321440.1321475 Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge. In: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, CIKM ’07, pp. 233–242. ACM, New York (2007). doi: 10.​1145/​1321440.​1321475
39.
Zurück zum Zitat Murphy, K.M., Weiss, Y., Jordan, M.I.: Loopy belief propagation for approximate inference: an empirical study. In: Uncertainty in Artificial Intelligence (UAI) (1999) Murphy, K.M., Weiss, Y., Jordan, M.I.: Loopy belief propagation for approximate inference: an empirical study. In: Uncertainty in Artificial Intelligence (UAI) (1999)
40.
Zurück zum Zitat On, B., Koudas, N., Lee, D., Srivastava, D.: Group linkage. In: Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE), pp. 496–505 (2007) On, B., Koudas, N., Lee, D., Srivastava, D.: Group linkage. In: Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE), pp. 496–505 (2007)
41.
Zurück zum Zitat Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., Nejdl, W.: Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM ’12, pp. 53–62. ACM, New York (2012). doi:10.1145/2124295.2124305 Papadakis, G., Ioannou, E., Niederée, C., Palpanas, T., Nejdl, W.: Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM ’12, pp. 53–62. ACM, New York (2012). doi:10.​1145/​2124295.​2124305
42.
Zurück zum Zitat Pound, J., Mika, P., Zaragoza, H.: Ad-hoc object retrieval in the web of data. In: WWW, pp. 771–780 (2010) Pound, J., Mika, P., Zaragoza, H.: Ad-hoc object retrieval in the web of data. In: WWW, pp. 771–780 (2010)
44.
Zurück zum Zitat Shen, W., Wang, J., Luo, P., Wang, M.: Liege: link entities in web lists with knowledge base. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’12, pp. 1424–1432. ACM, New York (2012). doi:10.1145/2339530.2339753 Shen, W., Wang, J., Luo, P., Wang, M.: Liege: link entities in web lists with knowledge base. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’12, pp. 1424–1432. ACM, New York (2012). doi:10.​1145/​2339530.​2339753
45.
Zurück zum Zitat Tonon, A., Demartini, G., Cudré-Mauroux, P.: Combining inverted indices and structured search for ad-hoc object retrieval. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pp. 125–134. ACM, New York (2012). doi:10.1145/2348283.2348304 Tonon, A., Demartini, G., Cudré-Mauroux, P.: Combining inverted indices and structured search for ad-hoc object retrieval. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pp. 125–134. ACM, New York (2012). doi:10.​1145/​2348283.​2348304
47.
Zurück zum Zitat von Ahn, L., Liu, R., Blum, M.: Peekaboom: a game for locating objects in images. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’06, pp. 55–64. ACM, New York (2006). doi:10.1145/1124772.1124782 von Ahn, L., Liu, R., Blum, M.: Peekaboom: a game for locating objects in images. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’06, pp. 55–64. ACM, New York (2006). doi:10.​1145/​1124772.​1124782
48.
Zurück zum Zitat Wang, J., Kraska, T., Franklin, M.J., Feng, J.: CrowdER: crowdsourcing entity resolution. PVLDB 5(11), 1483–1494 (2012) Wang, J., Kraska, T., Franklin, M.J., Feng, J.: CrowdER: crowdsourcing entity resolution. PVLDB 5(11), 1483–1494 (2012)
49.
Zurück zum Zitat Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, SIGMOD ’09, pp. 219–232. ACM, New York (2009). doi:10.1145/1559845.1559870 Whang, S.E., Menestrina, D., Koutrika, G., Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, SIGMOD ’09, pp. 219–232. ACM, New York (2009). doi:10.​1145/​1559845.​1559870
50.
Zurück zum Zitat Winkler, W.: The state of record linkage and current research problems. US Census Bureau. In: Statistical Research Division (1999) Winkler, W.: The state of record linkage and current research problems. US Census Bureau. In: Statistical Research Division (1999)
51.
Zurück zum Zitat Wylot, M., Pont, J., Wisniewski, M., Cudré-Mauroux, P.: dipLODocus[RDF]—short and long-tail rdf analytics for massive webs of data. In: International Semantic Web Conference (ISWC), pp. 778–793 (2011) Wylot, M., Pont, J., Wisniewski, M., Cudré-Mauroux, P.: dipLODocus[RDF]—short and long-tail rdf analytics for massive webs of data. In: International Semantic Web Conference (ISWC), pp. 778–793 (2011)
Metadaten
Titel
Large-scale linked data integration using probabilistic reasoning and crowdsourcing
verfasst von
Gianluca Demartini
Djellel Eddine Difallah
Philippe Cudré-Mauroux
Publikationsdatum
01.10.2013
Verlag
Springer Berlin Heidelberg
Erschienen in
The VLDB Journal / Ausgabe 5/2013
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-013-0324-z

Weitere Artikel der Ausgabe 5/2013

The VLDB Journal 5/2013 Zur Ausgabe

Premium Partner