Published in: Datenbank-Spektrum 2/2019

15.05.2019 | Focus article

Using the Semantic Web as a Source of Training Data

Written by: Christian Bizer, Anna Primpeli, Ralph Peeters

Abstract

Deep neural networks are increasingly used for tasks such as entity resolution, sentiment analysis, and information extraction. Because these methods are training-data hungry, large training sets are needed for them to play to their strengths. Millions of websites have started to annotate structured data within HTML pages using the schema.org vocabulary. Popular types of annotated entities are products, reviews, events, people, hotels, and other local businesses [12]. All major search engines use these semantic annotations to display rich snippets in search results, which is the main driver behind the wide-scale adoption of the annotation techniques.
This article explores the potential of using semantic annotations from large numbers of websites as training data for supervised entity resolution, sentiment analysis, and information extraction methods. After giving an overview of the types of structured data that are available on the Semantic Web, we focus on the task of product matching in e-commerce and explain how semantic annotations can be used to gather a large training dataset for product matching. The dataset consists of more than 20 million pairs of offers referring to the same products. The offers were extracted from 43 thousand e-shops that provide schema.org annotations including some form of product identifier, such as manufacturer part numbers (MPNs), global trade item numbers (GTINs), or stock keeping units (SKUs). The dataset, which we offer for public download, is orders of magnitude larger than the Walmart-Amazon [7], Amazon-Google [10], and Abt-Buy [10] datasets that are widely used to evaluate product matching methods. We verify the utility of the dataset as training data by using it to replicate the recent result of Mudgal et al. [15], which states that embeddings and RNNs outperform traditional symbolic matching methods on tasks involving less structured data. After the case study on product matching, we turn to sentiment analysis and information extraction and discuss how semantic annotations from the Web can be used as training data for both tasks.
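The abstract does not spell out how the offer pairs are derived, so the following is only a minimal sketch of one plausible reading: offers that carry the same product identifier value are grouped, and every pair within a group is treated as a match. The sample offers, the shop names, the normalisation step (trimming and lower-casing), and the function name `positive_pairs` are all assumptions made for illustration; they are not taken from the article.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical offers: (offer id, shop, identifier value, title).
OFFERS = [
    ("o1", "shop-a.example", "MPN-123", "Acme camera body"),
    ("o2", "shop-b.example", "mpn-123 ", "ACME Camera (body only)"),
    ("o3", "shop-b.example", "MPN-999", "Acme tripod"),
    ("o4", "shop-c.example", "MPN-123", "Camera Acme, black"),
]


def positive_pairs(offers):
    """Group offers by normalised identifier and return all pairs per group."""
    clusters = defaultdict(list)
    for offer_id, _shop, identifier, _title in offers:
        # Assumed normalisation: ignore case and surrounding whitespace.
        clusters[identifier.strip().lower()].append(offer_id)

    pairs = []
    for members in clusters.values():
        # Every two offers sharing an identifier form one matching pair.
        pairs.extend(combinations(members, 2))
    return pairs


if __name__ == "__main__":
    for left, right in positive_pairs(OFFERS):
        print(left, right)
```

Because the number of pairs grows quadratically with the size of each identifier group, a modest number of offers per product can already yield a very large pair set. Identifier noise, such as shops reusing the same SKU for different products, would have to be handled separately and is ignored here.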

Footnotes
6. Examples of pay-level domains are amazon.de and https://www.ebay.co.uk/.
8. Regex applied to each predicate URI to capture identity-revealing properties: .*/(gtin8|gtin12|gtin13|gtin14|sku|mpn|identifier|productID).
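As a rough illustration of how the pattern in footnote 8 could be applied, the sketch below matches it against a few predicate URIs. It assumes that the trailing period in the footnote is sentence punctuation rather than part of the pattern, and that the whole URI must match (`re.fullmatch`); the footnote does not say which matching mode was used. The function name and the sample URIs are hypothetical.

```python
import re

# Pattern as quoted in the footnote, without the sentence-final period.
IDENTIFIER_PREDICATE = re.compile(
    r".*/(gtin8|gtin12|gtin13|gtin14|sku|mpn|identifier|productID)"
)


def identifier_property(predicate_uri):
    """Return the matched identifier property name, or None if there is no match."""
    # fullmatch: the URI must end exactly with one of the property names.
    match = IDENTIFIER_PREDICATE.fullmatch(predicate_uri)
    return match.group(1) if match else None


if __name__ == "__main__":
    for uri in (
        "http://schema.org/Product/sku",
        "http://schema.org/Offer/gtin13",
        "http://schema.org/Product/name",
    ):
        print(uri, "->", identifier_property(uri))
```

The pattern is case-sensitive as written, so a predicate ending in `productid` would not be captured unless the `re.IGNORECASE` flag were added.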
 
References
1. Achichi M, Cheatham M, Dragisic Z et al (2017) Results of the ontology alignment evaluation initiative 2017. In: Proceedings of the 12th ISWC Workshop on Ontology Matching, pp 61–113
3. Daskalaki E, Flouris G, Fundulaki I, Tzanina S (2016) Instance matching benchmarks in the era of Linked Data. J Web Semant 39C:1–14
4. Deriu J et al (2017) Leveraging large amounts of weakly supervised data for multi-language sentiment classification. In: Proceedings of the 26th International Conference on World Wide Web – WWW ’17. ACM Press, Perth, pp 1045–1052
5. Ebraheem M, Thirumuruganathan S, Joty S, Ouzzani M, Tang N (2018) Distributed representations of tuples for entity resolution. Proc VLDB Endow 11:1454–1467
6. Foley J, Bendersky M, Josifovski V (2015) Learning to extract local events from the web. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 423–432
7. Gokhale C et al (2014) Corleone: hands-off crowdsourcing for entity matching. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data – SIGMOD ’14. ACM Press, Snowbird, pp 601–612
8. Kärle E, Fensel A, Toma I, Fensel D (2016) Why are there more hotels in Tyrol than in Austria? Analyzing schema.org usage in the hotel domain. In: Proceedings of the International Conference on Information and Communication Technologies in Tourism 2016. Springer, Cham, pp 99–112
9. Konda P et al (2016) Magellan: toward building entity matching management systems over data science stacks. Proc VLDB Endow 9(13):1581–1584
10. Köpcke H, Thor A, Rahm E (2010) Evaluation of entity resolution approaches on real-world match problems. Proc VLDB Endow 3(1–2):484–493
11. Liu B (2012) Sentiment analysis and opinion mining. Synth Lect Hum Lang Technol 5(1):1–167
12. Meusel et al (2014) The webdatacommons microdata, RDFa and microformat dataset series. In: Proceedings of the International Semantic Web Conference, pp 277–292
13. Meusel R, Paulheim H (2015) Heuristics for fixing common errors in deployed schema.org microdata. In: The semantic web. Latest advances and new domains. Springer, Cham, pp 152–168
14. Meusel R, Paulheim H (2015) Creating large-scale training and test corpora for extracting structured data from the web. In: Proceedings of the Third Workshop on Linked Data for Information Extraction
15. Mudgal S et al (2018) Deep learning for entity matching: a design space exploration. In: Proceedings of the 2018 International Conference on Management of Data – SIGMOD ’18. ACM Press, Houston, pp 19–34
16. Petrovski P, Bizer C (2017) Extracting attribute-value pairs from product specifications on the web. In: Proceedings of the International Conference on Web Intelligence – WI ’17. ACM Press, Leipzig, pp 558–565
17. Petrovski P, Bryl V, Bizer C (2014) Integrating product data from websites offering microdata markup. In: Proceedings of the 23rd International Conference on World Wide Web – WWW ’14 Companion. ACM Press, Seoul, pp 1299–1304
18. Petrovski P, Primpeli A, Meusel R, Bizer C (2017) The WDC gold standards for product feature extraction and product matching. In: Proceedings of the International Conference on E-Commerce and Web Technologies. Springer, Cham, pp 73–86
19. Qiu D, Barbosa L, Dong XL, Shen Y, Srivastava D (2015) Dexter: large-scale discovery and extraction of product specifications on the web. Proc VLDB Endow 8(13):2194–2205
20. Ristoski P, Petrovski P, Mika P, Paulheim H (2018) A machine learning approach for product matching and categorization. Semant Web 9(5):707–728
21. Rosenthal S, Farra N, Nakov P (2017) SemEval-2017 task 4: sentiment analysis in Twitter. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp 502–518
22. Severyn A, Moschitti A (2015) Twitter sentiment analysis with deep convolutional neural networks. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval – SIGIR ’15. ACM Press, Santiago, pp 959–962
23. Shah K, Kopru S, Ruvini JD (2018) Neural network based extreme classification and similarity models for product matching. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Industry papers, vol 3. Association for Computational Linguistics, New Orleans, pp 8–15
24. Suganthan P, Doan A et al (2017) Falcon: scaling up hands-off crowdsourced entity matching to build cloud services. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp 1431–1446
25. Tang D et al (2016) Sentiment embeddings with applications to sentiment analysis. IEEE Trans Knowl Data Eng 28(2):496–509
Metadata
Title
Using the Semantic Web as a Source of Training Data
Authors
Christian Bizer
Anna Primpeli
Ralph Peeters
Publication date
15.05.2019
Publisher
Springer Berlin Heidelberg
Published in
Datenbank-Spektrum / Issue 2/2019
Print ISSN: 1618-2162
Electronic ISSN: 1610-1995
DOI
https://doi.org/10.1007/s13222-019-00313-y
