Published in: Journal on Data Semantics 3-4/2021

06-07-2021 | Original Article

DeepEx: A Robust Weak Supervision System for Knowledge Base Augmentation

Authors: Johny Moreira, Luciano Barbosa

Abstract

Knowledge bases support data organization and exploration, making it easier to understand the semantics of data and for machines to use it. Traditional strategies for knowledge base construction and augmentation have mostly relied on manual effort or on automatic extraction of content from structured and semi-structured sources. In this work, we present DeepEx, a system that autonomously extracts missing attributes of entities in knowledge bases from unstructured text. We use Wikipedia as the data source. Given entities on Wikipedia represented by their articles (text and infobox), DeepEx uses a classifier to detect sentences in the articles that mention the possibly missing attributes of the entities, and then applies a deep-learning extraction model to those sentences to identify the attributes. The sentence classifier and the attribute extractor are trained on labels produced automatically by a weak supervision approach that uses the structured infobox information as the supervision source. We compared our strategy with previous approaches to this problem on 29 different attributes from 4 domains. The results show that our extraction pipeline achieved statistically superior performance compared with some baselines and variations of our approach.
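The weak supervision idea in the abstract, using infobox values to label article sentences automatically, can be illustrated with a minimal sketch. This is not the DeepEx implementation: the exact string matching, the naive sentence splitter, and the names `split_sentences`, `weak_label`, and `birth_place` are all assumptions made for illustration only.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def weak_label(article_text, infobox):
    """Label each sentence with the infobox attributes whose value it mentions.

    Returns a list of (sentence, spans) pairs, where spans maps an attribute
    name to the (start, end) character span of its value in the sentence.
    Sentences with an empty spans dict serve as negative examples.
    Matching is case-insensitive exact substring search (an assumption;
    a real system would likely need fuzzier matching).
    """
    labeled = []
    for sentence in split_sentences(article_text):
        lowered = sentence.lower()
        spans = {}
        for attribute, value in infobox.items():
            start = lowered.find(value.lower())
            if start != -1:
                spans[attribute] = (start, start + len(value))
        labeled.append((sentence, spans))
    return labeled


# Hypothetical article text and infobox, for illustration only.
article = (
    "Ada Lovelace was born in London. "
    "She worked with Charles Babbage on the Analytical Engine."
)
infobox = {"birth_place": "London"}
labels = weak_label(article, infobox)
```

In a sketch like this, sentences with at least one matched attribute could serve as positive examples for a sentence classifier, and the character spans could serve as token-level targets for an extraction model, without any manual annotation.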

Metadata
Title
DeepEx: A Robust Weak Supervision System for Knowledge Base Augmentation
Authors
Johny Moreira
Luciano Barbosa
Publication date
06-07-2021
Publisher
Springer Berlin Heidelberg
Published in
Journal on Data Semantics / Issue 3-4/2021
Print ISSN: 1861-2032
Electronic ISSN: 1861-2040
DOI
https://doi.org/10.1007/s13740-021-00134-x
