The VLDB Journal

1 December 2016 | Regular Paper

ScaLeKB: scalable learning and inference over large knowledge bases

Authors: Yang Chen, Daisy Zhe Wang, Sean Goldberg

Published in: The VLDB Journal | Issue 6/2016

Abstract

Recent years have seen a drastic rise in the construction of web knowledge bases (e.g., Freebase, YAGO, DBpedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge, web corpora, and information extraction algorithms, the knowledge bases are still far from complete. To infer the missing knowledge, we propose the Ontological Pathfinding (OP) algorithm to mine first-order inference rules from these web knowledge bases. The OP algorithm scales up via a series of optimization techniques, including a new parallel rule mining algorithm, a pruning strategy to eliminate unsound and inefficient rules before applying them, and a novel partitioning algorithm to break the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale.

Based on the mining algorithm and the optimizations, we develop an efficient inference engine. As a result, we infer 0.9 billion new facts from Freebase in 17.19 hours. We use cross validation to evaluate the inferred facts and estimate a degree of expansion of 0.6 over Freebase, with a precision approaching 1.0. Our approach outperforms state-of-the-art mining algorithms and inference engines in terms of both performance and quality.
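To make the notion of a first-order inference rule concrete, the following is a minimal sketch in Python. It is not the OP algorithm and includes none of its parallelization, pruning, or partitioning. It only shows how one rule of the form head(x, y) <- body1(x, z), body2(z, y) could be applied to a toy knowledge base and scored. The facts, the predicate names, the function names `apply_rule` and `score_rule`, and the simple support and confidence definitions are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy knowledge base of (predicate, subject, object) facts. Illustrative only.
FACTS = {
    ("bornIn", "alice", "paris"),
    ("bornIn", "bob", "lyon"),
    ("bornIn", "carol", "rome"),
    ("cityIn", "paris", "france"),
    ("cityIn", "lyon", "france"),
    ("cityIn", "rome", "italy"),
    ("nationality", "alice", "france"),
    ("nationality", "carol", "spain"),
}

def apply_rule(facts, head, body1, body2):
    """Derive head(x, y) from body1(x, z) AND body2(z, y) via a hash join on z."""
    # Index the second body predicate by its subject so the join is a lookup.
    by_subject = defaultdict(set)
    for pred, s, o in facts:
        if pred == body2:
            by_subject[s].add(o)
    derived = set()
    for pred, x, z in facts:
        if pred == body1:
            for y in by_subject.get(z, ()):
                derived.add((head, x, y))
    return derived

def score_rule(facts, head, body1, body2):
    """Return (support, confidence, new_facts) for one candidate rule.

    support    = derived facts that already exist in the knowledge base
    confidence = support / all derived facts (an assumed, simplified measure)
    new_facts  = derived facts that the knowledge base does not contain yet
    """
    derived = apply_rule(facts, head, body1, body2)
    support = len(derived & facts)
    confidence = support / len(derived) if derived else 0.0
    return support, confidence, derived - facts

support, confidence, new_facts = score_rule(FACTS, "nationality", "bornIn", "cityIn")
```

On this toy data the rule nationality(x, y) <- bornIn(x, z), cityIn(z, y) derives three facts, one of which is already present, so a low-confidence rule like this is the kind a pruning step would be expected to filter out before inference.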

http://dsr.cise.ufl.edu/projects/probkb-web-scale-probabilistic-knowledge-base.

In Freebase, domains are used to conceptually organize the types. We do not use this terminology elsewhere in the paper.

Title: ScaLeKB: scalable learning and inference over large knowledge bases
Authors: Yang Chen, Daisy Zhe Wang, Sean Goldberg
Publication date: 1 December 2016
Publisher: Springer Berlin Heidelberg
Published in: The VLDB Journal / Issue 6/2016
Print ISSN: 1066-8888
Electronic ISSN: 0949-877X
DOI: https://doi.org/10.1007/s00778-016-0444-3
