2025 | OriginalPaper | Book chapter

Exploiting Distant Supervision to Learn Semantic Descriptions of Tables with Overlapping Data

Authors: Binh Vu, Craig A. Knoblock, Basel Shbita, Fandel Lin

Published in: The Semantic Web – ISWC 2024

Publisher: Springer Nature Switzerland

Abstract

Understanding the semantic structure of tabular data is essential for data integration and discovery. Specifically, the goal is to annotate columns in a tabular source with types and relationships between them using classes and predicates of a target ontology. Previous work that exploits the matches between entities in a knowledge graph and the table data does not perform well for tables with noisy or ambiguous data. A key reason for this poor performance is the limited amount of labeled data to train these methods. To address this problem, we propose a novel distant supervision approach that leverages existing Wikipedia tables and hyperlinks to automatically label tables with their semantic descriptions. Then, we use the labeled dataset to train neural network models to predict the semantic description of a new table. Our empirical evaluation shows that using the automatically labeled dataset provides approximately 5% improvement in column type prediction and 4.5% improvement in column relationship prediction in F1 scores over the state-of-the-art on a large set of real-world tables.
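
The labeling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each hyperlinked cell has already been resolved to a knowledge graph entity, that a column's type is chosen by majority vote over the classes of its linked entities, and that a relationship between two columns is chosen by majority vote over the predicates connecting entities in the same row. The toy knowledge graph, the identifiers, the function names, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical toy knowledge graph; every identifier here is made up for illustration.
ENTITY_TYPES = {
    "ent:Paris": "Class:City",
    "ent:Lyon": "Class:City",
    "ent:Berlin": "Class:City",
    "ent:France": "Class:Country",
    "ent:Germany": "Class:Country",
}
TRIPLES = {
    ("ent:Paris", "pred:country", "ent:France"),
    ("ent:Lyon", "pred:country", "ent:France"),
    ("ent:Berlin", "pred:country", "ent:Germany"),
}


def label_column_type(column, min_support=0.5):
    """Return the most frequent class among linked entities, or None if support is too low.

    `column` holds one entity id per row, or None where the cell has no hyperlink.
    """
    types = [ENTITY_TYPES[e] for e in column if e in ENTITY_TYPES]
    if not types:
        return None
    label, count = Counter(types).most_common(1)[0]
    return label if count / len(column) >= min_support else None


def label_relationship(source_col, target_col, min_support=0.5):
    """Return the predicate that most often links row-aligned entities, or None."""
    votes = Counter()
    for s, t in zip(source_col, target_col):
        for subj, pred, obj in TRIPLES:
            if subj == s and obj == t:
                votes[pred] += 1
    if not votes:
        return None
    pred, count = votes.most_common(1)[0]
    return pred if count / len(source_col) >= min_support else None


# A table with two columns; the last city cell has no hyperlink.
city_col = ["ent:Paris", "ent:Lyon", "ent:Berlin", None]
country_col = ["ent:France", "ent:France", "ent:Germany", "ent:Germany"]

print(label_column_type(city_col))                # Class:City
print(label_column_type(country_col))             # Class:Country
print(label_relationship(city_col, country_col))  # pred:country
```

Labels produced this way are noisy by construction, which is why a support threshold is used here; the published method may filter or weight the labels differently.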

Footnotes
1
A plain table does not contain any markup such as hyperlinks.
 
2
We normalize a header by masking numbers, removing special characters, etc.
 
3
We use the pretrained all-mpnet-base-v2 model.
 
4
The p-value of the sign test [11, 38] on the accuracies of the two systems is 0.086.
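
For readers unfamiliar with the sign test mentioned in footnote 4, the following is a minimal sketch of an exact two-sided version. The footnote does not state whether the test was one-sided or two-sided, nor how the paired comparisons were formed, so both choices here are assumptions; the function names and the accuracy values are hypothetical and do not reproduce the reported p-value.

```python
from math import comb


def count_signs(acc_a, acc_b):
    """Count pairs where system A beats system B and vice versa; ties are dropped."""
    wins = sum(a > b for a, b in zip(acc_a, acc_b))
    losses = sum(a < b for a, b in zip(acc_a, acc_b))
    return wins, losses


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test under the null hypothesis P(win) = 0.5."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of k or fewer successes under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-table accuracies for two systems (made-up numbers).
system_a = [0.9, 0.8, 0.7, 0.95, 0.6, 0.85]
system_b = [0.8, 0.7, 0.75, 0.9, 0.6, 0.8]

wins, losses = count_signs(system_a, system_b)
print(wins, losses, sign_test_p_value(wins, losses))
```

The test uses only the direction of each paired difference, not its size, so it makes no assumption about how the accuracy differences are distributed.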
 
References
2.
Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss Markov random fields and probabilistic soft logic. J. Mach. Learn. Res. 18(109), 1–67 (2017)
4.
Chen, J., Jiménez-Ruiz, E., Horrocks, I., Sutton, C.: ColNet: embedding the semantics of web tables for column type prediction. AAAI 33(01), 29–36 (2019)
5.
Chen, J., Jiménez-Ruiz, E., Horrocks, I., Sutton, C.: Learning semantic annotations for tabular data. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, California (2019). https://doi.org/10.24963/ijcai.2019/289
9.
Dasoulas, I., Yang, D., Duan, X., Dimou, A.: TorchicTab: semantic table annotation with Wikidata and language models. In: CEUR Workshop Proceedings, pp. 21–37 (2023)
10.
Deng, X., Sun, H., Lees, A., Wu, Y., Yu, C.: TURL: table understanding through representation learning (2020)
11.
Dixon, W.J., Mood, A.M.: The statistical sign test. J. Am. Stat. Assoc. 41(236), 557–566 (1946)
12.
Efthymiou, V., Hassanzadeh, O., Rodriguez-Muro, M., Christophides, V.: Matching web tables with knowledge base entities: from entity lookups to entity embeddings. In: d’Amato, C., Fernandez, M., Tamma, V., Lecue, F., Cudré-Mauroux, P., Sequeda, J., Lange, C., Heflin, J. (eds.) ISWC 2017. LNCS, vol. 10587, pp. 260–277. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68288-4_16
13.
Feng, Z.W., et al.: Automatic semantic modeling for structural data source with the prior knowledge from knowledge graph (2021)
15.
Hassanzadeh, O., et al.: Results of SemTab 2023. In: CEUR Workshop Proceedings, vol. 3557, pp. 1–14 (2023)
18.
Henriksen, E.G., Khorsid, A.M., Nielsen, E., Stück, A.M., Sørensen, A.S., Pelgrin, O.: Semtex: a hybrid approach for semantic table interpretation (2023)
19.
Hulsebos, M., et al.: Sherlock: a deep learning approach to semantic data type detection. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’19, pp. 1500–1508. Association for Computing Machinery, New York, NY, USA (2019)
20.
Huynh, V.P., Chabot, Y., Labbé, T., Liu, J., Troncy, R.: From heuristics to language models: a journey through the universe of semantic table interpretation with DAGOBAH. In: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab) (2022)
21.
22.
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014)
23.
Korini, K., Peeters, R., Bizer, C.: Sotab: the WDC schema.org table annotation benchmark. In: CEUR Workshop Proceedings, vol. 3320, pp. 14–19. RWTH Aachen (2022)
25.
Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proceedings of the VLDB Endowment, vol. 3, pp. 1338–1347. VLDB Endowment (2010)
26.
Liu, J., Chabot, Y., Troncy, R., Huynh, V.P., Labbé, T., Monnin, P.: From tabular data to knowledge graphs: a survey of semantic table interpretation tasks and methods. J. Web Semant. 76, 100761 (2023)
27.
Luzuriaga, J., Munoz, E., Rosales-Mendez, H., Hogan, A.: Merging web tables for relation extraction with knowledge graphs. IEEE Trans. Knowl. Data Eng. 1 (2021)
31.
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks (2019)
32.
Ritze, D., Lehmberg, O., Bizer, C.: Matching HTML tables to DBpedia. In: Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics, pp. 1–6. No. Article 10 in WIMS ’15. Association for Computing Machinery, New York, NY, USA (2015)
34.
Suhara, Y., et al.: Annotating columns with pre-trained language models. In: Proceedings of the 2022 International Conference on Management of Data. ACM, New York, NY, USA (2022)
35.
Taheriyan, M., Knoblock, C.A., Szekely, P., Ambite, J.L.: Learning the semantics of structured data sources. J. Web Semant. 37–38, 152–169 (2016)
36.
Vu, B., Knoblock, C., Pujara, J.: Learning semantic models of data sources using probabilistic graphical models. In: The World Wide Web Conference. WWW ’19, pp. 1944–1953. Association for Computing Machinery, New York, NY, USA (2019)
38.
39.
Zhang, Z.: Effective and efficient semantic table interpretation using TableMiner+. Semant. Web 8(6), 921–957 (2017)
Metadata
Title
Exploiting Distant Supervision to Learn Semantic Descriptions of Tables with Overlapping Data
Authors
Binh Vu
Craig A. Knoblock
Basel Shbita
Fandel Lin
Copyright year
2025
DOI
https://doi.org/10.1007/978-3-031-77850-6_7