2025 | OriginalPaper | Chapter

DUNKS: Chunking and Summarizing Large and Heterogeneous Data for Dataset Search

Authors: Qiaosheng Chen, Xiao Zhou, Zhiyang Zhang, Gong Cheng

Published in: The Semantic Web – ISWC 2024

Publisher: Springer Nature Switzerland

With the vast influx of open data on the Web, dataset search has become a trending research problem which is crucial to data discovery and reuse. Existing methods for dataset search either employ only the unstructured metadata of datasets but ignore their actual data, or cater to structured data in a single format such as RDF despite the diverse formats of open data. In this paper, to address the magnitude of large datasets, we decompose RDF data into data chunks, and then, to accommodate big chunks to the limited input capacity of dense ranking models based on pre-trained language models, we propose a multi-chunk summarization method that extracts representative data from representative chunks. Moreover, to handle heterogeneous data formats beyond RDF, we transform other formats into chunks to be processed in a uniform way. Experiments on two test collections for dataset search demonstrate the effectiveness of our dense ranking over summarized data chunks.
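
The chunk-then-summarize idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify how chunks are formed or how representativeness is measured, so grouping triples by subject, the greedy predicate-coverage heuristic, and all names (`chunk_by_subject`, `select_chunks`, `summarize_chunk`, the `ex:` identifiers) are assumptions made purely for illustration.

```python
from collections import defaultdict


def chunk_by_subject(triples):
    """Group (s, p, o) triples into chunks keyed by subject.

    Assumption: one chunk per subject; the paper may chunk differently.
    """
    chunks = defaultdict(list)
    for s, p, o in triples:
        chunks[s].append((s, p, o))
    return dict(chunks)


def select_chunks(chunks, k):
    """Greedily pick up to k chunk keys, each adding the most unseen predicates.

    Assumption: predicate coverage stands in for 'representative chunks'.
    """
    covered, selected = set(), []
    remaining = dict(chunks)

    def gain(key):
        return len({p for _, p, _ in remaining[key]} - covered)

    while remaining and len(selected) < k:
        best = max(remaining, key=gain)  # ties resolve to the earliest chunk
        if gain(best) == 0:
            break  # no remaining chunk adds a new predicate
        covered |= {p for _, p, _ in remaining.pop(best)}
        selected.append(best)
    return selected


def summarize_chunk(chunk, m):
    """Keep at most m triples, taking the first triple seen for each predicate."""
    seen, kept = set(), []
    for s, p, o in chunk:
        if len(kept) == m:
            break
        if p not in seen:
            seen.add(p)
            kept.append((s, p, o))
    return kept


# Hypothetical toy data, not taken from the paper's test collections.
triples = [
    ("ex:d1", "rdf:type", "dcat:Dataset"),
    ("ex:d1", "dct:title", "Air quality"),
    ("ex:d1", "dct:title", "Luftqualitaet"),
    ("ex:s1", "rdf:type", "ex:Station"),
    ("ex:s1", "ex:locatedIn", "ex:Berlin"),
    ("ex:s2", "rdf:type", "ex:Station"),
]

chunks = chunk_by_subject(triples)
picked = select_chunks(chunks, k=2)
summary = [t for key in picked for t in summarize_chunk(chunks[key], m=2)]
```

In this sketch the budget parameters `k` and `m` play the role of the limited input capacity of the ranking model: the summary never holds more than `k * m` triples, whatever the size of the dataset.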
