2020 | OriginalPaper | Chapter

ODArchive – Creating an Archive for Structured Data from Open Data Portals

Authors: Thomas Weber, Johann Mitöhner, Sebastian Neumaier, Axel Polleres

Published in: The Semantic Web – ISWC 2020

Publisher: Springer International Publishing

Abstract

We present ODArchive, a large corpus of structured data collected from over 260 Open Data portals worldwide, along with curated, integrated metadata. Furthermore, we enrich the harvested datasets with heuristic annotations using the type hierarchies in existing Knowledge Graphs. We (i) present the underlying distributed architecture, which scales up regular harvesting and the monitoring of changes on these portals, (ii) make the corpus available via different APIs, and (iii) analyse the characteristics of tabular data within the corpus. Our APIs can be used to run such analyses regularly or to reproduce experiments from the literature that were conducted on static corpora that are not publicly available.

Footnotes
1
https://ckan.org/, accessed 2020-08-17.
 
2
Overall, we monitor or have historically monitored over 260 portals; however, several of these have since gone offline or are so-called “harvesting” portals that merely replicate metadata from other portals. For details, cf. [14].
 
5
https://kubernetes.io/, accessed 2020-05-22.
 
6
To filter datasets by data portal, we enriched the descriptions with information collected in Portal Watch (https://data.wu.ac.at/portalwatch/), using arc:hasPortal to add this reference. More sophisticated federated queries could be formulated by including the Portal Watch endpoint [14], which contains additional metadata.
 
7
The respective information was extracted from the most recent DBpedia and Wikidata HDT [4] dumps available at http://www.rdfhdt.org/datasets/.
 
8
While this needs further investigation, and obviously more sophisticated matching techniques (substring- or similarity-based), we note that this low percentage seems to hint that the specific textual information in OD tables is not necessarily covered by the more general, encyclopedic knowledge typical of public KGs.
 
9
E.g., “Ja” and “Nein” (German for “yes” and “no”) are labels for entities in Wikidata.
 
Literature
3.
Brickley, D., Burgess, M., Noy, N.F.: Google dataset search: building a search engine for datasets in an open web ecosystem. In: The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, 13–17 May 2019, pp. 1365–1375. ACM (2019). https://doi.org/10.1145/3308558.3313685
12.
Neumaier, S., Umbrich, J.: Measures for assessing the data freshness in open data portals. In: 2nd International Conference on Open and Big Data, OBD 2016, Vienna, Austria, 22–24 August 2016, pp. 17–24. IEEE Computer Society (2016). https://doi.org/10.1109/OBD.2016.10
Metadata
DOI
https://doi.org/10.1007/978-3-030-62466-8_20