Published in: Journal on Data Semantics 4/2016

01-12-2016 | Original Article

Providing Insight into Data Source Topics

Authors: Sonia Bergamaschi, Davide Ferrari, Francesco Guerra, Giovanni Simonini, Yannis Velegrakis

Abstract

A fundamental service for exploiting the large data sources now available online is the ability to identify the topics of the data they contain. Unfortunately, the heterogeneity of these sources and the lack of centralized control make it difficult to identify the topics directly from the actual values used in the sources. We present an approach that generates a signature for each source and matches it against the signatures of the concepts in a reference vocabulary, producing a description of the topics of the source in terms of that vocabulary. The reference vocabulary may be provided ready-made, created manually, or created by applying our signature-generation algorithm to a well-curated data source whose topics are clearly identified. In our case, we used DBpedia to create the vocabulary, since it is one of the largest known collections of entities and concepts. The signatures are generated by exploiting the entropy and the mutual information of the attributes of the sources to produce semantic identifiers of the individual attributes, which together form a unique signature of the concepts (i.e., the topics) of the source. Because the identifiers are based on the entropy of the attribute values, they are independent of the naming heterogeneity of attributes or tables. Although the use of traditional information-theoretic quantities such as entropy and mutual information is not new, these quantities may become untrustworthy because they are sensitive to overfitting and require a number of samples equal to that used to construct the reference vocabulary. To overcome these limitations, we normalize and use pseudo-additive entropy measures, which automatically downweight vocabulary items and property values with very low frequencies, yielding a more stable solution than the traditional counterparts. We have implemented our theory in a system called WHATSIT, and we experimentally demonstrate its effectiveness.
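The quantities named in the abstract can be illustrated with a minimal Python sketch. It computes the Shannon entropy of an attribute's values, the mutual information between two attributes, and one well-known pseudo-additive entropy, the Tsallis (Havrda-Charvát type) entropy, normalized by its maximum. This is an illustration under stated assumptions, not the WHATSIT implementation: the function names, the choice of the Tsallis form, the parameter `q = 2`, and the normalization by the uniform-distribution maximum are all assumptions made here, and the paper's exact definitions may differ.

```python
import math
from collections import Counter


def value_probabilities(values):
    """Relative frequency of each distinct value in a column."""
    counts = Counter(values)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical value distribution."""
    return sum(-p * math.log2(p) for p in value_probabilities(values))


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned columns."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same number of rows")
    joint = list(zip(xs, ys))
    return shannon_entropy(xs) + shannon_entropy(ys) - shannon_entropy(joint)


def tsallis_entropy(values, q=2.0):
    """Tsallis entropy (1 - sum p^q) / (q - 1); q is an assumed parameter.

    For q > 1 a rare value contributes p^q, which shrinks faster than p,
    so very low frequencies have little influence on the result.
    """
    if q == 1.0:
        raise ValueError("q = 1 is the Shannon limit; use shannon_entropy")
    probs = value_probabilities(values)
    if not probs:
        return 0.0
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def normalized_tsallis_entropy(values, q=2.0):
    """Tsallis entropy divided by its maximum over k distinct values."""
    k = len(set(values))
    if k <= 1:
        return 0.0
    max_entropy = (1.0 - k ** (1.0 - q)) / (q - 1.0)
    return tsallis_entropy(values, q) / max_entropy
```

As a usage example, for a skewed column such as `["a"] * 97 + ["b", "c", "d"]`, the normalized Tsallis entropy is smaller than the Shannon entropy divided by `log2(4)`, which reflects how the three rare values are downweighted. Because the measures depend only on value frequencies, renaming the column or the table does not change them.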

Footnotes
3. For the reference ontology, the maximum entropies and entropy variances are stored in \(GE_{index}\), while for the target source these measures have to be computed at runtime.
5. Although the table shows the analysis for only a few properties belonging to three classes, we performed the experiment over 50+ properties belonging to 10+ classes, obtaining results entirely similar to those shown.
6. For the sake of simplicity, the table shows the analysis for only a few classes. Nevertheless, we performed the experiment over 50+ classes, and the results showed trends similar to those represented.

Metadata
Title: Providing Insight into Data Source Topics
Authors: Sonia Bergamaschi, Davide Ferrari, Francesco Guerra, Giovanni Simonini, Yannis Velegrakis
Publication date: 01-12-2016
Publisher: Springer Berlin Heidelberg
Published in: Journal on Data Semantics, Issue 4/2016
Print ISSN: 1861-2032
Electronic ISSN: 1861-2040
DOI: https://doi.org/10.1007/s13740-016-0063-6
