Skip to main content
Top

2016 | OriginalPaper | Chapter

SemSynX: Flexible Similarity Analysis of XML Data via Semantic and Syntactic Heterogeneity/Homogeneity Detection

Authors : Jesús M. Almendros-Jiménez, Alfredo Cuzzocrea

Published in: Hybrid Artificial Intelligent Systems

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

In this paper we introduce and experimentally assess SemSynX, a novel technique for supporting similarity analysis of XML data via semantic and syntactic heterogeneity/homogeneity detection. Given two XML trees, SemSynX retrieves a list of semantic and syntactic heterogeneity/homogeneity matches of objects (i.e., elements, values, tags, attributes) occurring in certain paths of the trees. A local score that takes into account the path and value similarity is given for each heterogeneity/homogeneity found. A global score that summarizes the number of equal matches as well as the local scores globally is also provided. The proposed technique is highly customizable, and it permits the specification of thresholds for the requested degree of similarity for paths and values as well as for the degree of relevance for path and value matching. It thus makes possible to “adjust” the similarity analysis depending on the nature of the input XML trees. SemSynX has been implemented in terms of a XQuery library, as to enhance interoperability with other XML processing tools. To complete our analytical contributions, a comprehensive experimental assessment and evaluation of SemSynX over several classes of XML documents is provided.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
The use of keys is not mandatory in our approach, but keys are used in the running example to guide similarity search.
 
Literature
1.
go back to reference Aïtelhadj, A., Boughanem, M., Mezghiche, M., Souam, F.: Using structural similarity for clustering XML documents. Knowl. Inf. Syst. 32(1), 109–139 (2012)CrossRef Aïtelhadj, A., Boughanem, M., Mezghiche, M., Souam, F.: Using structural similarity for clustering XML documents. Knowl. Inf. Syst. 32(1), 109–139 (2012)CrossRef
2.
go back to reference Algergawy, A., Mesiti, M., Nayak, R., Saake, G.: XML data clustering: an overview. ACM Comput. Surv. (CSUR) 43(4), 25 (2011)CrossRefMATH Algergawy, A., Mesiti, M., Nayak, R., Saake, G.: XML data clustering: an overview. ACM Comput. Surv. (CSUR) 43(4), 25 (2011)CrossRefMATH
3.
go back to reference Almendros-Jiménez, J.M., Cuzzocrea, A.: Towards flexible similarity analysis of XML data. In: On the Move to Meaningful Internet Systems: OTM 2015 Workshops, Rhodes, Greece, 26–30 October 2015, pp. 573–576 (2015) Almendros-Jiménez, J.M., Cuzzocrea, A.: Towards flexible similarity analysis of XML data. In: On the Move to Meaningful Internet Systems: OTM 2015 Workshops, Rhodes, Greece, 26–30 October 2015, pp. 573–576 (2015)
4.
go back to reference Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. (CSUR) 41(1), 1 (2008)CrossRef Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. (CSUR) 41(1), 1 (2008)CrossRef
5.
go back to reference Bryl, V., Bizer, C., Isele, R., Verlic, M., Hong, S.G., Jang, S., Yi, M.Y., Choi, K.-S.: Interlinking and knowledge fusion. In: Auer, S., Bryl, V., Tramp, S. (eds.) Linked Open Data. LNCS, vol. 8661, pp. 70–89. Springer, Heidelberg (2014) Bryl, V., Bizer, C., Isele, R., Verlic, M., Hong, S.G., Jang, S., Yi, M.Y., Choi, K.-S.: Interlinking and knowledge fusion. In: Auer, S., Bryl, V., Tramp, S. (eds.) Linked Open Data. LNCS, vol. 8661, pp. 70–89. Springer, Heidelberg (2014)
6.
go back to reference Cannataro, M., Cuzzocrea, A., Mastroianni, C., Ortale, R., Pugliese, A.: Modeling adaptive hypermedia with an object-oriented approach and XML. In: Proceedings of the Second International Workshop on Web Dynamics, pp. 35–44 (2002) Cannataro, M., Cuzzocrea, A., Mastroianni, C., Ortale, R., Pugliese, A.: Modeling adaptive hypermedia with an object-oriented approach and XML. In: Proceedings of the Second International Workshop on Web Dynamics, pp. 35–44 (2002)
7.
go back to reference Cannataro, M., Cuzzocrea, A., Pugliese, A., Bucci, V.P.: A probabilistic approach to model adaptive hypermedia systems. In: Proceedings of the First International Workshop for Web Dynamics, pp. 12–30 (2001) Cannataro, M., Cuzzocrea, A., Pugliese, A., Bucci, V.P.: A probabilistic approach to model adaptive hypermedia systems. In: Proceedings of the First International Workshop for Web Dynamics, pp. 12–30 (2001)
8.
go back to reference Cecchin, F., de Aguiar Ciferri, C.D., Hara, C.S.: XML data fusion. In: Bach Pedersen, T., Mohania, M.K., Tjoa, A.M. (eds.) DAWAK 2010. LNCS, vol. 6263, pp. 297–308. Springer, Heidelberg (2010)CrossRef Cecchin, F., de Aguiar Ciferri, C.D., Hara, C.S.: XML data fusion. In: Bach Pedersen, T., Mohania, M.K., Tjoa, A.M. (eds.) DAWAK 2010. LNCS, vol. 6263, pp. 297–308. Springer, Heidelberg (2010)CrossRef
9.
go back to reference Costa, G., Cuzzocrea, A., Manco, G., Ortale, R.: Data de-duplication: a review. In: Biba, M., Xhafa, F. (eds.) Learning Structure and Schemas from Documents. SCI, vol. 375, pp. 385–412. Springer, Heidelberg (2011)CrossRef Costa, G., Cuzzocrea, A., Manco, G., Ortale, R.: Data de-duplication: a review. In: Biba, M., Xhafa, F. (eds.) Learning Structure and Schemas from Documents. SCI, vol. 375, pp. 385–412. Springer, Heidelberg (2011)CrossRef
10.
go back to reference Cuzzocrea, A.: Combining multidimensional user models and knowledge representation and management techniques for making web services knowledge-aware. Web Intell. Agent Syst. 4(3), 289–312 (2006) Cuzzocrea, A.: Combining multidimensional user models and knowledge representation and management techniques for making web services knowledge-aware. Web Intell. Agent Syst. 4(3), 289–312 (2006)
11.
go back to reference Cuzzocrea, A., Puglisi, P.L.: Record linkage in data warehousing: state-of-the-art analysis and research perspectives. In: Database and Expert Systems Applications, DEXA 2011, International Workshops, Toulouse, France, August 29 – September 2 2011, pp. 121–125 (2011) Cuzzocrea, A., Puglisi, P.L.: Record linkage in data warehousing: state-of-the-art analysis and research perspectives. In: Database and Expert Systems Applications, DEXA 2011, International Workshops, Toulouse, France, August 29 – September 2 2011, pp. 121–125 (2011)
12.
go back to reference Do Nascimento, A.M., Hara, C.S.: A model for XML instance level integration. In: Proceedings of the 23rd Brazilian Symposium on Databases, pp. 46–60. Sociedade Brasileira de Computação (2008) Do Nascimento, A.M., Hara, C.S.: A model for XML instance level integration. In: Proceedings of the 23rd Brazilian Symposium on Databases, pp. 46–60. Sociedade Brasileira de Computação (2008)
13.
go back to reference Dong, X.L., Naumann, F.: Data fusion: resolving data conflicts for integration. Proc. VLDB Endowment 2(2), 1654–1655 (2009)CrossRef Dong, X.L., Naumann, F.: Data fusion: resolving data conflicts for integration. Proc. VLDB Endowment 2(2), 1654–1655 (2009)CrossRef
14.
go back to reference Dorneles, C.F., Gonçalves, R., dos Santos Mello, R.: Approximate data instance matching: a survey. Knowl. Inf. Syst. 27(1), 1–21 (2011)CrossRef Dorneles, C.F., Gonçalves, R., dos Santos Mello, R.: Approximate data instance matching: a survey. Knowl. Inf. Syst. 27(1), 1–21 (2011)CrossRef
15.
go back to reference Geerts, F., Mecca, G., Papotti, P., Santoro, D.: Mapping and cleaning. In: IEEE 30th International Conference on Data Engineering (ICDE 2014), pp. 232–243. IEEE (2014) Geerts, F., Mecca, G., Papotti, P., Santoro, D.: Mapping and cleaning. In: IEEE 30th International Conference on Data Engineering (ICDE 2014), pp. 232–243. IEEE (2014)
16.
go back to reference Hara, C.S., de Aguiar Ciferri, C.D., Ciferri, R.R.: Incremental data fusion based on provenance information. In: Tannen, V., Wong, L., Libkin, L., Fan, W., Tan, W.-C., Fourman, M. (eds.) Buneman Festschrift 2013. LNCS, vol. 8000, pp. 339–365. Springer, Heidelberg (2013)CrossRef Hara, C.S., de Aguiar Ciferri, C.D., Ciferri, R.R.: Incremental data fusion based on provenance information. In: Tannen, V., Wong, L., Libkin, L., Fan, W., Tan, W.-C., Fourman, M. (eds.) Buneman Festschrift 2013. LNCS, vol. 8000, pp. 339–365. Springer, Heidelberg (2013)CrossRef
17.
go back to reference Leung, C.K.-S., Cuzzocrea, A., Jiang, F.: Discovering frequent patterns from uncertain data streams with time-fading and landmark models. T. Large-Scale Data- Knowl. Centered Syst. 8, 174–196 (2013) Leung, C.K.-S., Cuzzocrea, A., Jiang, F.: Discovering frequent patterns from uncertain data streams with time-fading and landmark models. T. Large-Scale Data- Knowl. Centered Syst. 8, 174–196 (2013)
18.
go back to reference Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. In: Soviet physics doklady, vol. 10, p. 707 (1966) Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. In: Soviet physics doklady, vol. 10, p. 707 (1966)
19.
go back to reference Lung, C.-H., Sanaullah, M., Cao, Y., Majumdar, S.: Design and performance evaluation of cloud-based XML publish/subscribe services. In: IEEE International Conference on Services Computing, SCC 2014, Anchorage, AK, USA, June 27 – July 2 2014, pp. 583–589 (2014) Lung, C.-H., Sanaullah, M., Cao, Y., Majumdar, S.: Design and performance evaluation of cloud-based XML publish/subscribe services. In: IEEE International Conference on Services Computing, SCC 2014, Anchorage, AK, USA, June 27 – July 2 2014, pp. 583–589 (2014)
20.
go back to reference Mendes, P.N., Mühleisen, H., Bizer, C.: Sieve: linked data quality assessment and fusion. In: Proceedings of the Joint EDBT/ICDT 2012 Workshops, pp. 116–123. ACM (2012) Mendes, P.N., Mühleisen, H., Bizer, C.: Sieve: linked data quality assessment and fusion. In: Proceedings of the Joint EDBT/ICDT 2012 Workshops, pp. 116–123. ACM (2012)
21.
go back to reference Milano, D., Scannapieco, M., Catarci, T.: Using ontologies for XML data cleaning. In: Meersman, R., Tari, Z. (eds.) OTM-WS 2005. LNCS, vol. 3762, pp. 562–571. Springer, Heidelberg (2005)CrossRef Milano, D., Scannapieco, M., Catarci, T.: Using ontologies for XML data cleaning. In: Meersman, R., Tari, Z. (eds.) OTM-WS 2005. LNCS, vol. 3762, pp. 562–571. Springer, Heidelberg (2005)CrossRef
22.
go back to reference Oliveira, P., de Fatima Rodrigues, M., Henriques, P.R.: An ontology-based approach for data cleaning. In: ICIQ, pp. 307–320 (2006) Oliveira, P., de Fatima Rodrigues, M., Henriques, P.R.: An ontology-based approach for data cleaning. In: ICIQ, pp. 307–320 (2006)
23.
go back to reference Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)CrossRefMATH Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)CrossRefMATH
24.
go back to reference Sundaram, S., Kumar, S.: Madria.: a change detection system for unordered XML data using a relational model. Data Knowl. Eng. 72, 257–284 (2012)CrossRef Sundaram, S., Kumar, S.: Madria.: a change detection system for unordered XML data using a relational model. Data Knowl. Eng. 72, 257–284 (2012)CrossRef
25.
go back to reference Tekli, J., Chbeir, R., Yetongnon, K.: An overview on XML similarity: background, current trends and future directions. Comput. Sci. Rev. 3(3), 151–173 (2009)CrossRefMATH Tekli, J., Chbeir, R., Yetongnon, K.: An overview on XML similarity: background, current trends and future directions. Comput. Sci. Rev. 3(3), 151–173 (2009)CrossRefMATH
26.
go back to reference Weis, M., Manolescu, I.: Declarative XML data cleaning with XClean. In: Krogstie, J., Opdahl, A.L., Sindre, G. (eds.) CAiSE 2007 and WES 2007. LNCS, vol. 4495, pp. 96–110. Springer, Heidelberg (2007)CrossRef Weis, M., Manolescu, I.: Declarative XML data cleaning with XClean. In: Krogstie, J., Opdahl, A.L., Sindre, G. (eds.) CAiSE 2007 and WES 2007. LNCS, vol. 4495, pp. 96–110. Springer, Heidelberg (2007)CrossRef
27.
go back to reference Weis, M., Naumann, F.: Detecting duplicates in complex XML data. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, pp. 109–109. IEEE (2006) Weis, M., Naumann, F.: Detecting duplicates in complex XML data. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, pp. 109–109. IEEE (2006)
28.
go back to reference Winkler, W.E.: The state of record linkage and current research problems. In: Statistical Research Division, US Census Bureau (1999) Winkler, W.E.: The state of record linkage and current research problems. In: Statistical Research Division, US Census Bureau (1999)
29.
go back to reference Yaguinuma, C.A., Afonso, G.F., Ferraz, V., Borges, S., Santos, M.T.: A fuzzy ontology-based semantic data integration system. J. Inf. Knowl. Manage. 10(03), 285–299 (2011)CrossRef Yaguinuma, C.A., Afonso, G.F., Ferraz, V., Borges, S., Santos, M.T.: A fuzzy ontology-based semantic data integration system. J. Inf. Knowl. Manage. 10(03), 285–299 (2011)CrossRef
30.
go back to reference Zhang, D., Song, T., He, J., Shi, X., Dong, Y.: A similarity-oriented RDF graph matching algorithm for ranking linked data. In: 2012 IEEE 12th International Conference on Computer and Information Technology (CIT), pp. 427–434. IEEE (2012) Zhang, D., Song, T., He, J., Shi, X., Dong, Y.: A similarity-oriented RDF graph matching algorithm for ranking linked data. In: 2012 IEEE 12th International Conference on Computer and Information Technology (CIT), pp. 427–434. IEEE (2012)
Metadata
Title
SemSynX: Flexible Similarity Analysis of XML Data via Semantic and Syntactic Heterogeneity/Homogeneity Detection
Authors
Jesús M. Almendros-Jiménez
Alfredo Cuzzocrea
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-32034-2_2

Premium Partner