Published in: Artificial Intelligence Review 3/2015

1 March 2015

XML document clustering: techniques and challenges

Authors: Elaheh Asghari, MohammadReza KeyvanPour

Abstract

The increasing availability of heterogeneous XML sources has raised a number of issues concerning how to represent and manage such semi-structured data. In recent years, because managing these resources and extracting knowledge from them has become important, many methods have been proposed to represent and cluster XML documents in different ways. Various similarity measures have been extended, and in some contexts semantic issues have also been taken into account. In this paper, we review XML clustering methods, considering different representations, such as tree-based and vector-based, together with the similarity measures they use. We also propose a taxonomy of these methods.
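
To make the idea of a vector-based representation paired with a similarity measure concrete, here is a minimal Python sketch. It is not one of the methods reviewed in the article: the bag-of-tag-paths encoding, the cosine measure, the helper names `path_vector` and `cosine`, and the three sample documents are all assumptions chosen only for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_vector(xml_text: str) -> Counter:
    """Hypothetical helper: count every root-to-node tag path (a bag of paths)."""
    root = ET.fromstring(xml_text)
    counts: Counter = Counter()

    def walk(node: ET.Element, prefix: str) -> None:
        path = f"{prefix}/{node.tag}"
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse path-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative documents (assumed, not taken from the article).
book_a = "<book><title/><author/><author/></book>"
book_b = "<book><title/><author/></book>"
recipe = "<recipe><title/><step/><step/></recipe>"

sim_books = cosine(path_vector(book_a), path_vector(book_b))
sim_mixed = cosine(path_vector(book_a), path_vector(recipe))
print(f"book vs book:   {sim_books:.3f}")
print(f"book vs recipe: {sim_mixed:.3f}")
```

Under this encoding the two book documents share all of their paths and score close to 1, while the book and the recipe share none and score 0. A clustering algorithm could then group documents by such pairwise scores. Tree-based approaches would instead compare the trees directly, for example with a tree edit distance.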

Metadata
Title
XML document clustering: techniques and challenges
Authors
Elaheh Asghari
MohammadReza KeyvanPour
Publication date
1 March 2015
Publisher
Springer Netherlands
Published in
Artificial Intelligence Review / Issue 3/2015
Print ISSN: 0269-2821
Electronic ISSN: 1573-7462
DOI
https://doi.org/10.1007/s10462-012-9379-2
