ABSTRACT
XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic of data sets also makes XML data challenging for a variety of data mining problems. One such problem is clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. In this paper, we propose an effective clustering algorithm for XML data which uses substructures of the documents in order to gain insights about the important underlying structures. We propose new ways of using multiple sub-structural information in XML documents to evaluate the quality of intermediate cluster solutions, and guide the algorithms to a final solution which reflects the true structural behavior in individual partitions. We test the algorithm on a variety of real and synthetic data sets.
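The idea of scoring a document against a cluster through shared substructures can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper: as a simplifying assumption, short downward tag paths stand in for mined frequent substructures, and the names `tag_paths`, `frequent_substructures`, `similarity`, `max_len`, and `min_support` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text, max_len=3):
    """Return the set of downward tag paths (at most max_len tags) in a document.

    Assumption: tag paths are a simple stand-in for tree substructures.
    """
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, trail):
        trail = trail + [node.tag]
        # Record every path that ends at this node, capped at max_len tags.
        for k in range(1, min(max_len, len(trail)) + 1):
            paths.add("/".join(trail[-k:]))
        for child in node:
            walk(child, trail)

    walk(root, [])
    return paths


def frequent_substructures(cluster_docs, min_support):
    """Paths present in at least a min_support fraction of the cluster's documents."""
    counts = Counter()
    for doc_paths in cluster_docs:
        counts.update(doc_paths)  # each document contributes a path at most once
    threshold = min_support * len(cluster_docs)
    return {p for p, c in counts.items() if c >= threshold}


def similarity(doc_paths, representatives):
    """Fraction of a cluster's representative substructures found in the document."""
    if not representatives:
        return 0.0
    return len(doc_paths & representatives) / len(representatives)


if __name__ == "__main__":
    books = [
        "<book><title/><author><name/></author></book>",
        "<book><title/><author><name/></author><year/></book>",
    ]
    order = "<order><item><price/></item></order>"
    reps = frequent_substructures([tag_paths(b) for b in books], min_support=1.0)
    print(similarity(tag_paths(books[0]), reps))
    print(similarity(tag_paths(order), reps))
```

In this sketch a document is compared only against a small set of cluster-representative substructures rather than against every other document, which is one way to sidestep the high implicit dimensionality the abstract describes.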
Index Terms
- XProj: a framework for projected structural clustering of XML documents