Article

Xproj: a framework for projected structural clustering of xml documents

Authors:
Charu C. Aggarwal, IBM
Na Ta, Tsinghua University
Jianyong Wang, Tsinghua University
Jianhua Feng, Tsinghua University
Mohammed Zaki, Rensselaer Polytechnic Institute

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2007, Pages 46–55
https://doi.org/10.1145/1281192.1281201

Published: 12 August 2007

ABSTRACT

XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic also makes such data sets challenging for a variety of data mining problems. One such problem is clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. In this paper, we propose an effective clustering algorithm for XML data which uses substructures of the documents in order to gain insights about the important underlying structures. We propose new ways of using multiple sub-structural information in XML documents to evaluate the quality of intermediate cluster solutions, and guide the algorithms to a final solution which reflects the true structural behavior in individual partitions. We test the algorithm on a variety of real and synthetic data sets.
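
The abstract gives no algorithmic detail, so the following is only a toy sketch of the general idea it describes: partition documents, summarize each partition by the substructures that are frequent within it, and reassign documents according to how well they match those summaries. It is not the XProj algorithm. As a simplifying assumption, a "substructure" here is just a root-to-node tag path, and every name (`tag_paths`, `frequent_paths`, `similarity`, `cluster`, `min_support`) is hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths found in one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def frequent_paths(docs, min_support):
    """Paths that occur in at least a min_support fraction of docs."""
    if not docs:
        return set()
    counts = Counter(p for d in docs for p in d)
    return {p for p, c in counts.items() if c / len(docs) >= min_support}


def similarity(doc, representatives):
    """Fraction of a cluster's representative paths that doc contains."""
    if not representatives:
        return 0.0
    return len(doc & representatives) / len(representatives)


def cluster(docs, seeds, min_support=0.5, iterations=10):
    """Assumed k-means-style loop: assign each doc to the best-matching
    cluster, then recompute each cluster's frequent paths."""
    reps = [set(docs[s]) for s in seeds]
    labels = None
    for _ in range(iterations):
        new_labels = [
            max(range(len(reps)), key=lambda k: similarity(d, reps[k]))
            for d in docs
        ]
        if new_labels == labels:  # assignments stable, stop early
            break
        labels = new_labels
        for k in range(len(reps)):
            members = [d for d, lab in zip(docs, labels) if lab == k]
            if members:
                reps[k] = frequent_paths(members, min_support)
    return labels


if __name__ == "__main__":
    raw = [
        "<book><title/><author/></book>",
        "<order><item/><price/></order>",
        "<book><title/><author/><isbn/></book>",
        "<order><item/><price/><qty/></order>",
    ]
    docs = [tag_paths(x) for x in raw]
    print(cluster(docs, seeds=[0, 1]))
```

In this sketch the per-cluster set of frequent paths plays the role of the "important underlying structures", and the similarity score stands in for a quality measure on intermediate solutions. A faithful implementation would mine richer substructures than single paths.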

Index Terms

Xproj: a framework for projected structural clustering of xml documents
1. Information systems
  1. Information systems applications
    1. Data mining

Published in
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN: 9781595936097
DOI: 10.1145/1281192
General Chair: Pavel Berkhin (Yahoo!, USA)
Program Chairs: Rich Caruana (Cornell University, USA), Xindong Wu (University of Vermont, USA)
Copyright © 2007 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
XML
clustering
Qualifiers
- Article
Conference

Acceptance Rates
KDD '07 Paper Acceptance Rate: 111 of 573 submissions, 19%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%