ABSTRACT
In many applications, attribute and relationship data are available, carrying complementary information about real-world entities. In such cases, a joint analysis of both types of data can yield more accurate results than classical clustering algorithms that use either only attribute data or only relationship (graph) data. The Connected k-Center (CkC) problem has been proposed as the first joint cluster analysis model to discover k clusters that are cohesive on both attribute and relationship data. However, prior knowledge of the number of clusters is often unavailable in applications such as community identification and hotspot analysis. In this paper, we introduce and formalize the problem of discovering an a-priori unspecified number of clusters in the context of joint cluster analysis of attribute and relationship data, called the Connected X Clusters (CXC) problem. True clusters are assumed to be compact and distinctive from their neighboring clusters in terms of attribute data, and internally connected in terms of relationship data. Unlike classical attribute-based clustering methods, the neighborhood of clusters is defined not in terms of attribute data but in terms of relationship data. To efficiently solve the CXC problem, we present JointClust, an algorithm that adopts a dynamic two-phase approach. In the first phase, we find so-called cluster atoms. We provide a probability analysis for this phase, which gives a probabilistic guarantee that each true cluster is represented by at least one of the initial cluster atoms. In the second phase, these cluster atoms are merged in a bottom-up manner, resulting in a dendrogram. The final clustering is determined by our objective function. Our experimental evaluation on several real datasets demonstrates that JointClust discovers meaningful and accurate clusterings without requiring the user to specify the number of clusters.
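The merging phase described above can be illustrated with a small sketch. The following is not the authors' implementation of JointClust; it is a minimal, hypothetical Python rendering of the second phase only, assuming cluster atoms are already given as node sets, attributes are numeric vectors, and the relationship graph is an adjacency dictionary. It merges only graph-connected pairs of clusters (preserving internal connectivity) in order of attribute-centroid distance, recording the merge sequence as a dendrogram; the objective-function-based cut of the dendrogram is omitted.

```python
from itertools import combinations

def centroid(cluster, attrs):
    """Mean attribute vector of a cluster of node ids."""
    dim = len(next(iter(attrs.values())))
    return tuple(sum(attrs[v][d] for v in cluster) / len(cluster)
                 for d in range(dim))

def dist(a, b):
    """Euclidean distance between two attribute vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def merge_atoms(attrs, adj, atoms):
    """Bottom-up merging sketch (phase 2): repeatedly merge the pair of
    clusters that (i) share at least one relationship edge and (ii) has
    the smallest attribute-centroid distance. Returns the final cluster
    list and the dendrogram as a list of (left, right, distance) merges."""
    clusters = [set(a) for a in atoms]
    dendrogram = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # only graph-connected clusters may merge, so every
            # intermediate cluster stays internally connected
            if any(u in adj and clusters[j] & adj[u] for u in clusters[i]):
                d = dist(centroid(clusters[i], attrs),
                         centroid(clusters[j], attrs))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None:  # remaining clusters are mutually disconnected
            break
        d, i, j = best
        dendrogram.append((frozenset(clusters[i]), frozenset(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return clusters, dendrogram
```

On a toy input with two attribute-compact groups bridged by a single edge, the first merges occur within the groups and the expensive cross-group merge comes last, which is the structure an objective function would exploit to cut the dendrogram.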
Index Terms
- Joint cluster analysis of attribute and relationship data without a-priori specification of the number of clusters