RankClus: integrating clustering with ranking for heterogeneous information network analysis

Authors:
Yizhou Sun

University of Illinois at Urbana Champaign

University of Illinois at Urbana Champaign
View Profile

,
Jiawei Han

University of Illinois at Urbana Champaign

University of Illinois at Urbana Champaign
View Profile

,
Peixiang Zhao

University of Illinois at Urbana Champaign

University of Illinois at Urbana Champaign
View Profile

,
Zhijun Yin

University of Illinois at Urbana Champaign

University of Illinois at Urbana Champaign
View Profile

,
Hong Cheng

The Chinese University of Hong Kong

The Chinese University of Hong Kong
View Profile

,
Tianyi Wu

University of Illinois at Urbana Champaign

University of Illinois at Urbana Champaign
View Profile

EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database TechnologyMarch 2009Pages 565–576https://doi.org/10.1145/1516360.1516426

Published:24 March 2009Publication History

EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology

Pages 565–576

ABSTRACT

As information networks become ubiquitous, extracting knowledge from information networks has become an important task. Both ranking and clustering can provide overall views on information network data, and each has been a hot topic by itself. However, ranking objects globally without considering which clusters they belong to often leads to dumb results, e.g., ranking database and computer architecture conferences together may not make much sense. Similarly, clustering a huge number of objects (e.g., thousands of authors) in one huge cluster without distinction is dull as well.

In this paper, we address the problem of generating clusters for a specified type of objects, as well as ranking information for all types of objects based on these clusters in a multi-typed (i.e., heterogeneous) information network. A novel clustering framework called RankClus is proposed that directly generates clusters integrated with ranking. Based on initial K clusters, ranking is applied separately, which serves as a good measure for each cluster. Then, we use a mixture model to decompose each object into a K-dimensional vector, where each dimension is a component coefficient with respect to a cluster, which is measured by rank distribution. Objects then are reassigned to the nearest cluster under the new measure space to improve clustering. As a result, quality of clustering and ranking are mutually enhanced, which means that the clusters are getting more accurate and the ranking is getting more meaningful. Such a progressive refinement process iterates until little change can be made. Our experiment results show that RankClus can generate more accurate clusters and in a more efficient way than the state-of-the-art link-based clustering methods. Moreover, the clustering results with ranks can provide more informative views of data compared with traditional clustering.

References

J. Bilmes. A gentle tutorial on the em algorithm and its application to parameter estimation for gaussian mixture and hidden markov models, 1997.Google Scholar
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1--7):107--117, 1998. Google ScholarDigital Library
D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/gather: a cluster-based approach to browsing large document collections. pages 318--329, 1992.Google Scholar
DBLP. The dblp computer science bibliography. http://www.informatik.uni-trier.de/ley/db/.Google Scholar
J. E. Gentle and W. HSrdle. Handbook of Computational Statistics: Concepts and Methods, chapter 7 Evaluation of Eigenvalues, pages 245--247. Springer, 1 edition, 2004.Google Scholar
C. L. Giles. The future of citeseer. In 10th European Conference on PKDD (PKDD'06), page 2, 2006. Google ScholarDigital Library
Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. In Proceedings of the Thirtieth international conference on Very large data bases (VLDB'04), pages 576--587. VLDB Endowment, 2004. Google ScholarDigital Library
J. E. Hirsch. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, 102:16569, 2005.Google ScholarCross Ref
G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD conference (KDD'02), pages 538--543. ACM, 2002. Google ScholarDigital Library
W. Jiang, J. Vaidya, Z. Balaporia, C. Clifton, and B. Banich. Knowledge discovery from transportation network data. In Proceedings of the 21st ICDE Conference (ICDE'05), pages 1061--1072, 2005. Google ScholarDigital Library
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604--632, 1999. Google ScholarDigital Library
U. Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395--416, 2007. Google ScholarDigital Library
Z. Nie, Y. Zhang, J.-R. Wen, and W.-Y. Ma. Object-level ranking: Bringing order to web objects. In Proceedings of the fourteenth International World Wide Web Conference (WWW'05), pages 567--574. ACM, May 2005. Google ScholarDigital Library
S. Roy, T. Lane, and M. Werner-Washburne. Integrative construction and analysis of condition-specific biological networks. In Proceedings of AAAI'07, pages 1898--1899, 2007. Google ScholarDigital Library
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888--905, 2000. Google ScholarDigital Library
A. Sidiropoulos, D. Katsaros, and Y. Manolopoulos. Generalized h-index for disclosing latent facts in citation networks. CoRR, abs/cs/0607066, 2006.Google Scholar
X. Yin, J. Han, and P. S. Yu. Linkclus: Efficient clustering via heterogeneous semantic links. In Proceedings of the 32nd VLDB conference (VLDB'06), pages 427--438, 2006. Google ScholarDigital Library
O. Zamir and O. Etzioni. Grouper: A dynamic clustering interface to web search results. pages 1361--1374, 1999.Google Scholar

Recommendations

Hybrid Bisect K-Means Clustering Algorithm
BCGIN '11: Proceedings of the 2011 International Conference on Business Computing and Global Informatization

In this paper, we present a hybrid clustering algorithm that combines divisive and agglomerative hierarchical clustering algorithm. Our method uses bisect K-means for divisive clustering algorithm and Unweighted Pair Group Method with Arithmetic Mean (...
Read More
Proficient Normalised Fuzzy K-Means With Initial Centroids Methodology

This article describes how data is relevant and if it can be organized, linked with other data and grouped into a cluster. Clustering is the process of organizing a given set of objects into a set of disjoint groups called clusters. There are a number ...
Read More
On cluster tree for nested and multi-density data clustering

Clustering is one of the important data mining tasks. Nested clusters or clusters of multi-density are very prevalent in data sets. In this paper, we develop a hierarchical clustering approach-a cluster tree to determine such cluster structure and ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
March 2009
1180 pages
ISBN:9781605584225
DOI:10.1145/1516360
Editors:
Martin Kersten
CWI, The Netherlands
,
Boris Novikov
University of Saint Petersburg, Russia
,
Jens Teubner
ETH Zurich, Switzerland
,
Vladimir Polutin
HP Labs, Russia
,
Stefan Manegold
CWI, The Netherlands
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 March 2009
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate7of10submissions,70%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 295
  Total Citations
  View Citations
- 2,937
  Total Downloads
- Downloads (Last 12 months)340
- Downloads (Last 6 weeks)28
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

RankClus: integrating clustering with ranking for heterogeneous information network analysis

EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology

ABSTRACT

References

Cited By

Recommendations

Hybrid Bisect K-Means Clustering Algorithm

Proficient Normalised Fuzzy K-Means With Initial Centroids Methodology

On cluster tree for nested and multi-density data clustering