ABSTRACT
As information networks become ubiquitous, extracting knowledge from information networks has become an important task. Both ranking and clustering can provide overall views on information network data, and each has been a hot topic by itself. However, ranking objects globally without considering which clusters they belong to often leads to dumb results, e.g., ranking database and computer architecture conferences together may not make much sense. Similarly, clustering a huge number of objects (e.g., thousands of authors) in one huge cluster without distinction is dull as well.
In this paper, we address the problem of generating clusters for a specified type of objects, as well as ranking information for all types of objects based on these clusters in a multi-typed (i.e., heterogeneous) information network. A novel clustering framework called RankClus is proposed that directly generates clusters integrated with ranking. Based on initial K clusters, ranking is applied separately, which serves as a good measure for each cluster. Then, we use a mixture model to decompose each object into a K-dimensional vector, where each dimension is a component coefficient with respect to a cluster, which is measured by rank distribution. Objects then are reassigned to the nearest cluster under the new measure space to improve clustering. As a result, quality of clustering and ranking are mutually enhanced, which means that the clusters are getting more accurate and the ranking is getting more meaningful. Such a progressive refinement process iterates until little change can be made. Our experiment results show that RankClus can generate more accurate clusters and in a more efficient way than the state-of-the-art link-based clustering methods. Moreover, the clustering results with ranks can provide more informative views of data compared with traditional clustering.
- J. Bilmes. A gentle tutorial on the em algorithm and its application to parameter estimation for gaussian mixture and hidden markov models, 1997.Google Scholar
- S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1--7):107--117, 1998. Google ScholarDigital Library
- D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/gather: a cluster-based approach to browsing large document collections. pages 318--329, 1992.Google Scholar
- DBLP. The dblp computer science bibliography. http://www.informatik.uni-trier.de/ley/db/.Google Scholar
- J. E. Gentle and W. HSrdle. Handbook of Computational Statistics: Concepts and Methods, chapter 7 Evaluation of Eigenvalues, pages 245--247. Springer, 1 edition, 2004.Google Scholar
- C. L. Giles. The future of citeseer. In 10th European Conference on PKDD (PKDD'06), page 2, 2006. Google ScholarDigital Library
- Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. In Proceedings of the Thirtieth international conference on Very large data bases (VLDB'04), pages 576--587. VLDB Endowment, 2004. Google ScholarDigital Library
- J. E. Hirsch. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, 102:16569, 2005.Google ScholarCross Ref
- G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD conference (KDD'02), pages 538--543. ACM, 2002. Google ScholarDigital Library
- W. Jiang, J. Vaidya, Z. Balaporia, C. Clifton, and B. Banich. Knowledge discovery from transportation network data. In Proceedings of the 21st ICDE Conference (ICDE'05), pages 1061--1072, 2005. Google ScholarDigital Library
- J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604--632, 1999. Google ScholarDigital Library
- U. Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395--416, 2007. Google ScholarDigital Library
- Z. Nie, Y. Zhang, J.-R. Wen, and W.-Y. Ma. Object-level ranking: Bringing order to web objects. In Proceedings of the fourteenth International World Wide Web Conference (WWW'05), pages 567--574. ACM, May 2005. Google ScholarDigital Library
- S. Roy, T. Lane, and M. Werner-Washburne. Integrative construction and analysis of condition-specific biological networks. In Proceedings of AAAI'07, pages 1898--1899, 2007. Google ScholarDigital Library
- J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888--905, 2000. Google ScholarDigital Library
- A. Sidiropoulos, D. Katsaros, and Y. Manolopoulos. Generalized h-index for disclosing latent facts in citation networks. CoRR, abs/cs/0607066, 2006.Google Scholar
- X. Yin, J. Han, and P. S. Yu. Linkclus: Efficient clustering via heterogeneous semantic links. In Proceedings of the 32nd VLDB conference (VLDB'06), pages 427--438, 2006. Google ScholarDigital Library
- O. Zamir and O. Etzioni. Grouper: A dynamic clustering interface to web search results. pages 1361--1374, 1999.Google Scholar
Recommendations
Hybrid Bisect K-Means Clustering Algorithm
BCGIN '11: Proceedings of the 2011 International Conference on Business Computing and Global InformatizationIn this paper, we present a hybrid clustering algorithm that combines divisive and agglomerative hierarchical clustering algorithm. Our method uses bisect K-means for divisive clustering algorithm and Unweighted Pair Group Method with Arithmetic Mean (...
Proficient Normalised Fuzzy K-Means With Initial Centroids Methodology
This article describes how data is relevant and if it can be organized, linked with other data and grouped into a cluster. Clustering is the process of organizing a given set of objects into a set of disjoint groups called clusters. There are a number ...
On cluster tree for nested and multi-density data clustering
Clustering is one of the important data mining tasks. Nested clusters or clusters of multi-density are very prevalent in data sets. In this paper, we develop a hierarchical clustering approach-a cluster tree to determine such cluster structure and ...
Comments