Abstract
Real-world, multiple-typed objects are often interconnected, forming heterogeneous information networks. A major challenge for link-based clustering in such networks is their potential to generate many different results, carrying rather diverse semantic meanings. In order to generate desired clustering, we propose to use meta-path, a path that connects object types via a sequence of relations, to control clustering with distinct semantics. Nevertheless, it is easier for a user to provide a few examples (seeds) than a weighted combination of sophisticated meta-paths to specify her clustering preference. Thus, we propose to integrate meta-path selection with user-guided clustering to cluster objects in networks, where a user first provides a small set of object seeds for each cluster as guidance. Then the system learns the weight for each meta-path that is consistent with the clustering result implied by the guidance, and generates clusters under the learned weights of meta-paths. A probabilistic approach is proposed to solve the problem, and an effective and efficient iterative algorithm, PathSelClus, is proposed to learn the model, where the clustering quality and the meta-path weights mutually enhance each other. Our experiments with several clustering tasks in two real networks and one synthetic network demonstrate the power of the algorithm in comparison with the baselines.
- Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. 2008. Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9, 1981--2014. Google ScholarDigital Library
- Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. 2005. Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705--1749. Google ScholarCross Ref
- Bar-Hillel, A., Hertz, T., Shental, N., and Weinshall, D. 2005. Learning a Mahalanobis metric from equivalence constraints. J. Mach. Learn. Res. 6, 937--965. Google ScholarDigital Library
- Basu, S., Banerjee, A., and Mooney, R. 2002. Semi-supervised clustering by seeding. In Proceedings of the 19th International Conference on Machine Learning (ICML). Google ScholarDigital Library
- Basu, S., Bilenko, M., and Mooney, R. J. 2004. A probabilistic framework for semi-supervised clustering. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 59--68. Google ScholarDigital Library
- Bilenko, M., Basu, S., and Mooney, R. J. 2004. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the 21st International Conference on Machine Learning (ICML). Google ScholarDigital Library
- Cohn, D. and Chang, H. 2000. Learning to probabilistically identify authoritative documents. In Proceedings of the 17th International Conference on Machine Learning (ICML). 167--174. Google ScholarDigital Library
- Dhillon, I. S., Mallela, S., and Kumar, R. 2003. A divisive information-theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 3, 1265--1287. Google ScholarCross Ref
- Guyon, I. and Elisseeff, A. 2003. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157--1182. Google ScholarDigital Library
- Hofmann, T. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 50--57. Google ScholarDigital Library
- Kulis, B., Basu, S., Dhillon, I., and Mooney, R. 2005. Semi-supervised graph clustering: A kernel approach. In Proceedings of the 22nd International Conference on Machine Learning (ICML). 457--464. Google ScholarDigital Library
- Lee, D. D. and Seung, H. S. 2000. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13 (NIPS), 556--562.Google Scholar
- Long, B., Zhang, Z. M., Wu, X., and Yu, P. S. 2006. Spectral clustering for multi-type relational data. In Proceedings of the 23rd International Conference on Machine Learning (ICML). 585--592. Google ScholarDigital Library
- Long, B., Zhang, Z. M., and Yu, P. S. 2007. A probabilistic framework for relational clustering. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 470--479. Google ScholarDigital Library
- Luxburg, U. 2007. A tutorial on spectral clustering. Statist. Comput. 17, 395--416. Google ScholarDigital Library
- Newman, M. E. J. 2004. Fast algorithm for detecting community structure in networks. Phys. Rev. E 69, 6.Google Scholar
- Newman, M. E. J. and Girvan, M. 2004. Finding and evaluating community structure in networks. Phys. Rev. E 69, 2.Google ScholarCross Ref
- Punera, K. and Ghosh, J. 2008. Consensus-based ensembles of soft clusterings. Appl. Artif. Intell. 22, 780--810. Google ScholarDigital Library
- Shi, J. and Malik, J. 1997. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22, 888--905. Google ScholarDigital Library
- Strehl, A., Ghosh, J., and Cardie, C. 2002. Cluster ensembles---A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583--617. Google ScholarDigital Library
- Sun, Y., Han, J., Zhao, P., Yi, N. Z., Cheng, H., and Wu, T. 2009a. Rankclus: Integrating clustering with ranking for heterogeneous information network analysis. In Proceedings of the 12th International Conference on Extending Database Technology (EDBT). 565--576. Google ScholarDigital Library
- Sun, Y., Yu, Y., and Han, J. 2009b. Ranking-based clustering of heterogeneous information networks with star network schema. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 797--806. Google ScholarDigital Library
- Sun, Y., Han, J., Yan, X., Yu, P. S., and Wu, T. 2011. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. In Proceedings of International Conference on Very Large Data Bases (VLDB). 992--1003.Google Scholar
- Sun, Y., Norick, B., Han, J., Yan, X., Yu, P. S., and Yu, X. 2012. Integrating meta-path selection with user guided object clustering in heterogeneous information networks. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 1348--1356. Google ScholarDigital Library
- Wang, F., Li, T., Wang, X., Zhu, S., and Ding, C. 2011. Community discovery using nonnegative matrix factorization. Data Min. Knowl. Discov. 22, 3, 493--521. Google ScholarDigital Library
- Wang, N., Parthasarathy, S., Tan, K.-L., and Tung, A. K. H. 2008. CSV: Visualizing and mining cohesive subgraphs. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). 445--458. Google ScholarDigital Library
- Xu, X., Yuruk, N., Feng, Z., and Schweiger, T. A. J. 2007. Scan: A structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 824--833. Google ScholarDigital Library
- Xu, Z., King, I., Lyu, M. R.-T., and Jin, R. 2010. Discriminative semi-supervised feature selection via manifold regularization. Trans. Neur. Netw. 21, 1033--1047. Google ScholarDigital Library
- Yin, X., Han, J., and Yu, P. S. 2007. Crossclus: User-guided multirelational clustering. Data Min. Knowl. Discov. 15, 3, 321--348. Google ScholarDigital Library
- Zhao, Z. and Liu, H. 2007. Semi-supervised feature selection via spectral analysis. In Proceedings of the 7th SIAM International Conference on Data Mining (SDM).Google Scholar
- Zhu, X. and Ghahramani, Z. 2002. Learning from labeled and unlabeled data with label propagation. Tech. rep. CMU-CALD-02-107, Carnegie Mellon University.Google Scholar
- Zhu, X., Ghahramani, Z., and Lafferty, J. D. 2003. Semi-Supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference (ICML). 912--919.Google Scholar
Index Terms
- PathSelClus: Integrating Meta-Path Selection with User-Guided Object Clustering in Heterogeneous Information Networks
Recommendations
Integrating meta-path selection with user-guided object clustering in heterogeneous information networks
KDD '12: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data miningReal-world, multiple-typed objects are often interconnected, forming heterogeneous information networks. A major challenge for link-based clustering in such networks is its potential to generate many different results, carrying rather diverse semantic ...
User-Guided Clustering in Heterogeneous Information Networks via Motif-Based Comprehensive Transcription
Machine Learning and Knowledge Discovery in DatabasesAbstractHeterogeneous information networks (HINs) with rich semantics are ubiquitous in real-world applications. For a given HIN, many reasonable clustering results with distinct semantic meaning can simultaneously exist. User-guided clustering is hence ...
Evolutionary Clustering and Analysis of Bibliographic Networks
ASONAM '11: Proceedings of the 2011 International Conference on Advances in Social Networks Analysis and MiningIn this paper, we study the problem of evolutionary clustering of multi-typed objects in a heterogeneous bibliographic network. The traditional methods of homogeneous clustering methods do not result in a good typed-clustering. The design of ...
Comments