skip to main content
research-article

PathSim: meta path-based top-K similarity search in heterogeneous information networks

Published:01 August 2011Publication History
Skip Abstract Section

Abstract

Similarity search is a primitive operation in database and Web search engines. With the advent of large-scale heterogeneous information networks that consist of multi-typed, interconnected objects, such as the bibliographic networks and social media networks, it is important to study similarity search in such networks. Intuitively, two objects are similar if they are linked by many paths in the network. However, most existing similarity measures are defined for homogeneous networks. Different semantic meanings behind paths are not taken into consideration. Thus they cannot be directly applied to heterogeneous networks.

In this paper, we study similarity search that is defined among the same type of objects in heterogeneous networks. Moreover, by considering different linkage paths in a network, one could derive various similarity semantics. Therefore, we introduce the concept of meta path-based similarity, where a meta path is a path consisting of a sequence of relations defined between different object types (i.e., structural paths at the meta level). No matter whether a user would like to explicitly specify a path combination given sufficient domain knowledge, or choose the best path by experimental trials, or simply provide training examples to learn it, meta path forms a common base for a network-based similarity search engine. In particular, under the meta path framework we define a novel similarity measure called PathSim that is able to find peer objects in the network (e.g., find authors in the similar field and with similar reputation), which turns out to be more meaningful in many scenarios compared with random-walk based similarity measures. In order to support fast online query processing for PathSim queries, we develop an efficient solution that partially materializes short meta paths and then concatenates them online to compute top-k results. Experiments on real data sets demonstrate the effectiveness and efficiency of our proposed paradigm.

References

  1. A. Balmin, V. Hristidis, and Y. Papakonstantinou. Objectrank: authority-based keyword search in databases. In VLDB'04, 564--575, 2004.Google ScholarGoogle Scholar
  2. S. Berchtold, B. Ertl, D. A. Keim, H.-P. Kriegel, and T. Seidl. Fast nearest neighbor search in high-dimensional space. In ICDE'98, 209--218, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  3. I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In KDD'03, 89--98, 2003.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. L. R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297--302, 1945.Google ScholarGoogle ScholarCross RefCross Ref
  5. R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS'01, 102--113, 2001.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. D. Fogaras, B. Rácz, K. Csalogány, and T. Sarlós. Towards scaling fully personalized pageRank: algorithms, lower bounds, and experiments. Int. Math., 2(3):333--358, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  7. M. Gupta, A. Pathak, and S. Chakrabarti. Fast algorithms for topk personalized pagerank queries. In WWW'08, 1225--1226, 2008.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In KDD'02, 538--543, 2002.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4):422--446, 2002.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. G. Jeh and J. Widom. Scaling personalized web search. In WWW'03, 271--279, 2003.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. M. Kolahdouzan and C. Shahabi. Voronoi-based k nearest neighbor search for spatial network databases. In VLDB'04, 840--851, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  12. D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov. Accuracy estimate and optimization techniques for simrank computation. PVLDB, 1(1):422--433, 2008.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Z. Nie, Y. Zhang, J.-R. Wen, and W.-Y. Ma. Object-level ranking: bringing order to web objects. In WWW'05, 567--574, 2005.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. F. Pan, X. Zhang, and W. Wang. Crd: fast co-clustering on large datasets utilizing sampling-based matrix decomposition. In SIGMOD'08, 173--184, 2008.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. Shi, and J. Malik Normalized cuts and image segmentation. IEEE Trans. on PAMI, 22(8):888--905, 2000.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu. Rankclus: integrating clustering with ranking for heterogeneous information network analysis. In EDBT'09, 565--576, 2009.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Y. Sun, J. Han, J. Gao, and Y. Yu. iTopicModel: Information Network-Integrated Topic Modeling. In ICDM'09, 493--502, 2009.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. H. Tong, C. Faloutsos, J. Pan. Fast Random Walk with Restart and Its Applications. In ICDM'06, 613--622, 2006.Google ScholarGoogle Scholar
  19. C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k set similarity joins. In ICDE'09, 916--927, 2009.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. Scan: a structural clustering algorithm for networks. In KDD'07, 824--833, 2007.Google ScholarGoogle Scholar

Index Terms

  1. PathSim: meta path-based top-K similarity search in heterogeneous information networks
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image Proceedings of the VLDB Endowment
        Proceedings of the VLDB Endowment  Volume 4, Issue 11
        August 2011
        520 pages

        Publisher

        VLDB Endowment

        Publication History

        • Published: 1 August 2011
        Published in pvldb Volume 4, Issue 11

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader