research-article

PathSim: meta path-based top-K similarity search in heterogeneous information networks

Authors:
Yizhou Sun

University of Illinois at Urbana-Champaign

University of Illinois at Urbana-Champaign
View Profile

,
Jiawei Han

University of Illinois at Urbana-Champaign

University of Illinois at Urbana-Champaign
View Profile

,
Xifeng Yan

University of California at Santa Barbara

University of California at Santa Barbara
View Profile

,
Philip S. Yu

University of Illinois at Chicago

University of Illinois at Chicago
View Profile

,
Tianyi Wu

Microsoft Corporation

Microsoft Corporation
View Profile

Proceedings of the VLDB Endowment Volume 4 Issue 11pp 992–1003https://doi.org/10.14778/3402707.3402736

Published:01 August 2011Publication History

Proceedings of the VLDB Endowment

Abstract

Similarity search is a primitive operation in database and Web search engines. With the advent of large-scale heterogeneous information networks that consist of multi-typed, interconnected objects, such as the bibliographic networks and social media networks, it is important to study similarity search in such networks. Intuitively, two objects are similar if they are linked by many paths in the network. However, most existing similarity measures are defined for homogeneous networks. Different semantic meanings behind paths are not taken into consideration. Thus they cannot be directly applied to heterogeneous networks.

In this paper, we study similarity search that is defined among the same type of objects in heterogeneous networks. Moreover, by considering different linkage paths in a network, one could derive various similarity semantics. Therefore, we introduce the concept of meta path-based similarity, where a meta path is a path consisting of a sequence of relations defined between different object types (i.e., structural paths at the meta level). No matter whether a user would like to explicitly specify a path combination given sufficient domain knowledge, or choose the best path by experimental trials, or simply provide training examples to learn it, meta path forms a common base for a network-based similarity search engine. In particular, under the meta path framework we define a novel similarity measure called PathSim that is able to find peer objects in the network (e.g., find authors in the similar field and with similar reputation), which turns out to be more meaningful in many scenarios compared with random-walk based similarity measures. In order to support fast online query processing for PathSim queries, we develop an efficient solution that partially materializes short meta paths and then concatenates them online to compute top-k results. Experiments on real data sets demonstrate the effectiveness and efficiency of our proposed paradigm.

References

A. Balmin, V. Hristidis, and Y. Papakonstantinou. Objectrank: authority-based keyword search in databases. In VLDB'04, 564--575, 2004.Google Scholar
S. Berchtold, B. Ertl, D. A. Keim, H.-P. Kriegel, and T. Seidl. Fast nearest neighbor search in high-dimensional space. In ICDE'98, 209--218, 1998.Google ScholarCross Ref
I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In KDD'03, 89--98, 2003.Google ScholarDigital Library
L. R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297--302, 1945.Google ScholarCross Ref
R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS'01, 102--113, 2001.Google ScholarDigital Library
D. Fogaras, B. Rácz, K. Csalogány, and T. Sarlós. Towards scaling fully personalized pageRank: algorithms, lower bounds, and experiments. Int. Math., 2(3):333--358, 2005.Google ScholarCross Ref
M. Gupta, A. Pathak, and S. Chakrabarti. Fast algorithms for topk personalized pagerank queries. In WWW'08, 1225--1226, 2008.Google ScholarDigital Library
G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In KDD'02, 538--543, 2002.Google ScholarDigital Library
K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4):422--446, 2002.Google ScholarDigital Library
G. Jeh and J. Widom. Scaling personalized web search. In WWW'03, 271--279, 2003.Google ScholarDigital Library
M. Kolahdouzan and C. Shahabi. Voronoi-based k nearest neighbor search for spatial network databases. In VLDB'04, 840--851, 2004.Google ScholarCross Ref
D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov. Accuracy estimate and optimization techniques for simrank computation. PVLDB, 1(1):422--433, 2008.Google ScholarDigital Library
Z. Nie, Y. Zhang, J.-R. Wen, and W.-Y. Ma. Object-level ranking: bringing order to web objects. In WWW'05, 567--574, 2005.Google ScholarDigital Library
F. Pan, X. Zhang, and W. Wang. Crd: fast co-clustering on large datasets utilizing sampling-based matrix decomposition. In SIGMOD'08, 173--184, 2008.Google ScholarDigital Library
J. Shi, and J. Malik Normalized cuts and image segmentation. IEEE Trans. on PAMI, 22(8):888--905, 2000.Google ScholarDigital Library
Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu. Rankclus: integrating clustering with ranking for heterogeneous information network analysis. In EDBT'09, 565--576, 2009.Google ScholarDigital Library
Y. Sun, J. Han, J. Gao, and Y. Yu. iTopicModel: Information Network-Integrated Topic Modeling. In ICDM'09, 493--502, 2009.Google ScholarDigital Library
H. Tong, C. Faloutsos, J. Pan. Fast Random Walk with Restart and Its Applications. In ICDM'06, 613--622, 2006.Google Scholar
C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k set similarity joins. In ICDE'09, 916--927, 2009.Google ScholarDigital Library
X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. Scan: a structural clustering algorithm for networks. In KDD'07, 824--833, 2007.Google Scholar

Index Terms

PathSim: meta path-based top-K similarity search in heterogeneous information networks
1. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.

Recommendations

Neural PathSim for Inductive Similarity Search in Heterogeneous Information Networks
CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

PathSim is a widely used meta-path-based similarity in heterogeneous information networks. Numerous applications rely on the computation of PathSim, including similarity search and clustering. Computing PathSim scores on large graphs is computationally ...
Read More
PathSim: A Tool for Finding Minimal Energy Device Operation Sequence for Reaching a Target Context in a Smart-Home
UIC-ATC '13: Proceedings of the 2013 IEEE 10th International Conference on Ubiquitous Intelligence & Computing and 2013 IEEE 10th International Conference on Autonomic & Trusted Computing

In this paper, aiming to achieve minimal energy consumption in context-aware device control in a smart-home, we propose a tool called PathSim which derives a minimal cost sequence of device operations for moving between two arbitrary contexts. PathSim ...
Read More
A label-setting algorithm for finding a quickest path

The quickest path problem is to find a path to send a given amount of data from the source to the destination with minimum transmission time. To find the quickest path, existing algorithms enumerate nondominated paths with distinct capacity, and then ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 4, Issue 11
August 2011
520 pages
ISSN:2150-8097
Editor:
H. V. Jagadish
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2011
Published in pvldb Volume 4, Issue 11
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 518
  Total Citations
  View Citations
- 1,826
  Total Downloads
- Downloads (Last 12 months)422
- Downloads (Last 6 weeks)55
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

PathSim: meta path-based top-K similarity search in heterogeneous information networks

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Neural PathSim for Inductive Similarity Search in Heterogeneous Information Networks

PathSim: A Tool for Finding Minimal Energy Device Operation Sequence for Reaching a Target Context in a Smart-Home

A label-setting algorithm for finding a quickest path

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

PathSim: meta path-based top-K similarity search in heterogeneous information networks

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Neural PathSim for Inductive Similarity Search in Heterogeneous Information Networks

PathSim: A Tool for Finding Minimal Energy Device Operation Sequence for Reaching a Target Context in a Smart-Home

A label-setting algorithm for finding a quickest path

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media