Article

SimFusion: measuring similarity using unified relationship matrix

Authors:
Wensi Xi, Virginia Tech, Blacksburg, VA
Edward A. Fox, Virginia Tech, Blacksburg, VA
Weiguo Fan, Virginia Tech, Blacksburg, VA
Benyu Zhang, Microsoft Research Asia, Beijing, China
Zheng Chen, Microsoft Research Asia, Beijing, China
Jun Yan, Beijing University, Beijing, China
Dong Zhuang, Beijing Institute of Technology, Beijing, China

SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, August 2005, Pages 130–137. https://doi.org/10.1145/1076034.1076059

Published: 15 August 2005

ABSTRACT

In this paper we use a Unified Relationship Matrix (URM) to represent a set of heterogeneous data objects (e.g., web pages, queries) and their interrelationships (e.g., hyperlinks, user click-through sequences). We claim that iterative computations over the URM can help overcome the data sparseness problem and detect latent relationships among heterogeneous data objects, and can thus improve the quality of information applications that require the combination of information from heterogeneous sources. To support our claim, we present a unified similarity-calculating algorithm, SimFusion. By iteratively computing over the URM, SimFusion can effectively integrate relationships from heterogeneous sources when measuring the similarity of two data objects. Experiments based on a web search engine query log and a web page collection demonstrate that SimFusion can improve similarity measurement of web objects over both traditional content-based algorithms and the cutting-edge SimRank algorithm.
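
The abstract does not spell out the update rule, so the following is only a minimal sketch of what "iteratively computing over the URM" could look like. It assumes a row-normalized relationship matrix L over all objects (queries and pages together) and a similarity matrix S that starts as the identity and is repeatedly replaced by L S Lᵀ. The update rule, the toy weights in RAW, and the helper names (row_normalize, iterate_similarity) are illustrative assumptions, not taken from the text above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def row_normalize(m):
    """Scale each row to sum to 1 (all-zero rows are left as zeros)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def iterate_similarity(urm, steps=5):
    """Assumed update: S <- L * S * L^T, starting from S = identity."""
    n = len(urm)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    urm_t = transpose(urm)
    for _ in range(steps):
        sim = matmul(matmul(urm, sim), urm_t)
    return sim


# Hypothetical toy data. Objects, in order: query q0, query q1, page p0, page p1.
# A 1 marks a relationship (click-through, hyperlink, or the object itself).
RAW = [
    [1, 0, 1, 0],  # q0: itself, clicked p0
    [0, 1, 1, 1],  # q1: itself, clicked p0 and p1
    [1, 1, 1, 1],  # p0: clicked from q0 and q1, itself, linked with p1
    [0, 1, 1, 1],  # p1: clicked from q1, linked with p0, itself
]

sim = iterate_similarity(row_normalize(RAW), steps=3)
for row in sim:
    print(" ".join(f"{v:.3f}" for v in row))
```

In this toy setup the two queries have no direct relationship, yet their similarity becomes positive because both are related to page p0, which is the kind of latent relationship the abstract refers to. Keeping queries and pages in one matrix is what lets click-through and hyperlink evidence mix in a single iteration.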


Index Terms

SimFusion: measuring similarity using unified relationship matrix
1. Information systems
  1. Information retrieval
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory


Published in
SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
August 2005
708 pages
ISBN:1595930345
DOI:10.1145/1076034
General Chairs: Ricardo Baeza-Yates (University of Chile, Chile), Nivio Ziviani (Federal University of Minas Gerais, Brazil)
Program Chairs: Gary Marchionini (University of North Carolina, USA), Alistair Moffat (University of Melbourne, Australia), John Tait (University of Sunderland, UK)
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information integration
information retrieval
simfusion
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%

Article Metrics
- Total Citations: 117
- Total Downloads: 1,423
- Downloads (Last 12 months): 26
- Downloads (Last 6 weeks): 1