ABSTRACT
In this paper we use a Unified Relationship Matrix (URM) to represent a set of heterogeneous data objects (e.g., web pages, queries) and their interrelationships (e.g., hyperlinks, user click-through sequences). We claim that iterative computations over the URM can help overcome the data sparseness problem and detect latent relationships among heterogeneous data objects, and can thus improve the quality of information applications that require the combination of information from heterogeneous sources. To support our claim, we present a unified similarity-calculating algorithm, SimFusion. By iteratively computing over the URM, SimFusion can effectively integrate relationships from heterogeneous sources when measuring the similarity of two data objects. Experiments based on a web search engine query log and a web page collection demonstrate that SimFusion can improve similarity measurement of web objects over both traditional content-based algorithms and the cutting-edge SimRank algorithm.
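As a rough illustration of what "iteratively computing over the URM" could look like, the sketch below builds a toy URM over two queries and two pages and repeatedly applies an update of the form S ← L·S·Lᵀ, where L is the row-normalized URM and S starts as the identity matrix. The abstract does not state the update rule, so this rule, the toy data, and all function names (`row_normalize`, `iterate_similarity`, and so on) are assumptions made for illustration, not the authors' implementation.

```python
def row_normalize(matrix):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def transpose(matrix):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def iterate_similarity(urm, steps=1):
    """Assumed update rule: S <- L * S * L^T, starting from the identity."""
    n = len(urm)
    l = row_normalize(urm)
    lt = transpose(l)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        sim = matmul(matmul(l, sim), lt)
    return sim


# Hypothetical toy URM. Index 0 = query q0, 1 = query q1, 2 = page p0, 3 = page p1.
# Both queries have a click-through relationship with p0 but none with each other.
# The diagonal entries are self-relationships, added so that no row is empty.
URM = [
    [1, 0, 1, 0],  # q0: itself, clicked p0
    [0, 1, 1, 0],  # q1: itself, clicked p0
    [1, 1, 1, 0],  # p0: clicked from q0 and q1, itself
    [0, 0, 0, 1],  # p1: only itself
]

if __name__ == "__main__":
    sim = iterate_similarity(URM, steps=1)
    print("sim(q0, q1) =", sim[0][1])
    print("sim(q0, p1) =", sim[0][3])
```

In this toy setup the two queries share no direct relationship, yet after one iteration they receive a similarity of 0.25 because both relate to page p0, while q0 and the unrelated page p1 stay at 0. This is the kind of latent relationship the abstract refers to, shown here only under the assumed update rule.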