DOI: 10.1145/1076034.1076059
Article

SimFusion: measuring similarity using unified relationship matrix

Published: 15 August 2005

ABSTRACT

In this paper we use a Unified Relationship Matrix (URM) to represent a set of heterogeneous data objects (e.g., web pages, queries) and their interrelationships (e.g., hyperlinks, user click-through sequences). We claim that iterative computations over the URM can help overcome the data sparseness problem and detect latent relationships among heterogeneous data objects, and can thus improve the quality of information applications that require combination of information from heterogeneous sources. To support our claim, we present a unified similarity-calculating algorithm, SimFusion. By iteratively computing over the URM, SimFusion can effectively integrate relationships from heterogeneous sources when measuring the similarity of two data objects. Experiments based on a web search engine query log and a web page collection demonstrate that SimFusion can improve similarity measurement of web objects over both traditional content-based algorithms and the cutting-edge SimRank algorithm.
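
To make the idea of "iteratively computing over the URM" concrete, here is a minimal sketch in Python. The abstract does not state the update rule, so the form used here (start from the identity similarity matrix and repeatedly apply S <- L S L^T, where L is a row-normalized relationship matrix) is an assumption for illustration, not necessarily the paper's definition. The toy objects (two queries, two pages), their relationships, and all function names are likewise hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def row_normalize(m: Matrix) -> Matrix:
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def iterate_similarity(urm: Matrix, steps: int) -> Matrix:
    """Assumed update (not taken from the abstract): S <- L * S * L^T, S0 = I."""
    n = len(urm)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    urm_t = transpose(urm)
    for _ in range(steps):
        sim = matmul(matmul(urm, sim), urm_t)
    return sim


# Hypothetical toy data. Object order: query q0, query q1, page p0, page p1.
# Each object is related to itself; both queries led to a click on p0;
# p0 and p1 are connected by a hyperlink. q0 and q1 have no direct relationship.
raw = [
    [1.0, 0.0, 1.0, 0.0],  # q0
    [0.0, 1.0, 1.0, 0.0],  # q1
    [1.0, 1.0, 1.0, 1.0],  # p0
    [0.0, 0.0, 1.0, 1.0],  # p1
]

urm = row_normalize(raw)
sim = iterate_similarity(urm, steps=1)
print(sim[0][1])  # 0.25: q0 and q1 become similar through the shared page p0
```

In this toy case the two queries start with zero similarity and acquire a nonzero score after a single step, purely because they share a clicked page. That is the kind of latent relationship across object types that the abstract describes.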

      Published in

        SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
        August 2005
        708 pages
        ISBN: 1595930345
        DOI: 10.1145/1076034

        Copyright © 2005 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 792 of 3,983 submissions, 20%
