The impact on retrieval effectiveness of skewed frequency distributions

Published: 01 October 1999

Abstract

We present an analysis of word senses that provides a fresh insight into the impact of word ambiguity on retrieval effectiveness, with potential broader implications for other processes of information retrieval. Using a methodology of forming artificially ambiguous words, known as pseudowords, and through reference to other researchers' work, the analysis illustrates that the distribution of the frequency of occurrence of the senses of a word plays a strong role in ambiguity's impact on effectiveness. Further investigation shows that this analysis may also be applicable to other processes of retrieval, such as Cross Language Information Retrieval, query expansion, retrieval of OCR'ed texts, and stemming. The analysis appears to provide a means of explaining, at least in part, reasons for the processes' impact (or lack of it) on effectiveness.
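
The pseudoword idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a collection that is already tokenized into a list of words. The function names, the `/` joiner, and the toy tokens are hypothetical choices for illustration and are not taken from the paper.

```python
from collections import Counter


def make_pseudoword(members):
    """Build one artificial token that stands for all member words."""
    return "/".join(members)


def conflate(tokens, members):
    """Replace every occurrence of a member word with the pseudoword."""
    pseudo = make_pseudoword(members)
    member_set = set(members)
    return [pseudo if t in member_set else t for t in tokens]


def sense_distribution(tokens, members):
    """Relative frequency of each member word in the original tokens.

    Each member word plays the role of one 'sense' of the pseudoword.
    """
    member_set = set(members)
    counts = Counter(t for t in tokens if t in member_set)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {m: counts[m] / total for m in members}


# Toy collection (hypothetical data, not from the paper).
tokens = ["banana", "banana", "banana", "door", "banana", "spring", "door"]
members = ["banana", "spring"]

conflated = conflate(tokens, members)
distribution = sense_distribution(tokens, members)
```

In this toy collection one member word accounts for four of the five pseudoword occurrences, so the pseudoword's "senses" are skewed toward a single dominant member, which is the kind of frequency distribution the analysis is concerned with.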

Reviews

Richard S. Marcus

Attempts to incorporate word sense distinctions into the information retrieval process have generally led, counterintuitively, to minimal improvements in retrieval effectiveness. The authors, and others, have speculated that the skewed distribution of senses partly accounts for this unexpected result. In the research reported here, the authors support this hypothesis and further extend their analysis using, in part, the methodology of “pseudowords,” that is, the artificial conflation of several different word types into a single, indistinguishable word type in the index and search operations. The authors carefully explain their methodology and give clear and detailed frequency distributions for real words and pseudowords in several well-known IR databases and associated query sets. Anyone interested in word sense ambiguities in information retrieval or other applications should find the methodology and results detailed in this paper of interest.

I believe the potential utility of making word-sense distinctions is underestimated in this paper because the retrieval paradigm employed by these researchers, and used in other work they cite, is mainly in the statistical/probabilistic/vector model. For example, the authors state that they agree with a finding that the stemming operation has little utility for retrieval. This may often be true for the statistical paradigm, but is contradicted in other paradigms, such as the structured, contextual, interactive Boolean (SCIB) paradigm. The difference seems to be that more structured paradigms, such as SCIB, are more sensitive to such conceptual and sense distinctions than the statistical paradigm, which tends to mix up all words into an unstructured melange. In any case, however, it is acknowledged that word sense distinctions will become more useful as text processing becomes more sophisticated.
