DOI: 10.1145/304182.304224

Automatic discovery of language models for text databases

Published: 01 June 1999

ABSTRACT

The proliferation of text databases within large organizations and on the Internet makes it difficult for a person to know which databases to search. Given language models that describe the contents of each database, a database selection algorithm such as GLOSS can provide assistance by automatically selecting appropriate databases for an information need. Current practice is that each database provides its language model upon request, but this cooperative approach has important limitations.

This paper demonstrates that cooperation is not required. Instead, the database selection service can construct its own language models by sampling database contents via the normal process of running queries and retrieving documents. Although random sampling is not possible, it can be approximated with carefully selected queries. This sampling approach avoids the limitations that characterize the cooperative approach, and also enables additional capabilities. Experimental results demonstrate that accurate language models can be learned from a relatively small number of queries and documents.
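The sampling loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how queries are chosen, so the sketch assumes each new query is a single term drawn at random from the documents retrieved so far. `ToyDatabase`, `sample_language_model`, `seed_term`, `max_docs`, and `docs_per_query` are hypothetical names introduced here.

```python
import random
from collections import Counter


class ToyDatabase:
    """Hypothetical stand-in for an uncooperative text database.

    It exposes only a search interface, mirroring the setting in which
    the selection service cannot ask for a language model directly.
    """

    def __init__(self, docs):
        self._docs = list(docs)

    def search(self, term, limit):
        # Return up to `limit` (doc_id, text) pairs whose text contains the term.
        hits = [(i, d) for i, d in enumerate(self._docs) if term in d.split()]
        return hits[:limit]


def sample_language_model(db, seed_term, max_docs, docs_per_query=4, rng=None):
    """Build a term-frequency model from documents retrieved by queries.

    Assumption (not from the abstract): the next query is one term chosen
    at random from the model learned so far, never reusing a query.
    """
    rng = rng or random.Random(0)
    model = Counter()   # term -> occurrences in the sampled documents
    seen = set()        # ids of documents already sampled
    tried = set()       # query terms already issued
    query = seed_term
    while query is not None and len(seen) < max_docs:
        tried.add(query)
        for doc_id, text in db.search(query, docs_per_query):
            if doc_id in seen:
                continue  # count each document only once
            seen.add(doc_id)
            model.update(text.split())
            if len(seen) >= max_docs:
                break
        # Pick the next query from terms observed so far but not yet tried.
        candidates = sorted(set(model) - tried)
        query = rng.choice(candidates) if candidates else None
    return model, len(seen)


db = ToyDatabase(["apple banana", "banana cherry", "cherry date", "zebra yak"])
model, n_docs = sample_language_model(db, seed_term="apple", max_docs=10)
```

In this toy run the document "zebra yak" is never reached, because no sampled document shares a term with it. That is one way to see why the sample only approximates a random one and why the choice of queries matters.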


Published in

SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
June 1999
604 pages
ISBN: 1581130848
DOI: 10.1145/304182

Copyright © 1999 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
