ABSTRACT
The proliferation of text databases within large organizations and on the Internet makes it difficult for a person to know which databases to search. Given language models that describe the contents of each database, a database selection algorithm such as GlOSS can provide assistance by automatically selecting appropriate databases for an information need. In current practice, each database provides its language model upon request, but this cooperative approach has important limitations.
This paper demonstrates that cooperation is not required. Instead, the database selection service can construct its own language models by sampling database contents via the normal process of running queries and retrieving documents. Although random sampling is not possible, it can be approximated with carefully selected queries. This sampling approach avoids the limitations that characterize the cooperative approach, and also enables additional capabilities. Experimental results demonstrate that accurate language models can be learned from a relatively small number of queries and documents.
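The sampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `ToyDatabase` class, the single-term queries, the choice of the next query term at random from the vocabulary learned so far, and the `docs_per_query` and `max_queries` parameters are all assumptions made for illustration. The sketch only shows the general loop of running a query, retrieving a few documents, and updating a learned model of term document frequencies.

```python
import random
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


class ToyDatabase:
    """Hypothetical stand-in for a database that only answers queries."""

    def __init__(self, docs):
        self.docs = docs

    def search(self, term, limit):
        # Rank documents by how often they contain the query term.
        scored = [(tokenize(d).count(term), i) for i, d in enumerate(self.docs)]
        hits = sorted(((s, i) for s, i in scored if s > 0),
                      key=lambda p: (-p[0], p[1]))
        return [i for _, i in hits[:limit]]

    def fetch(self, doc_id):
        return self.docs[doc_id]


def sample_language_model(db, seed_term, max_docs,
                          docs_per_query=2, max_queries=100, rng=None):
    """Learn term document frequencies by querying db (assumed procedure)."""
    rng = rng or random.Random(0)
    doc_freq = Counter()  # learned model: term -> number of sampled docs containing it
    seen = set()          # ids of documents already sampled
    tried = set()         # terms already used as queries
    term = seed_term
    for _ in range(max_queries):
        if len(seen) >= max_docs:
            break
        tried.add(term)
        for doc_id in db.search(term, docs_per_query):
            if doc_id not in seen and len(seen) < max_docs:
                seen.add(doc_id)
                doc_freq.update(set(tokenize(db.fetch(doc_id))))
        # Assumption: pick the next query term from the learned vocabulary.
        candidates = sorted(set(doc_freq) - tried)
        if not candidates:
            break
        term = rng.choice(candidates)
    return doc_freq, seen


if __name__ == "__main__":
    docs = [
        "database selection uses language models",
        "sampling queries retrieve documents",
        "language models describe database contents",
        "queries sample the database",
    ]
    model, seen = sample_language_model(ToyDatabase(docs), "database", max_docs=3)
    print(sorted(seen), model.most_common(3))
```

Drawing later query terms from the learned vocabulary lets the sample move away from the seed term, which is one way to approximate random sampling when the database exposes nothing but a query interface.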
Index Terms
- Automatic discovery of language models for text databases