Automatic discovery of language models for text databases

Authors:
Jamie Callan, Margaret Connell, Aiqun Du

Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst, Massachusetts

SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data, June 1999, Pages 479–490
https://doi.org/10.1145/304182.304224

Published: 1 June 1999

ABSTRACT

The proliferation of text databases within large organizations and on the Internet makes it difficult for a person to know which databases to search. Given language models that describe the contents of each database, a database selection algorithm such as GlOSS can provide assistance by automatically selecting appropriate databases for an information need. Current practice is that each database provides its language model upon request, but this cooperative approach has important limitations.
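
To make the role of a language model concrete, the sketch below ranks databases for a query using only per-database term statistics. It is an illustration under stated assumptions, not GlOSS or any published selection algorithm: the `LanguageModel` class, the scoring rule (sum of the fraction of documents containing each query term), and the example databases are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LanguageModel:
    """Hypothetical summary of one database: size plus per-term document frequency."""
    num_docs: int
    doc_freq: dict[str, int] = field(default_factory=dict)


def score(model: LanguageModel, query_terms: list[str]) -> float:
    """Sum, over query terms, of the fraction of documents containing the term.

    This scoring rule is an assumption chosen for simplicity.
    """
    if model.num_docs == 0:
        return 0.0
    return sum(model.doc_freq.get(t, 0) / model.num_docs for t in query_terms)


def select_databases(models: dict[str, LanguageModel], query: str, k: int = 1) -> list[str]:
    """Return the names of the k best-scoring databases for the query."""
    terms = query.lower().split()
    ranked = sorted(models, key=lambda name: score(models[name], terms), reverse=True)
    return ranked[:k]


# Made-up statistics for two illustrative databases.
models = {
    "medicine": LanguageModel(1000, {"heart": 400, "surgery": 250, "court": 5}),
    "law": LanguageModel(2000, {"court": 1200, "appeal": 700, "heart": 10}),
}
best = select_databases(models, "heart surgery")
```

The selection service needs nothing beyond these term statistics, which is why the question of where the statistics come from matters.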

This paper demonstrates that cooperation is not required. Instead, the database selection service can construct its own language models by sampling database contents via the normal process of running queries and retrieving documents. Although random sampling is not possible, it can be approximated with carefully selected queries. This sampling approach avoids the limitations that characterize the cooperative approach, and also enables additional capabilities. Experimental results demonstrate that accurate language models can be learned from a relatively small number of queries and documents.
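
The sampling idea can be sketched as a loop that alternates between running a one-term query and updating the learned model from the retrieved documents. This is a minimal sketch, not the paper's procedure: the abstract does not say how queries are chosen, how many documents are examined per query, or when to stop, so drawing the next query term at random from the vocabulary learned so far, the `docs_per_query` and `max_queries` parameters, and the `ToyDatabase` stand-in are all assumptions.

```python
from __future__ import annotations

import random
from collections import Counter


class ToyDatabase:
    """Hypothetical stand-in for a remote database that exposes only search."""

    def __init__(self, docs: list[str]):
        self._docs = docs

    def search(self, term: str, k: int) -> list[tuple[int, str]]:
        """Return up to k (doc_id, text) pairs whose text contains the term."""
        hits = [(i, d) for i, d in enumerate(self._docs) if term in d.split()]
        return hits[:k]


def sample_language_model(db, seed_term, max_queries=50, docs_per_query=2, rng=None):
    """Learn term -> document-frequency counts using only db.search()."""
    rng = rng or random.Random(0)
    doc_freq = Counter()
    seen = set()
    term = seed_term
    for _ in range(max_queries):
        for doc_id, text in db.search(term, docs_per_query):
            if doc_id not in seen:  # count each document once
                seen.add(doc_id)
                doc_freq.update(set(text.split()))
        if not doc_freq:
            break  # the seed term matched nothing, so there is no vocabulary yet
        # Assumed strategy: next query term is drawn from the learned vocabulary.
        term = rng.choice(sorted(doc_freq))
    return doc_freq, len(seen)


docs = [
    "database selection uses language models",
    "language models describe database contents",
    "queries retrieve documents from a database",
    "sampling documents approximates random selection",
    "carefully selected queries drive sampling",
]
model, n_sampled = sample_language_model(ToyDatabase(docs), "database")
```

The sampler never reads the database's index or asks it for statistics; everything it learns comes from documents returned by ordinary queries.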


Index Terms

Automatic discovery of language models for text databases
1. Information systems
  1. Data management systems
    1. Database design and models
    2. Database management system engines
  2. Information systems applications

Published in
SIGMOD '99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data
June 1999
604 pages
ISBN: 1581130848
DOI: 10.1145/304182
Chairmen: Susan B. Davidson (Univ. of Pennsylvania, Philadelphia), Christos Faloutsos (Carnegie Mellon Univ., Pittsburgh)

ACM SIGMOD Record, Volume 28, Issue 2
June 1999
599 pages
ISSN: 0163-5808
DOI: 10.1145/304181
Chairmen: Susan Davidson (Univ. of Pennsylvania), Christos Faloutsos (Carnegie Mellon Univ.)

Copyright © 1999 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%