ABSTRACT
In this paper we present a parallel document ranking algorithm suitable for use on databases of 1-1000 GB, resident on primary or secondary storage. The algorithm is based on inverted indexes, and has two advantages over a previously published parallel algorithm for retrieval based on signature files. First, it permits the employment of ranking strategies which cannot be easily implemented using signature files, specifically methods which depend on document-term weighting. Second, it permits the interactive searching of databases resident on secondary storage. The algorithm is evaluated via a mixture of analytic and simulation techniques, with a particular focus on how cost-effectiveness and efficiency change as the size of the database, number of processors, and cost of memory are altered. In particular, we find that if the ratio of the number of processors and/or disks to the size of the database is held constant, then the cost-effectiveness of the resulting system remains constant. Furthermore, for a given size of database, there is a number of processors which optimizes cost-effectiveness. Estimated response times are also presented. Using these methods, it appears that cost-effective interactive access to databases in the 100-1000 GB range can be achieved using current technology.
- 1.Van Rijsbergen, C.J., Information Retrieval, Second Edition, Butterworths, London, 1979. Google ScholarDigital Library
- 2.Salton, G., The SMART Retrieval System -- Experiment in Automatic Document Processing, Prentice-Hall, Englewood Cliffs, New Jersey, 1971. Google ScholarDigital Library
- 3.Christodoulakis, S. and Faloutsos, C., "Design considerations for a message file server," IEEE Transactions on Software Engineering, Volume SE-10, 1984, pp. 201-210.Google ScholarDigital Library
- 4.Faloutsos, C. and Christodoulakis, S., "Signature files: An access method for documents and its analytic performance evaluation," A CM Transactions on Office Information Systems, Volume 2 Number 4, October 1984, pp. 267 - 288. Google ScholarDigital Library
- 5.Tsichritzis, D. and Christodoulakis, S., "Message files," A CM Transactions on Office Information Systems, Volume 11 Number 1, January 1983, pp. 88-98. Google ScholarDigital Library
- 6.Stanfill, C. and Kahle, B., "Parallel Free- Text Search on the Connection Machine System," Communications of the A CM, Volume 29 Number 12, December 1986, pp. 1229- 1238. Google ScholarDigital Library
- 7.Hillis, D., The Connection Machine, MIT Press, Cambridge MA, 1985, Google ScholarDigital Library
- 8."Connection Machine~ Model CM-2 Technical Summary, Technical Report HA87-4, Thinking Machines Corporation, Cambridge MA, 1987.Google Scholar
- 9.Stone, H., "Parallel Querying of Large Databases: A Case Study," IEEE Computer, October 1987, pp. 11-21. Google ScholarDigital Library
- 10.Salton, G. and Buckley, C., "Parallel Text Search Methods," Communications of the ACM, Volume 31 Number 2, February 1988, pp. 202-215. Google ScholarDigital Library
- 11.Croft, Bruce, "Implementing Ranking Strategies Using Text Signatures," A CM Transactions on Office Information Systems, Volume 6 Number 1, January 1988, pp. 42-62. Google ScholarDigital Library
- 12.Stanfill, C., "Parallel Computing for Information Retrieval: Recent Developments," Technical Report DR88-1, Thinking Machines Corporation, Cambridge MA, January 1988.Google Scholar
- 13.Zipf, H.P., Human Behavior and the Principle of Least Effort, Addison-Wesley, Cambridge MA, 1949.Google Scholar
Index Terms
- A parallel indexed algorithm for information retrieval
Recommendations
A parallel indexed algorithm for information retrieval
Special issue: Proceedings of the 12th annual international ACMSIGIR conference on Research and development in information retrieval, N.J. Belkin and C.J. van Rijsbergen (Eds.), June 25-28, 1989, Cambridge, MA.In this paper we present a parallel document ranking algorithm suitable for use on databases of 1-1000 GB, resident on primary or secondary storage. The algorithm is based on inverted indexes, and has two advantages over a previously published parallel ...
Efficient Directory Mutations in a Full-Path-Indexed File System
Special Issue on FAST 2018 and Regular PapersFull-path indexing can improve I/O efficiency for workloads that operate on data organized using traditional, hierarchical directories, because data is placed on persistent storage in scan order. Prior results indicate, however, that renames in a local ...
A formal system for information retrieval from files
A generalized file structure is provided by which the concepts of keyword, index, record, file, directory, file structure, directory decoding, and record retrieval are defined and from which some of the frequently used file structures such as inverted ...
Comments