DOCUMENT RETRIEVAL EXPERIMENTS USING INDEXING VOCABULARIES OF VARYING SIZE. II. HASHING, TRUNCATION, DIGRAM AND TRIGRAM ENCODING OF INDEX TERMS

PETER WILLETT (Postgraduate School of Librarianship and Information Science, University of Sheffield)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 1 April 1979

Abstract

This paper describes the use of fixed‐length character strings for controlling the size of indexing vocabularies in reference retrieval systems. Experiments with the Cranfield test collection show that trigram encoding of words performs noticeably better than the use of digrams; however, use of the least frequent digram in each term produces more acceptable results. Hashing of terms gives a better performance than that obtained from a vocabulary of comparable size produced by right‐hand truncation. The application of small indexing vocabularies to the sequential searching of large document files is discussed.
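The four ways of reducing an index term to a short, fixed-length key that the abstract names (right-hand truncation, hashing, and digram or trigram encoding) can be illustrated roughly as below. This is a minimal sketch and not the procedure used in the paper: the function names, the polynomial hash, the alphabetical tie-break for the least frequent digram, and the four sample terms are all assumptions made for illustration.

```python
from collections import Counter


def truncate(term, length):
    """Right-hand truncation: keep only the first `length` characters."""
    return term[:length]


def ngrams(term, n):
    """Overlapping character n-grams (n=2 for digrams, n=3 for trigrams)."""
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def hash_term(term, buckets):
    """Map a term to one of `buckets` codes.

    Assumed hash: a simple polynomial rolling hash, chosen because it is
    deterministic across runs (Python's built-in hash() is salted for str).
    """
    value = 0
    for ch in term:
        value = (value * 31 + ord(ch)) % buckets
    return value


def least_frequent_digram(term, digram_counts):
    """Pick the digram of `term` that is rarest in the vocabulary.

    Ties are broken alphabetically (an assumption, for reproducibility).
    """
    return min(ngrams(term, 2), key=lambda d: (digram_counts[d], d))


# Hypothetical sample vocabulary, used only to build digram frequencies.
terms = ["retrieval", "retrieve", "index", "indexing"]
digram_counts = Counter(d for t in terms for d in ngrams(t, 2))

for t in terms:
    print(
        t,
        truncate(t, 5),
        ngrams(t, 3),
        hash_term(t, 64),
        least_frequent_digram(t, digram_counts),
    )
```

In each case the size of the resulting vocabulary is bounded by a parameter (the truncation length, the number of hash buckets, or the number of distinct n-grams over the alphabet) rather than by the number of distinct words in the collection.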

Citation

WILLETT, P. (1979), "DOCUMENT RETRIEVAL EXPERIMENTS USING INDEXING VOCABULARIES OF VARYING SIZE. II. HASHING, TRUNCATION, DIGRAM AND TRIGRAM ENCODING OF INDEX TERMS", Journal of Documentation, Vol. 35 No. 4, pp. 296-305. https://doi.org/10.1108/eb026684

Publisher

MCB UP Ltd

Copyright © 1979, MCB UP Limited
