Explorations in Automatic Thesaurus Discovery

Author: Gregory Grefenstette

Publisher: Springer US

Book series: The International Series in Engineering and Computer Science

Included in: Professional Book Archive

About this book

Explorations in Automatic Thesaurus Discovery presents an automated method for creating a first-draft thesaurus from raw text. It describes the natural language processing steps of tokenization, surface syntactic analysis, and syntactic attribute extraction. From these attributes, word and term similarity is calculated, and a thesaurus is created showing important common terms and their relations to each other, common verb-noun pairings, common expressions, and word family members.
The techniques are tested on twenty different corpora, ranging from baseball newsgroups, assassination archives, medical X-ray reports, and abstracts on AIDS to encyclopedia articles on animals, and even on the text of the book itself. The corpora range from 40,000 to 6 million characters of text, and results are presented for each in the Appendix.
The methods described in the book have undergone extensive evaluation. Their time and space complexity are shown to be modest. The results are shown to converge to a stable state as the corpus grows. The similarities calculated are compared to those produced by psychological testing. A method of evaluation using Artificial Synonyms is tested. Gold-standard evaluations show that the techniques significantly outperform non-linguistic techniques for the most important words in corpora.
Explorations in Automatic Thesaurus Discovery includes applications to information retrieval using established testbeds, to the enrichment of existing thesauri, and to semantic analysis. Also included are applications showing how to create, implement, and test a first-draft thesaurus.
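To make the step from extracted attributes to word similarity concrete, here is a minimal sketch. It assumes a plain, unweighted Jaccard measure over sets of attributes, which is a simplification chosen for illustration and is not claimed to be the book's exact measure. The word and attribute pairs, the attribute labels, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (word, attribute) pairs standing in for the output of
# syntactic attribute extraction. They are invented for illustration only.
PAIRS = [
    ("dog", "adj:small"), ("dog", "subj-of:bark"), ("dog", "obj-of:feed"),
    ("cat", "adj:small"), ("cat", "obj-of:feed"), ("cat", "subj-of:purr"),
    ("car", "adj:small"), ("car", "obj-of:drive"),
]


def build_attribute_sets(pairs):
    """Group attributes by word: {word: set of attributes}."""
    sets = defaultdict(set)
    for word, attr in pairs:
        sets[word].add(attr)
    return dict(sets)


def jaccard(a, b):
    """Unweighted Jaccard similarity: shared attributes / all attributes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(word, attribute_sets, k=2):
    """Return the k words whose attribute sets are closest to `word`."""
    target = attribute_sets[word]
    scored = [
        (other, jaccard(target, attrs))
        for other, attrs in attribute_sets.items()
        if other != word
    ]
    # Sort by descending similarity, then alphabetically to break ties.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:k]


if __name__ == "__main__":
    attribute_sets = build_attribute_sets(PAIRS)
    print(most_similar("dog", attribute_sets))
```

In this toy data, "dog" and "cat" share two of four distinct attributes, so "cat" ranks above "car". A first-draft thesaurus entry could then be read off as the top-ranked neighbours of each word.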

Table of contents

Frontmatter

1. Introduction

Abstract

The major problem with access to textual information via computers is that of word choice, a problem generated by the basic human ability to express one concept in a variety of ways. This variability raises the question of how the computer can know that the words a person uses are related to words found in stored text. Any computer-based system that employs a natural, rather than an artificial, language in its dialogue with human users is faced with this problem.

Gregory Grefenstette

2. Semantic Extraction

Abstract

The real impetus for development of computer-based techniques for dealing with semantics was the realization that simple word substitution was inadequate (I.B.M. 1959) for machine translation. It was realized that context influenced word meaning and that each word’s context would have to be taken into account. The early information retrieval community was also interested in semantic classification, but for a different reason. Information retrieval, since it arose as a science from library science, which itself had a long history of classification, was interested in implementing an online version of human classification schemes and was concerned in a subsidiary manner with automating this classification process.

Gregory Grefenstette

3. Sextant

Abstract

In the last chapter, we provided motivation for the work that we present here by claiming that, in the face of ever greater electronic creation and manipulation of text, the demand for tools to manage and to structure such text will also grow. We have argued that manual approaches to structuring textual knowledge, though useful and promising, cannot keep pace, or be economically justified. In reviewing past attempts at automatic term association, we reviewed work using textual sources whose structure was known, but observed that such work was necessarily limited to a small finite number of sources such as costly man-made dictionaries and thesauri. Such an observation leads us to a discussion of the philosophy of this research, which we outline now.

Gregory Grefenstette

4. Evaluation

Abstract

The previous chapter described SEXTANT’s partial syntactic extraction technique, and explained how these syntactically derived contexts were used to compare words and produce lists of similar words. Visual inspection of these lists gives the intuitive impression that the words on these lists are related. The purpose of this chapter is to demonstrate in some objective manner that the relationships extracted are what are commonly considered semantic relationships.

Gregory Grefenstette

5. Applications

Abstract

In the last two chapters, we described our semantic extraction techniques and showed that the lists of similar words extracted corresponded to the types of lists manually created by humans for general English. We argued that the advantage of having an automatic technique that approximates such extraction is that, in addition to being fast and economical, it provides information that is specific to the corpus from which it is derived. Here we present a few possible applications of these extracted relations. In the next section, we describe our experiments with the automatic expansion of queries in a classical information retrieval setting. After that, we present experiments showing how the techniques developed in SEXTANT can be applied to enriching existing knowledge structures. We treat the problem of inserting a new word into its proper place in a thesaurus. These experiments also demonstrate how two knowledge-poor techniques can reinforce each other. Then we show how a deeper exploration of the information extracted by SEXTANT permits the creation of clusters of words along semantic axes. Finally, in the last section of this chapter, we organize all the disparate techniques developed throughout this book and demonstrate how the first draft of a corpus-derived thesaurus can be automatically created from raw text.

Gregory Grefenstette

6. Conclusion

Summary

In the preceding chapters we have examined a selective natural language processing approach to extracting corpus-specific semantics. We described the motivation of this approach: the language variability problem that affects any computer-based manipulation of text, e.g., information retrieval, filtering, language understanding, human-computer interfaces, machine translation. This problem generated much research in computer-based semantics, a portion of which we reviewed before we presented our system SEXTANT.

Gregory Grefenstette

Backmatter

Title: Explorations in Automatic Thesaurus Discovery
Author: Gregory Grefenstette
Publisher: Springer US
Electronic ISBN: 978-1-4615-2710-7
Print ISBN: 978-1-4613-6167-1
DOI: https://doi.org/10.1007/978-1-4615-2710-7
