18.3.1 RuThes General Structure
In the construction of RuThes, both popular paradigms for computer thesauri were used: concept-based units, a small set of relation types, and rules for including multiword expressions, as in information retrieval thesauri; and language-motivated units, detailed sets of synonyms, and description of ambiguous words, as in wordnets. Also, some issues of ontology research (for example, concepts as main units, strictness of relation description, and the necessity for many-step inference) are accounted for (Guarino 1998, 2009).
RuThes is a hierarchical network of concepts. Each concept has a name, relations with other concepts, and a set of language expressions (words, phrases, terms) whose meanings correspond to the concept. The whole set of RuThes concepts is subdivided into the General Lexicon and the Sociopolitical Thesaurus.
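The concept structure described above (a name, a set of text entries, and relations with other concepts) can be sketched as a small data class. This is a minimal illustration, not the RuThes storage format: the class name `Concept`, the `link` method, the relation label `"superclass"`, and the two sample concepts are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Concept:
    """One thesaurus concept: a name, its text entries, and typed links."""

    name: str
    text_entries: set[str] = field(default_factory=set)
    # Maps a relation label to the names of the related concepts.
    relations: dict[str, set[str]] = field(default_factory=dict)

    def link(self, relation: str, target: Concept) -> None:
        """Record a directed, labeled relation to another concept."""
        self.relations.setdefault(relation, set()).add(target.name)


# Hypothetical sample data, not taken from the published thesaurus.
law = Concept("Law", {"pravo"})
civil_law = Concept("Civil law", {"graždanskoe pravo"})
civil_law.link("superclass", law)
```

Storing relations by label keeps the structure independent of how many relation types the thesaurus finally uses.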
The General Lexicon comprises general concepts and words that occur across various specific domains, such as sozdanie (creation), udalit’ (to remove), and uslovnye (conditional).
The Sociopolitical Thesaurus contains thematically oriented lexemes and multiword expressions as well as domain-specific terms of the broad sociopolitical domain. The whole RuThes thesaurus includes more than 60,000 concepts and more than 200,000 Russian text entries (words and expressions). The published version of RuThes for use in noncommercial applications includes 110,000 text entries (RuThes 2019).
The sociopolitical domain is the domain of problems, relationships, and situations of contemporary society (Loukachevitch and Dobrov 2015). Subdomains of the sociopolitical domain are themselves large domains such as economics, law, or international relations, each with its own terminology. However, the specific feature of the sociopolitical domain (and its subdomains) is that most domain terms are known to nonprofessionals. In this domain, general language and domain terminologies adjoin and mix with each other. At present, the RuThes sociopolitical thesaurus includes terminology from such domains as politics, elections, sociology, demography, social security, civil and criminal law, the court system, banking, security, economics (including macroeconomics, industry, agriculture, and transport), ecology, accidents, sports, culture, and others.
18.3.2 RuThes Units
The RuThes thesaurus is a hierarchy of concepts viewed as units of thought. A concept is associated with the set of language expressions that refer to it in texts. This approach is similar to that of traditional information retrieval thesaurus construction (NISO 2005). In most cases, concepts should have denotational distinctions from related concepts. Such distinctions can be expressed in a specific set of relationships or in the associated language expressions (text entries).
Words and phrases whose meanings refer to the same concept represented in the thesaurus are called ontological synonyms. In contrast to traditional terminological resources and information retrieval thesauri, which contain mainly nouns or noun phrases, ontological synonyms can comprise sense-related words belonging to different parts of speech (e.g., privatizaciâ [privatization] vs. privatizirovat’ [to privatize]). A thesaurus for automatic document processing should contain various types of language units. Also, language expressions relating to different linguistic styles, technical terms, and lexical units can be presented as ontological synonyms related to the same concept. For example, the concept Oil industry has the following text entries: neftânaâ promyšlennost’ (oil industry; neutral), neftânka (slang), and nefteprom (abbreviation). Compositional multiword expressions may be included in synonymic sets as well. Each concept should have a clear, univocal, and concise name. Such names often help to express and delimit the denotational scope of the concept. In addition, concept names can be used when analyzing the results of automatic document processing, for example in the visualization of trends or as cluster names.
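To make the idea of ontological synonyms concrete, the following sketch inverts a concept-to-entries table into a lookup from text entry to concept. The text entries are the ones quoted above; the dictionary layout, the concept name `Privatization`, and the helper names `build_entry_index` and `concepts_for` are assumptions made for illustration only.

```python
from collections import defaultdict

# Text entries per concept. The entries come from the examples in the text;
# the layout and the concept name "Privatization" are assumptions.
CONCEPT_ENTRIES = {
    "Oil industry": ["neftânaâ promyšlennost’", "neftânka", "nefteprom"],
    "Privatization": ["privatizaciâ", "privatizirovat’"],
}


def build_entry_index(concept_entries):
    """Invert concept -> entries into entry -> set of concept names."""
    index = defaultdict(set)
    for concept, entries in concept_entries.items():
        for entry in entries:
            index[entry.lower()].add(concept)
    return index


ENTRY_INDEX = build_entry_index(CONCEPT_ENTRIES)


def concepts_for(entry):
    """Return the sorted concept names a text entry may refer to."""
    return sorted(ENTRY_INDEX.get(entry.lower(), set()))
```

With this index, the neutral term, the slang form, and the abbreviation all resolve to the same concept, and a noun and its related verb can share one concept as well.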
Ontological synonyms, variants of lexical units, and technical terms (Nazarenko and Zargayouna 2009) are specially collected. After a concept has been introduced, an expert searches for all possible synonyms or orthographic variants, single words, and phrases that can be associated with it. These synonymic sets can also include multiple variants of the references to the same concept. For example, the concept Ohrana prirody (Nature protection) is associated with almost 50 different text entries in Russian, for example zaŝita prirody (defense of nature), sohranenie prirody (maintenance of nature), zaŝiŝat’ prirodu (to protect nature), sohranât’ prirodu (to maintain nature), and others. These variants are useful to describe in the thesaurus because they directly refer to their concept. In addition, multiword term variants often contain ambiguous words. Thus, the inclusion of such term variants decreases the overall lexical ambiguity and facilitates disambiguation. All variants are collected during the analysis of real texts, usually news articles, legislative acts, or domain-specific documents.
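One common way to exploit multiword variants during text analysis is longest-match lookup: when a multiword entry matches, the ambiguous single word inside it is never looked up on its own. The sketch below assumes this strategy; the source does not say that RuThes-based processing works this way. The two senses listed for the single word zaŝita and the function name `match_longest` are illustrative assumptions.

```python
# Entry (as a token tuple) -> candidate concepts. The multiword entry is from
# the text; the two senses of the single word "zaŝita" are assumed.
ENTRIES = {
    ("zaŝita",): ["Protection", "Legal defense"],
    ("zaŝita", "prirody"): ["Nature protection"],
}
MAX_LEN = max(len(key) for key in ENTRIES)


def match_longest(tokens):
    """Greedy left-to-right matching that prefers the longest known entry."""
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in ENTRIES:
                matches.append((" ".join(key), ENTRIES[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token; skip it
    return matches
```

For the token sequence `["zaŝita", "prirody"]` the function yields a single unambiguous match, whereas `["zaŝita"]` alone still carries both assumed senses.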
In fact, the introduction of such a concept as Nature protection corresponds more to information retrieval thesauri than to wordnets, because one of the important principles of WordNet-like resources is to include only single words and lexicalized phrases in synsets (Bentivogli and Pianta 2004; Maziarz and Piasecki 2018). The phrase nature protection seems compositional, but the concept Nature protection is significant for contemporary society, and it has relations with other important concepts of the sociopolitical domain.
As can be seen, one of the difficult issues in developing application-oriented resources, such as wordnets or information retrieval thesauri, is the inclusion of units (synsets or descriptors) based on the senses of multiword expressions, for example noun compounds (Bentivogli and Pianta 2004). Manuals and standards for information retrieval thesaurus development provide detailed principles for multiword term selection (NISO 2005; Aitchinson and Gilchrist 1987). In RuThes, the introduction of concepts based on multiword expressions is not restricted but encouraged if the concept adds new information to the knowledge described in the thesaurus (Loukachevitch and Lashevich 2016).
18.3.3 RuThes Relations
Conceptual relations in the thesaurus may be utilized for several purposes, including query expansion in information retrieval, clustering related concepts mentioned in a text as a basis for better recognition of the main theme and subthemes in the document, and disambiguation of ambiguous terms and lexical units. Working with such a broad scope of concepts, we utilize a set of relations that can be applied to concepts in various domains, in contrast to domain-dependent relations.
RuThes has a small set of conceptual relations consisting of four main relations that describe the most important links of a concept. In fact, the current set of relations in the thesaurus is a more ontologically motivated variant of classic inter-descriptor relationships in information retrieval thesauri, which usually include hierarchical relations, such as broader term (BT) and narrower term (NT), and associative relations—related term (RT).
The first relation of RuThes is the class-subclass relation as it is treated in ontological approaches (Guarino 1998; Gangemi et al. 2003). To establish such relations, we apply tests similar to those used in ontology development. The tests are directed toward avoiding incorrect use of class-subclass relations and not mixing them up with other types of relations (such as type-role or class-instance relations), because errors in relation types degrade logical inference (Gangemi et al. 2003). The class-subclass relationship is considered a transitive relation with the inheritance property.
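Transitivity and inheritance can be illustrated with a small sketch. It computes all direct and indirect superclasses of a concept and then collects relations stated on any of them. The sample hierarchy, the sample part-whole link, and the reading of inheritance as "a subclass inherits the wholes of its superclasses" are assumptions made for the example, not data or rules taken from RuThes.

```python
# child -> direct superclasses (hypothetical sample hierarchy).
SUPERCLASSES = {
    "Civil law": {"Law"},
    "Criminal law": {"Law"},
    "Law": {"Social norm"},
}

# concept -> wholes stated directly on it (hypothetical sample link).
WHOLES = {"Law": {"Legal system"}}


def all_superclasses(concept, superclasses=SUPERCLASSES):
    """Return every direct and indirect superclass of a concept."""
    seen, stack = set(), [concept]
    while stack:
        for parent in superclasses.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inherited_wholes(concept):
    """Wholes stated on the concept itself or on any of its superclasses."""
    result = set(WHOLES.get(concept, ()))
    for ancestor in all_superclasses(concept):
        result |= WHOLES.get(ancestor, set())
    return result
```

A mislabeled link, for example a class-instance relation stored as class-subclass, would propagate through both functions, which is why errors in relation types degrade inference.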
The second relationship is the part-whole relation, which is established using specific ontological restrictions (Gangemi et al. 2003). Our decision on part-whole relations is based on the following principles:
- Broad treatment of part-whole relations from the semantic point of view,
- Restriction of ontological subtypes of part-whole relations,
- Postulating the transitivity of part-whole relations.
Part-whole relations in RuThes comprise such relationships as parts of physical objects, territorial and geographical parts, process parts, and others (see examples in Table 18.1). Also, some other relationships are presented as part-whole relations in RuThes: an attribute and its bearer, a role or a participant in the situation (Winston et al. 1987, 27–28), and entities and situations in the encompassing sphere of activity (Table 18.1).
Table 18.1 Types and examples of part-whole relations in RuThes
Type of part-whole relation | Part | Whole |
Parts of physical objects | starter dvigatelâ (motor starter), kost’ (bone) | dvigatel’ vnutrennego sgoraniâ (internal combustion engine), skelet (skeleton) |
Territorial and geographical parts | oazis (oasis), izbiratel’nyj učastok (electoral precinct), bankovskij sejf (bank safe) | pustynâ (desert), izbiratel’nyj okrug (electoral district), bankovskoe hraniliŝe (bank vault) |
Process parts | izbiratel’naâ tehnologiâ (electoral technology) | predvybornaâ kampaniâ (pre-election campaign) |
Text and musical parts | vvedenie (text introduction), muzykal’nyj interval (musical interval) | tekst (text), muzykal’naâ kompoziciâ (musical composition) |
Members | člen političeskoj partii (political party member), deputat Gosudarstvennoj Dumy (State Duma Deputy) | političeskaâ partiâ (political party), Gosudarstvennaâ Duma (State Duma, the lower house of the Russian Parliament) |
Substance as a part | židkost’ v organizme (body fluids) | telo (body of living organism) |
An attribute and its bearer | skorost’ (speed), glasnost’ vyborov (election publicity) | dviženie (movement), vybory (election) |
Roles and participants in a situation | investor (investor), igrok (player) | investirovanie (investing), igra (game) |
Entities and situations in the encompassing sphere of activity | zavod (industrial plant), sportsmen (sportsman) | promyšlennost’ (industry), sport (sport) |
In such a broad scope, part-whole relations described in RuThes are close to the so-called internal relations (parthood, constitution, quality inherence, and participation) as described by Guarino (2009). At the same time, part-whole relations in RuThes have a very important restriction (correlating with the information retrieval thesauri guidelines about the necessity to describe only inherent properties as hierarchical relations [NISO 2005]): a concept-part should be related to its whole during the normal existence of its instances. This requirement is known as ontological dependence.
To analyze the ontological dependence between entities X and Y, it is necessary to determine whether entity X can exist by itself or whether its existence depends on the existence of Y. We describe the following types of dependent parts in RuThes:
Thus, we put existential constraints on the part-whole relations in RuThes. These constraints do not change the transitivity of part-whole relations if it was postulated. The inference mechanism can thereby utilize the transitivity of part-whole relations and rely on the chain of part-whole relations (Guizzardi 2011; Loukachevitch and Dobrov 2015).
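How an inference mechanism might rely on a chain of part-whole relations can be sketched with a breadth-first search that returns the chain itself. The links from kost’ to skelet and from židkost’ v organizme to telo follow Table 18.1; the link from skelet to telo and the function name `part_whole_chain` are assumptions added so that the example has a two-step chain.

```python
from collections import deque

# part -> direct wholes. Two links follow Table 18.1; "skelet" -> "telo"
# is an assumed link added for the example.
WHOLE_OF = {
    "kost’": {"skelet"},
    "skelet": {"telo"},
    "židkost’ v organizme": {"telo"},
}


def part_whole_chain(part, whole, whole_of=WHOLE_OF):
    """Breadth-first search for a chain of part-whole links, or None."""
    queue, seen = deque([[part]]), {part}
    while queue:
        path = queue.popleft()
        if path[-1] == whole:
            return path
        for nxt in sorted(whole_of.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

Here `part_whole_chain("kost’", "telo")` returns the three-element chain through skelet, while the reverse query returns `None` because the relation is directed from part to whole.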
The final types of relationships are nonsymmetrical and symmetrical associations, which result from splitting the symmetric related term (RT) relation of conventional information retrieval thesauri. The nonsymmetrical associations are established on the basis of the ontological dependence of concepts. Symmetrical associations are described in a very restricted number of cases.
Associative relationships (RT relations) are quite common in information retrieval thesauri; they are established to provide additional links between descriptors for use in the indexing or retrieval of documents (NISO 2005). Such relations in information retrieval thesauri are always considered symmetrical; however, many associative relations found in published thesauri demonstrate an evident absence of symmetry, for example the pairs illness / disease prevention and illness / sick leave (EUROVOC). The first term in each pair is much more general than the second.
Considering the problems involved in formalizing traditional information retrieval thesauri to adapt them to the contemporary level of ontological research, some authors propose changing the thesaurus’s traditional system of relations to a formalized set of predicates and providing axioms for such a set (Soergel et al. 2004). However, in creating such multidomain resources as RuThes, it is very difficult to find a universal set of semantic relations and apply them consistently. Therefore, we substituted the traditional thesaurus relation of symmetric association with another quite generalized relation, which can be applied in many different domains. We usually refer to this relation as a nonsymmetrical association, asc1-asc2. The definition of this relation is again based on a variant of ontological dependence, the so-called external dependence in ontological terms (Gangemi et al. 2003; Guarino 2009). This relation is established between two concepts c1 and c2 when two requirements are fulfilled:
- Neither class-subclass nor part-whole relations can be established between c1 and c2 in the thesaurus.
- The following assertion is true: “concept c2 exists” implies “concept c1 exists” (necessarily existent entities are excluded from consideration).
These two conditions mean that the concept c2 (the dependent concept) externally depends on c1: asc1(c2, c1) = asc2(c1, c2). Table 18.2 presents some examples of conceptual relationships in which conceptual dependence can be seen.
Table 18.2 Examples of conceptual dependence relations denoted as nonsymmetrical associations in RuThes
Type of dependence | Main concept (c1) | Dependent concept (c2) |
Instrument and the professional who uses it | skripka (violin) | skripač (violinist) |
Entity and the branch of science that studies such entities | životnoe (animal), serdce (heart) | zoologiâ (zoology), kardiologiâ (cardiology) |
Entity and related entity | bagaž (luggage) | bagažnaâ karusel’ (luggage carousel) |
Entity and actions that are applied to these entities | krov’ (blood), eda (food) | donorstvo krovi (blood donation), žarka (frying) |
Entity and its specific problems | les (forest), serdce (heart) | lesnoj požar (forest fire), bolezn’ serdca (heart disease) |
Entity and opposing entity or action | virus (virus) | antivirus (antivirus) |
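The nonsymmetrical association can be sketched as one stored link that is readable in two directions, with asc1 read from the dependent concept to the main concept and asc2 as its inverse. The class `AssociationStore` and its methods are hypothetical, and the check for the first requirement is simplified: it only rejects pairs that are passed in as already hierarchically linked and does not follow transitive chains.

```python
from collections import defaultdict


class AssociationStore:
    """Stores directed dependence links and exposes both reading directions."""

    def __init__(self, hierarchical_pairs=()):
        # Unordered pairs already linked by class-subclass or part-whole.
        self._hierarchical = {frozenset(p) for p in hierarchical_pairs}
        self._asc1 = defaultdict(set)  # dependent concept -> main concepts
        self._asc2 = defaultdict(set)  # main concept -> dependent concepts

    def add(self, dependent, main):
        """Record that `dependent` (c2) externally depends on `main` (c1)."""
        if frozenset((dependent, main)) in self._hierarchical:
            raise ValueError("a hierarchical relation already links these concepts")
        self._asc1[dependent].add(main)
        self._asc2[main].add(dependent)

    def asc1(self, c2, c1):
        """True if c2 depends on c1."""
        return c1 in self._asc1.get(c2, ())

    def asc2(self, c1, c2):
        """True if c1 has c2 as a dependent concept."""
        return c2 in self._asc2.get(c1, ())


# Sample data: the violin pair is from Table 18.2, the bone pair from Table 18.1.
store = AssociationStore(hierarchical_pairs=[("kost’", "skelet")])
store.add("skripač", "skripka")  # violinist depends on violin
```

Because both directions are written by the same `add` call, asc1(c2, c1) and asc2(c1, c2) always agree, while the link itself stays nonsymmetrical.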
Relations of ontological dependence are applicable to various domains; therefore, they are usually used in top-level ontologies (Gangemi et al. 2003). An additional advantage of using these relations in thesauri for automatic document processing is their usefulness for describing links between a concept based on the sense of a compositional multiword expression and concepts corresponding to the components of this multiword expression. As a result, a multiword-based concept (e.g., Automobile racing) is described as the dependent concept and its component concept (Automobile) as the main concept. This allows us to introduce concepts based on various types of multiword expressions and to establish their necessary relations.