Searching strategies for the Hungarian language
Introduction
The majority of European languages belong to the Indo-European family and thus share various syntactic features as well as words in their basic lexicon, at least from a phonological point of view. The Hungarian, Finnish, and Basque languages, however, have fewer characteristics in common with these languages. The English lexicon, for example, has only a few words of Hungarian origin (e.g., saber, paprika, goulash), while the Hungarian lexicon contains many more words borrowed from English (e.g., modern, interview, sport, jury, pedigree, computer, internet).
During the first CLEF (www.clef-campaign.org) evaluation campaigns (Peters et al., 2006), the emphasis was placed on the Romance (e.g., French, Italian, and Spanish) and Germanic (e.g., German, Dutch, and Swedish) families of languages (Sproat, 1992). From an IR point of view these languages are closer to English, while Hungarian represents a special case, especially given its more complex morphology and agglutinative aspects. Moreover, only a few IR experiments have been conducted with the Hungarian language. In fact, not until 2005 did the CLEF evaluation forum include this language in one of its tracks, when a real and reasonably large test collection respecting the required international standards was developed (Harman, 2005; Buckley and Voorhees, 2005; Gordon and Pathak, 1999). The main objective of our paper is therefore to study retrieval strategies for the Hungarian language. This paper is organized as follows. Section 2 presents the context and related work, while Section 3 depicts the main characteristics of the test collection. Section 4 briefly describes the IR models used during our experiments. Section 5 evaluates three stemming approaches and compares the retrieval effectiveness of word-based schemes with those in which words are automatically decompounded. The main findings of this paper are summarized in Section 6.
Section snippets
Context and related work
In order to define pertinent matches between search keywords and documents, very frequently occurring terms in any given language are usually removed. These words tend not to have clear and important meanings (e.g., the, in, but, some). For the Hungarian language and following the guidelines suggested by Fox (1990), we first created a list of the top 200 most frequently occurring words found in the corpus, from which certain words were removed (e.g., police, minister, president, Magyar). To
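The frequency-based first step described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name `candidate_stopwords`, the simple `\w+` tokenizer, and the `keep` parameter (content words to exclude from the list by hand, such as "magyar") are all assumptions introduced here.

```python
import re
from collections import Counter


def candidate_stopwords(documents, top_n=200, keep=frozenset()):
    """Return the top_n most frequent tokens, minus the words in `keep`.

    `documents` is an iterable of raw text strings. Tokens are lowercased,
    so entries in `keep` should be lowercase as well.
    """
    counts = Counter()
    for text in documents:
        # \w+ is Unicode-aware for str in Python 3, so accented
        # Hungarian letters (e.g., á, ő, ű) stay inside tokens.
        counts.update(re.findall(r"\w+", text.lower()))
    # Words judged to carry meaning are filtered out after ranking.
    return [word for word, _ in counts.most_common(top_n) if word not in keep]
```

In practice the `keep` set would be built by inspecting the ranked list manually, since frequency alone cannot tell function words from frequent topical words in a newspaper corpus.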
Test collection
The corpus used in our experiments is composed of articles extracted from the newspaper Magyar Hírlap, published in 2002. This corpus was made available for the CLEF evaluation campaigns in 2005 and 2006, and contains 49,530 documents, or around 105 MB of data, encoded in UTF-8 format. On average, each article contains about 142 indexing terms (or 108 distinct indexing terms) with a standard deviation of 140 (minimum: 2, maximum: 4984). A typical document in this collection begins with a short
IR models
In order to obtain a broader view of the relative merit of the various retrieval models and stemming approaches, we used two vector-space schemes and three probabilistic models. First we adopted the classical tf idf model. In this case the weight attached to each indexing term was the product of its term occurrence frequency (or tf_ij for indexing term t_j in document d_i) and its inverse document frequency (or idf_j). To measure similarities between documents and requests, we computed the inner
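As an illustration of the tf idf weighting just described, the sketch below scores a document against a request with an inner product. It rests on two assumptions not confirmed by the snippet: that the truncated sentence refers to the inner product of the two weight vectors, and that idf_j is computed as ln(n / df_j), where n is the collection size and df_j the number of documents containing t_j. All function names are hypothetical.

```python
import math
from collections import Counter


def idf(term, docs):
    """Assumed idf variant: ln(n / df); 0.0 for terms absent from the collection."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def tfidf_vector(tokens, docs):
    """Map each term to tf * idf, with tf the raw occurrence count."""
    tf = Counter(tokens)
    return {term: tf[term] * idf(term, docs) for term in tf}


def inner_product(u, v):
    """Sum the products of weights over the terms shared by u and v."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)
```

Here `docs` is a list of token lists. A term that occurs in every document receives a weight of zero under this idf variant, so it contributes nothing to the score.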
Evaluation methodology
To evaluate our various IR schemes, we adopted the mean average precision (MAP) computed by the trec_eval software to measure retrieval performance (based on a maximum of 1000 retrieved records). This performance measure has been used by all evaluation campaigns for more than 15 years in order to objectively compare various IR strategies, particularly regarding their ability to retrieve relevant items (ad hoc tasks) (Braschler and Peters, 2004; Buckley and Voorhees, 2005).
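For readers unfamiliar with the measure, the following sketch shows how MAP is commonly computed (non-interpolated average precision, averaged over requests). It is a simplified illustration and not a replacement for trec_eval; the function names and the dictionary-based input format are assumptions, and each ranked list is assumed to contain no duplicate document identifiers.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Average of precision values at the ranks of retrieved relevant items,
    divided by the total number of relevant items for the request."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs, qrels, cutoff=1000):
    """Mean of per-request average precision.

    `runs` maps request id -> ranked list of document ids;
    `qrels` maps request id -> collection of relevant document ids.
    """
    scores = [
        average_precision(runs.get(query_id, []), relevant, cutoff)
        for query_id, relevant in qrels.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Because the denominator is the total number of relevant items, a relevant document that is never retrieved within the cutoff lowers the score of that request.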
Using the mean as a
Conclusion
In this paper we described the most important linguistic features of the Hungarian language from an IR perspective. Not only does this language use a relatively large set of unambiguous suffixes, but its morphology is also complex, because possessive pronouns are sometimes added to the suffix construction. Using a test collection extracted from the CLEF 2005 and 2006 suites containing 98 requests, we evaluated three probabilistic and two vector-space models. When using the
Acknowledgment
This research was supported in part by the Swiss NSF under Grant # 200021-113273.
References (35)
- et al. Finding information on the world wide web: The retrieval effectiveness of search engines. Information Processing and Management (1999)
- et al. Experimentation as a way of life: Okapi at TREC. Information Processing and Management (2000)
- Statistical inference in retrieval effectiveness evaluation. Information Processing and Management (1997)
- et al. Statistical and comparative evaluation of various indexing and search models
- et al. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (2002)
- et al. Cross-language evaluation forum: Objective, results, achievements? IR Journal (2004)
- et al. How effective is stemming and decompounding for German text retrieval? IR Journal (2004)
- et al. New retrieval approaches using SMART
- et al. Retrieval system evaluation
- Cross-language retrieval experiments at CLEF 2002
- Experiments to evaluate probabilistic models for automatic stemmer generation and query word translation
- A stop list for general text. SIGIR Forum
- How effective is suffixing? Journal of the American Society for Information Science
- The TREC ad hoc experiments
- Online information retrieval: concepts, principles and techniques