Searching strategies for the Hungarian language

https://doi.org/10.1016/j.ipm.2007.01.022

Abstract

This paper reports on the underlying IR problems encountered when dealing with the complex morphology and compound constructions found in the Hungarian language. It describes evaluations of two general stemming strategies for this language, and demonstrates that a light stemming approach can be quite effective. Based on searches carried out on the CLEF test collection, we find that a more aggressive suffix-stripping approach may produce better MAP: compared with an IR scheme without stemming, or one based on only a light stemmer, the differences are statistically significant. Comparing probabilistic, vector-space and language models, we find that the Okapi model yields the best retrieval effectiveness, with a MAP about 35% better than that of the classical tf-idf approach, particularly for very short requests. Finally, we demonstrate that applying an automatic decompounding procedure to both queries and documents significantly improves IR performance (+10%) over word-based indexing strategies.

Introduction

The majority of European languages belong to the Indo-European family and thus share various syntactic features as well as words in their basic lexicon, at least from a phonological point of view. The Hungarian, Finnish and Basque languages, however, have fewer characteristics in common with these languages. The English lexicon, for example, contains only a few words of Hungarian origin (e.g., saber, paprika, goulash), while the Hungarian lexicon contains many more words borrowed from English (e.g., modern, interview, sport, jury, pedigree, computer, internet).

During the first CLEF (www.clef-campaign.org) evaluation campaigns (Peters et al., 2006), the emphasis was placed on the Romance (e.g., French, Italian, and Spanish) and Germanic (e.g., German, Dutch, and Swedish) language families (Sproat, 1992). From an IR point of view these languages are closer to English, while Hungarian represents a special case, especially given its more complex morphology and agglutinative nature. Moreover, only a few IR experiments have been conducted with the Hungarian language: not until 2005 did the CLEF evaluation forum include it in one of its tracks, when a real and reasonably large test collection meeting the required international standards was developed (Harman, 2005, Buckley and Voorhees, 2005, Gordon and Pathak, 1999). The main objective of our paper is therefore to study retrieval for the Hungarian language. This paper is organized as follows. Section 2 presents the context and related work, while Section 3 describes the main characteristics of the test collection. Section 4 briefly describes the IR models used during our experiments. Section 5 evaluates three stemming approaches and compares the retrieval effectiveness of word-based indexing schemes with schemes in which words are automatically decompounded. The main findings of this paper are summarized in Section 6.

Section snippets

Context and related work

In order to identify pertinent matches between search keywords and documents, very frequently occurring terms in a given language are usually removed, since such words tend not to carry clear and important meaning (e.g., the, in, but, some). For the Hungarian language, and following the guidelines suggested by Fox (1990), we first created a list of the top 200 most frequently occurring words found in the corpus, from which certain content-bearing words were removed (e.g., police, minister, president, Magyar). To
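
Such a frequency-based stoplist can be derived directly from the corpus. The sketch below is a minimal illustration of this procedure, assuming the corpus is available as an iterable of raw article strings; the tokenizer and the hand-curated keep-list are our own simplifications, not the tools actually used in the paper.

```python
import re
from collections import Counter

def build_stoplist(articles, top_n=200, keep=frozenset()):
    """Take the top_n most frequent corpus words as stopword
    candidates, then drop content-bearing words (e.g., 'police',
    'minister') that a human reviewer placed on the keep list."""
    counts = Counter()
    for text in articles:
        # Crude tokenizer; the original work applied its own rules.
        counts.update(re.findall(r"\w+", text.lower()))
    candidates = [w for w, _ in counts.most_common(top_n)]
    return [w for w in candidates if w not in keep]

# Hypothetical usage: keep 'magyar' out of the stoplist by hand.
# stoplist = build_stoplist(corpus, keep=frozenset({"magyar"}))
```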

Test collection

The corpus used in our experiments is composed of articles extracted from the newspaper Magyar Hírlap, published in 2002. This corpus was made available for the CLEF evaluation campaigns in 2005 and 2006, and contains 49,530 documents, or around 105 MB of data, encoded in UTF-8. On average, each article contains about 142 indexing terms (108 distinct indexing terms) with a standard deviation of 140 (minimum: 2, maximum: 4,984). A typical document in this collection begins with a short
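
These per-document statistics are straightforward to reproduce. As a small sketch, assuming each document has already been reduced to its list of indexing terms:

```python
from statistics import mean, stdev

def collection_stats(docs):
    """docs: one token list per document. Returns the mean and
    standard deviation of document length, the min/max lengths,
    and the mean number of distinct indexing terms."""
    lengths = [len(d) for d in docs]
    distinct = [len(set(d)) for d in docs]
    return (mean(lengths), stdev(lengths), min(lengths),
            max(lengths), mean(distinct))
```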

IR models

In order to obtain a broader view of the relative merit of the various retrieval models and stemming approaches, we used two vector-space schemes and three probabilistic models. First we adopted the classical tf-idf model, in which the weight attached to each indexing term is the product of its term occurrence frequency (tf_ij, for indexing term t_j in document d_i) and its inverse document frequency (idf_j). To measure similarities between documents and requests, we computed the inner product.
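
As a concrete reading of this weighting scheme, the sketch below ranks documents by the inner product of tf-idf vectors. The paper does not specify its exact idf formulation, so the common idf_j = ln(n / df_j) variant used here is an assumption.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """docs: one token list per document. Returns the inner-product
    score between the query and each document, where each term
    weight is tf * idf and idf_j = ln(n / df_j)."""
    n = len(docs)
    df = Counter()                      # df_j: docs containing t_j
    for d in docs:
        df.update(set(d))
    idf = {t: math.log(n / df[t]) for t in df}
    q_tf = Counter(query_terms)
    scores = []
    for d in docs:
        d_tf = Counter(d)
        # Only terms shared by query and document contribute.
        scores.append(sum((q_tf[t] * idf[t]) * (d_tf[t] * idf[t])
                          for t in q_tf if t in d_tf))
    return scores
```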

Evaluation methodology

To evaluate our various IR schemes, we measured retrieval performance using the mean average precision (MAP) computed by the trec_eval software (based on a maximum of 1000 retrieved records per query). This measure has been used by the major evaluation campaigns for more than 15 years to compare various IR strategies objectively, particularly with regard to their ability to retrieve relevant items (ad hoc tasks) (Braschler and Peters, 2004, Buckley and Voorhees, 2005).
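
For reference, MAP can be re-implemented in a few lines. This is a plain restatement of the standard definition, not the trec_eval code itself, with ranked lists capped at 1000 items as above.

```python
def average_precision(ranked, relevant, cutoff=1000):
    """ranked: doc ids in rank order; relevant: set of relevant ids.
    Averages precision at each rank holding a relevant document,
    normalized by the total number of relevant documents."""
    hits, ap_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            ap_sum += hits / rank
    return ap_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, q) for r, q in runs) / len(runs)
```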

Using the mean as a

Conclusion

In this paper we described the most important linguistic features of the Hungarian language from an IR perspective. Not only does this language use a relatively large set of unambiguous suffixes, but its morphology is also complex, in part because possessive markers may be combined with the suffix construction. Using a test collection drawn from the CLEF 2005 and 2006 campaigns and containing 98 requests, we evaluated three probabilistic and two vector-space models. When using the
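
To make the notion of a light stemmer concrete, the sketch below strips one layer of frequent, largely unambiguous Hungarian noun endings (a few case suffixes and the plural marker). The suffix inventory is only illustrative; the stemmers evaluated in the paper cover a larger set, including the possessive markers mentioned above.

```python
# A few frequent Hungarian case endings and the plural marker,
# ordered longest-first so that e.g. '-ban' is tried before '-k'.
_SUFFIXES = ["ban", "ben", "nak", "nek", "val", "vel",
             "ból", "ből", "ról", "ről", "tól", "től",
             "hoz", "hez", "höz", "ba", "be", "on", "en", "ön",
             "ok", "ek", "ök", "t", "k"]

def light_stem(word, min_stem=3):
    """Remove at most one suffix, keeping a stem of at least
    min_stem characters as a crude guard against over-stemming."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

# Example: light_stem("házban") -> "ház" ('in the house' -> 'house')
```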

Acknowledgment

This research was supported in part by the Swiss NSF under Grant # 200021-113273.

References (35)

  • M. Gordon et al. Finding information on the world wide web: The retrieval effectiveness of search engines. Information Processing and Management (1999).
  • S.E. Robertson et al. Experimentation as a way of life: Okapi at TREC. Information Processing and Management (2000).
  • J. Savoy. Statistical inference in retrieval effectiveness evaluation. Information Processing and Management (1997).
  • S. Abdou et al. Statistical and comparative evaluation of various indexing and search models.
  • G. Amati et al. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (2002).
  • M. Braschler et al. Cross-language evaluation forum: Objectives, results, achievements? IR Journal (2004).
  • M. Braschler et al. How effective is stemming and decompounding for German text retrieval? IR Journal (2004).
  • C. Buckley et al. New retrieval approaches using SMART.
  • C. Buckley et al. Retrieval system evaluation.
  • A. Chen. Cross-language retrieval experiments at CLEF 2002.
  • G.M. Di Nunzio et al. Experiments to evaluate probabilistic models for automatic stemmer generation and query word translation.
  • C. Fox. A stop list for general text. SIGIR Forum (1990).
  • P. Haláscy. Benefits of deep NLP-based lemmatization for information retrieval (2006)....
  • D. Harman. How effective is suffixing? Journal of the American Society for Information Science (1991).
  • D.K. Harman. The TREC ad hoc experiments.
  • S.P. Harter. Online information retrieval: Concepts, principles and techniques (1986).
  • D. Hiemstra. Using language models for information retrieval. CTIT Ph.D.... (2000).