Searching strategies for the Hungarian language

https://doi.org/10.1016/j.ipm.2007.01.022

Abstract

This paper reports on the underlying IR problems encountered when dealing with the complex morphology and compound constructions found in the Hungarian language. It describes evaluations of two general stemming strategies for this language, and demonstrates that a light stemming approach can be quite effective. Based on searches carried out on the CLEF test collection, we find that a more aggressive suffix-stripping approach may produce better MAP: compared with an IR scheme without stemming, or one based on only a light stemmer, the differences are statistically significant. Comparing probabilistic, vector-space and language models, we find that the Okapi model yields the best retrieval effectiveness, with a MAP about 35% better than that of the classical tf-idf approach, particularly for very short requests. Finally, we demonstrate that applying an automatic decompounding procedure to both queries and documents significantly improves IR performance (+10%) over word-based indexing strategies.

Introduction

The majority of European languages belong to the Indo-European family and thus share various syntactic features as well as words in their basic lexicon, at least from a phonological point of view. The Hungarian, Finnish and Basque languages, however, have fewer characteristics in common with these languages. The English lexicon, for example, contains only a few words of Hungarian origin (e.g., saber, paprika, goulash), while the Hungarian lexicon contains many more words borrowed from English (e.g., modern, interview, sport, jury, pedigree, computer, internet).

During the first CLEF (www.clef-campaign.org) evaluation campaigns (Peters et al., 2006), the emphasis was placed on the Romance (e.g., French, Italian, and Spanish) and Germanic (e.g., German, Dutch, and Swedish) language families (Sproat, 1992). From an IR point of view these languages are closer to English, while Hungarian represents a special case, especially given its more complex morphology and agglutinative nature. Moreover, only a few IR experiments have been conducted with the Hungarian language: not until 2005 did the CLEF evaluation forum include it in one of its tracks, when a real and reasonably large test collection meeting the required international standards was developed (Harman, 2005, Buckley and Voorhees, 2005, Gordon and Pathak, 1999). The main objective of our paper is therefore to study retrieval for the Hungarian language. This paper is organized as follows. Section 2 presents the context and related work, while Section 3 describes the main characteristics of the test collection. Section 4 briefly describes the IR models used during our experiments. Section 5 evaluates three stemming approaches and compares the retrieval effectiveness of word-based indexing schemes with schemes in which words are automatically decompounded. The main findings of this paper are summarized in Section 6.

Section snippets

Context and related work

In order to identify pertinent matches between search keywords and documents, very frequently occurring terms in a given language are usually removed, since such words tend not to carry clear and important meaning (e.g., the, in, but, some). For the Hungarian language, and following the guidelines suggested by Fox (1990), we first created a list of the top 200 most frequently occurring words found in the corpus, from which certain content-bearing words were removed (e.g., police, minister, president, Magyar). To
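
Such a frequency-based stoplist can be derived directly from the corpus. The sketch below is a minimal illustration of this procedure, assuming the corpus is available as an iterable of raw article strings; the tokenizer and the hand-curated keep-list are our own simplifications, not the tools actually used in the paper.

```python
import re
from collections import Counter

def build_stoplist(articles, top_n=200, keep=frozenset()):
    """Take the top_n most frequent corpus words as stopword
    candidates, then drop content-bearing words (e.g., 'police',
    'minister') that a human reviewer placed on the keep list."""
    counts = Counter()
    for text in articles:
        # Crude tokenizer; the original work applied its own rules.
        counts.update(re.findall(r"\w+", text.lower()))
    candidates = [w for w, _ in counts.most_common(top_n)]
    return [w for w in candidates if w not in keep]

# Hypothetical usage: keep 'magyar' out of the stoplist by hand.
# stoplist = build_stoplist(corpus, keep=frozenset({"magyar"}))
```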

Test collection

The corpus used in our experiments is composed of articles extracted from the newspaper Magyar Hírlap, published in 2002. This corpus was made available for the CLEF evaluation campaigns in 2005 and 2006, and contains 49,530 documents, or around 105 MB of data, encoded in UTF-8. On average, each article contains about 142 indexing terms (108 distinct indexing terms) with a standard deviation of 140 (minimum: 2, maximum: 4,984). A typical document in this collection begins with a short
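
These per-document statistics are straightforward to reproduce. As a small sketch, assuming each document has already been reduced to its list of indexing terms:

```python
from statistics import mean, stdev

def collection_stats(docs):
    """docs: one token list per document. Returns the mean and
    standard deviation of document length, the min/max lengths,
    and the mean number of distinct indexing terms."""
    lengths = [len(d) for d in docs]
    distinct = [len(set(d)) for d in docs]
    return (mean(lengths), stdev(lengths), min(lengths),
            max(lengths), mean(distinct))
```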

IR models

In order to obtain a broader view of the relative merit of the various retrieval models and stemming approaches, we used two vector-space schemes and three probabilistic models. First we adopted the classical tf-idf model, in which the weight attached to each indexing term is the product of its term occurrence frequency (tf_ij, for indexing term t_j in document d_i) and its inverse document frequency (idf_j). To measure similarities between documents and requests, we computed the inner product.
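
As a concrete reading of this weighting scheme, the sketch below ranks documents by the inner product of tf-idf vectors. The paper does not specify its exact idf formulation, so the common idf_j = ln(n / df_j) variant used here is an assumption.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """docs: one token list per document. Returns the inner-product
    score between the query and each document, where each term
    weight is tf * idf and idf_j = ln(n / df_j)."""
    n = len(docs)
    df = Counter()                      # df_j: docs containing t_j
    for d in docs:
        df.update(set(d))
    idf = {t: math.log(n / df[t]) for t in df}
    q_tf = Counter(query_terms)
    scores = []
    for d in docs:
        d_tf = Counter(d)
        # Only terms shared by query and document contribute.
        scores.append(sum((q_tf[t] * idf[t]) * (d_tf[t] * idf[t])
                          for t in q_tf if t in d_tf))
    return scores
```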

Evaluation methodology

To evaluate our various IR schemes, we measured retrieval performance using the mean average precision (MAP) computed by the trec_eval software (based on a maximum of 1000 retrieved records per query). This measure has been used by the major evaluation campaigns for more than 15 years to compare various IR strategies objectively, particularly with regard to their ability to retrieve relevant items (ad hoc tasks) (Braschler and Peters, 2004, Buckley and Voorhees, 2005).
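
For reference, MAP can be re-implemented in a few lines. This is a plain restatement of the standard definition, not the trec_eval code itself, with ranked lists capped at 1000 items as above.

```python
def average_precision(ranked, relevant, cutoff=1000):
    """ranked: doc ids in rank order; relevant: set of relevant ids.
    Averages precision at each rank holding a relevant document,
    normalized by the total number of relevant documents."""
    hits, ap_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            ap_sum += hits / rank
    return ap_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, q) for r, q in runs) / len(runs)
```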

Using the mean as a

Conclusion

In this paper we described the most important linguistic features of the Hungarian language from an IR perspective. Not only does this language use a relatively large set of unambiguous suffixes, but its morphology is also complex, in part because possessive markers may be combined with the suffix construction. Using a test collection drawn from the CLEF 2005 and 2006 campaigns and containing 98 requests, we evaluated three probabilistic and two vector-space models. When using the
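
To make the notion of a light stemmer concrete, the sketch below strips one layer of frequent, largely unambiguous Hungarian noun endings (a few case suffixes and the plural marker). The suffix inventory is only illustrative; the stemmers evaluated in the paper cover a larger set, including the possessive markers mentioned above.

```python
# A few frequent Hungarian case endings and the plural marker,
# ordered longest-first so that e.g. '-ban' is tried before '-k'.
_SUFFIXES = ["ban", "ben", "nak", "nek", "val", "vel",
             "ból", "ből", "ról", "ről", "tól", "től",
             "hoz", "hez", "höz", "ba", "be", "on", "en", "ön",
             "ok", "ek", "ök", "t", "k"]

def light_stem(word, min_stem=3):
    """Remove at most one suffix, keeping a stem of at least
    min_stem characters as a crude guard against over-stemming."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

# Example: light_stem("házban") -> "ház" ('in the house' -> 'house')
```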

Acknowledgment

This research was supported in part by the Swiss NSF under Grant # 200021-113273.

References (35)

  • M. Gordon et al. Finding information on the world wide web: The retrieval effectiveness of search engines. Information Processing and Management (1999).
  • S.E. Robertson et al. Experimentation as a way of life: Okapi at TREC. Information Processing and Management (2000).
  • J. Savoy. Statistical inference in retrieval effectiveness evaluation. Information Processing and Management (1997).
  • S. Abdou et al. Statistical and comparative evaluation of various indexing and search models.
  • G. Amati et al. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems (2002).
  • M. Braschler et al. Cross-language evaluation forum: Objectives, results, achievements? IR Journal (2004).
  • M. Braschler et al. How effective is stemming and decompounding for German text retrieval? IR Journal (2004).
  • C. Buckley et al. New retrieval approaches using SMART.
  • C. Buckley et al. Retrieval system evaluation.
  • A. Chen. Cross-language retrieval experiments at CLEF 2002.
  • G.M. Di Nunzio et al. Experiments to evaluate probabilistic models for automatic stemmer generation and query word translation.
  • C. Fox. A stop list for general text. SIGIR Forum (1990).
  • P. Haláscy. Benefits of deep NLP-based lemmatization for information retrieval (2006)....
  • D. Harman. How effective is suffixing? Journal of the American Society for Information Science (1991).
  • D.K. Harman. The TREC ad hoc experiments.
  • S.P. Harter. Online information retrieval: Concepts, principles and techniques (1986).
  • D. Hiemstra. Using language models for information retrieval. CTIT Ph.D.... (2000).