2.1 German Stemmers
In this section we provide an overview of the German stemmers that we studied, briefly outlining their availability and the algorithms used. We illustrate the differences between them with the example in Fig. 1, where we stemmed the word “Adler” (eagle). For each stemmer, we show the stem produced and the other words reduced to the same stem. All stemmers except Text::German share the same preprocessing steps: lowercasing the word and replacing umlauts with their normalized vowel versions (e.g., ü is replaced with ue). These steps will therefore not be mentioned below.
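The shared preprocessing can be sketched in a few lines of Python. This is a minimal illustration, not code taken from any of the stemmers: the function name `preprocess` is hypothetical, and the mappings for ä and ö are assumed by analogy with the ü example given above.

```python
# Minimal sketch of the shared preprocessing (hypothetical helper).
# Only the mapping ü -> ue is stated in the text; ä -> ae and ö -> oe
# are assumed by analogy.
UMLAUT_TABLE = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def preprocess(word: str) -> str:
    """Lowercase the word, then replace umlauts with vowel digraphs."""
    return word.lower().translate(UMLAUT_TABLE)


if __name__ == "__main__":
    for w in ["Adler", "Müller", "Ärger"]:
        print(w, "->", preprocess(w))
```

Lowercasing first means that capital umlauts such as “Ä” are covered by the same three-entry table.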
Snowball. In 1980, Martin Porter developed the Porter stemmer for English (Porter 1980). It became by far the most widely used stemmer for English. The Snowball team has since developed stemmers for many European languages, which are included as a set in important natural language processing toolkits such as NLTK (Bird et al. 2009) for Python or Lingua::Stem for Perl.
The Snowball German stemmer is an adaptation of the original English version and thus restricts itself to suffix stripping. It defines two regions, R1 and R2, where R1 “is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel” and R2 is defined in the same way, except that the definition is applied inside R1. After defining R1 and R2, Snowball deletes a number of suffixes if they appear in R1 or R2. It does not do this recursively but in three steps, in each of which at most one suffix can be stripped. The first two steps strip fairly common suffixes like “ern” or “est”, while the third step strips derivational suffixes, e.g., “isch” or “keit”, which are fairly uncommon.
In our example, the Snowball stemmer correctly places “Adlers” (eagle, genitive case), “Adlern” (eagles, dative case) and “Adler” (eagle) together in the stem “adl”. However, it also incorrectly stems “adle”, the first person singular of “adeln” (to ennoble), to “adl”. This is because the restriction on how short stems can become is defined in terms of R1 and R2, as explained above, and in this example, R1 for all four words is the part after “adl”.
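The region definitions can be made concrete with a short Python sketch. It follows the quoted definition of R1 and R2. It also assumes the adjustment from the published Snowball description of the German stemmer, under which R1 is moved so that at least three letters precede it; that adjustment is not spelled out in this section. The helper names `region_start` and `r1_r2` are hypothetical, and no suffix stripping is performed.

```python
# Sketch of the R1/R2 regions (hypothetical helpers, no suffix stripping).
VOWELS = set("aeiouyäöü")


def region_start(word: str, offset: int = 0) -> int:
    """Index just after the first non-vowel that follows a vowel,
    searching inside word[offset:]; len(word) if there is none."""
    for i in range(offset + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word: str, min_prefix: int = 3) -> tuple[str, str]:
    """Return the regions (R1, R2) of an already preprocessed word."""
    r1 = region_start(word)
    r2 = region_start(word, r1)  # same definition, applied inside R1
    # Assumed adjustment: at least `min_prefix` letters precede R1.
    r1 = min(max(r1, min_prefix), len(word))
    return word[r1:], word[r2:]


if __name__ == "__main__":
    for w in ["adler", "adlers", "adlern", "adle"]:
        print(w, r1_r2(w))
```

Under these assumptions, R1 begins directly after “adl” for all four example words, so a suffix such as “e” in “adle” lies inside R1 and is eligible for stripping.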
Text::German. The stemmer in the Perl CPAN Module Text::German was, as far as we could find out, developed in 1996 at the Technical University of Darmstadt by Ulrich Pfeifer, following work by Gudrun Putze-Meier for which no reference is available. It is not currently actively supported. We made a number of efforts to contact both scientists but were unsuccessful.
What sets Text::German apart from the other stemmers examined here is the fact that it strips prefixes, and that it uses small lists of prefixes, suffixes and roots to identify the different parts of a word. Although the implementation in CPAN has significant flaws, the idea is novel and produced good results, as can be seen in Sect. 3.3.
While the behaviour of Text::German is at times difficult to understand due to its binary-encoded rules, we think that its performance on our example is primarily due to two factors. One is that “ers” is not in its list of suffixes, which is why “Adlers” is stemmed to itself. The other is that it does not lowercase stems, which results in “adle” (correctly) being stemmed separately.
Caumanns. The stemmer proposed by Caumanns (1999) is unique in two ways. One is that it uses recursive suffix stripping of the character sequences “e”, “s”, “n”, “t”, “em”, “er” and “nd”, which are the letters out of which every declensional suffix for German is built. The other is that it strips “ge” before and after the word, which makes it one of the two stemmers that strip prefixes. It also substitutes “sch”, “ch”, “ei” and “ie” with special characters so that they are not separated, and restores them at the end of the stemming process.
In our example, the Caumanns stemmer conflates all four words to the same stem “adl”. This is because of the recursive suffix stripping and because its length constraint only prevents stems shorter than three characters, which is why “adle” was stemmed to “adl”, which is exactly three characters long.
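The behaviour described above can be reproduced with a simplified Python sketch. It is not Caumanns' full algorithm: it covers only the recursive stripping of the seven listed sequences, the three-character minimum, and the protection of “sch”, “ch”, “ei” and “ie”. The “ge” handling and umlaut substitution are omitted. The placeholder characters and the function names are assumptions made for illustration.

```python
# Simplified sketch in the spirit of the Caumanns stemmer (not the full
# algorithm). Placeholder characters are arbitrary illustrative choices.
SUFFIXES = ("em", "er", "nd", "e", "s", "n", "t")
PROTECTED = (("sch", "$"), ("ch", "§"), ("ei", "%"), ("ie", "&"))
MIN_STEM_LEN = 3


def strip_recursively(word: str) -> str:
    """Strip listed suffixes until none applies or the stem would
    become shorter than MIN_STEM_LEN characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return strip_recursively(word[: -len(suffix)])
    return word


def caumanns_like_stem(word: str) -> str:
    word = word.lower()
    for sequence, mark in PROTECTED:   # "sch" must be handled before "ch"
        word = word.replace(sequence, mark)
    word = strip_recursively(word)
    for sequence, mark in PROTECTED:   # restore the protected sequences
        word = word.replace(mark, sequence)
    return word


if __name__ == "__main__":
    for w in ["Adler", "Adlers", "Adlern", "adle", "Melodie"]:
        print(w, "->", caumanns_like_stem(w))
```

In this sketch all four example words end up as “adl”, while “Melodie” is left intact because its final “ie” is protected from the “e” rule.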
UniNE. The UniNE stemmer, developed by Savoy (2006) at the University of Neuchâtel, has an aggressive and a light stemming option.
Light Option. The light option merely attempts to strip plural morphemes. After the standard umlaut substitutions, it strips one of “nen”, “se”, “e” before one of “n”, “r” and “s”, or one of “n”, “r” and “s” at the end of the word. As only one of these options can take effect, it is a very conservative stemmer.
In the “Adler” example, the stemmer stems “Adlers” and “Adlern” to “adler”, and “Adler” and “adle” to “adle”. It does not go further because it removes at most two letters and does not strip suffixes recursively.
Aggressive Option. The aggressive option goes through a number of suffix-stripping steps, which always depend on the length of the word. The difference from the other stemmers is that UniNE has two groups of stripping operations and at most one operation out of each group is executed. Also, its conditions for stripping “s” and “st” are very similar to those of the Snowball stemmer, which defines lists of consonants that are valid s-endings and st-endings, respectively; one of these consonants has to occur directly before the “s” or “st” for that suffix to be stripped.
This stemmer’s main problem in our example is that it stems “Adlers” to itself because “r” is not included in its list of valid s-endings which have to occur before “s” for it to be stripped.
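The s-ending condition can be illustrated with a small Python sketch. The consonant list below is an assumption made for illustration; the only property taken from the text is that “r” is missing from it. The function name `strip_final_s` is hypothetical, and the length conditions of the actual stemmer are left out.

```python
# Sketch of an s-ending check (hypothetical helper).
# Assumed list: the text only states that "r" is NOT a valid s-ending.
VALID_S_ENDINGS = frozenset("bdfghklmnt")


def strip_final_s(word: str, valid=VALID_S_ENDINGS) -> str:
    """Strip a final "s" only if the preceding letter is a valid s-ending."""
    if len(word) >= 2 and word.endswith("s") and word[-2] in valid:
        return word[:-1]
    return word


if __name__ == "__main__":
    print(strip_final_s("adlers"))                          # "r" not valid
    print(strip_final_s("hunds"))                           # "d" is valid
    print(strip_final_s("adlers", VALID_S_ENDINGS | {"r"}))  # with "r" added
```

With “r” added to the list, as in the last call, the final “s” of “adlers” would be stripped, which is the difference the example turns on.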