Wednesday, October 23, 2013

Affixes in Hunspell

The affixes or the bound morphemes of the target language reside in the Hunspell *.aff file in collections of rules called classes. In addition, the *.aff file contains language-specific settings and how affixes and stems are combined. But, the main focus here will be the classes.

Affix Classes

Affix classes are a collection of related forms organized into rules which specify how they can combine with entries (roots or stems) found in the dictionary file (*.dic).

The first line of the class, specifies its type, its flag or unique identifier, whether it can combine with other classes and the total number of rules in the class. The next line is the first rule of the class. Each following line thereafter is a different rule and a different form of the affix and how it is applied. A good example of this is the regular English plural, “-s”. Depending on the end of the word, the form is “-es” if the word ends in “s”, “x”, or “z”. Or, the form would be “-ies” if the word ends in “-y”. Otherwise, it’s just “-s”. Each of these would have its own rule.

Rules are formatted into columns separated by white space. The first column specifies one of the two types of affixes that Hunspell recognizes, “SFX”  for “suffix” and “PFX” for “prefix”.

AffixRule

Column two of the rule is the flag or unique identifier. This is the what is used with entries in the dictionary file. The flag can be any arbitrary upper- or lower-case letter (Hunspell distinguishes case), punctuation mark or symbol. A setting can enable two-letter flags or numbers.

Column three contains the letters to remove before the form is added. If nothing needs to be removed, it is just “0” (zero character). If the affix is a prefix, letters at the begging of the stem or root are removed before adding the form; if it is a suffix, then letters are removed from the end.

Column four contains the form or allomorph of the affix as applied in the condition in column five. As in the English plural example this would be “-es”, “-ies” or “-s”.

Column five specifies the condition using a regular expression where the form can be applied. Again, using the aforementioned example, “s”, “x” or “z” would be the conditions where it should not apply can also be supplied here.

Additional information can follow the fifth column in an optional sixth column, but it is only useful for debugging or adding comments. (Not shown in the example above)

In the example rule, the class (“S”) is specified by the first line. It is a very simplified rule that applies the regular plural in English. The first rule deals with roots that end in “-y”; it excludes “vowel-y” combinations. The “-y” in the third column, specifies that “y” will be removed form the end of the stem before the form is added. The second rule says, “if the end of the stem ends in a vowel and “-y”, just add “-s”. Notice that no letters are removed (the “0”) as in the rest of the rules unlike the first rule. The third rule shows how to deal with roots or stems that end in “-s”, “-x” or “-z”. Add “-es”. And, the fourth shows the default case. In others words, if the stem does not end in the groups of letters in the square brackets. The caret (“^”) in cases where there are a list of letters means “NOT”.

[*.dic file]
1
pot/S
ax/S
party/S
 

The *.dic file stub above just shows how the affix is applied to a root or stem. According to the affix class, pots is valid, but *potes or *poties are invalid; axes is valid, but not *axs nor *axies; parties is valid, but not *partys nor *partyes.

In the Hunspell dictionaries for English -- there were several for different areas and fields -- there were under 20 affix classes. For languages, such as Spanish, there are a few more classes and a long list of rules. As for Hungarian, there were far more classes and hundreds of rules for some of the classes… Yes, hundreds! Hungarian, by the way, is the reason why this engine is called “Hunspell” because it was a fork for a previous engine to better handle Hungarian which is known for its extensive agglutinating morphology.

As for my experimentations for Iloko, however, the number of affix classes has numbered into the hundreds depending on the approach taken. In one iteration, the *.aff file reached 20+ MB. I have been experimenting and have drastically reduced the file size to under four MB with the latest iteration.

[Update: I have since changed my initial approach to keep the complexity of the affix classes and the rules down. You can read more here.]

[Update: I have further changed my approach and have reduced the “bloat”.]

No comments:

Post a Comment