Wednesday, October 23, 2013

Roots and Stems in Hunspell

As I wrote in another post, a Hunspell dictionary is the way to go in order to create a spell-check dictionary for Iloko. Not only is implementing one relatively “easy”, but there are a plethora of languages that have dictionaries available. Because the dictionaries are open, I can gain insight about how they created their rules and apply the same techniques to Iloko.

According to the Hunspell manual for creating affix files and dictionary files, the first line of a dictionary file is an approximate number of entries and the rest of the lines are entries of stems and/or roots followed by the flags of the affix classes that are applicable. Again, the name of each file is the code for the target language, for example, “ilo.aff” and “ilo.dic” for Iloko. More information about the affix file (*.aff) can be found in this entry. But, for now we will focus on the dictionary (*.dic) file and some of the issues and considerations in creating one.

Format

The format of the entries in the *.dic file is quite simple: entries and their attributes are written on one line.

Example:

7
adda/ABC
ama/XYZ
ina/XYZ
anak/XYZ
puso/XYZ
magna/AGH
saan/KLM

In the example above, there are seven entries and each entry rests on a single line. The flags of the affix classes that apply to an entry follow the form after a slash (“/”). By default each affix class flag is only one character long, as in the example, e.g. “A”, “B” and “C” for the first entry. Two-character flags or numbers can also be used as affix class flags. Their use is signaled in the *.aff file.
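As a rough illustration of the format, the sample above can be read with a few lines of Python. This is only a sketch of the layout described here, not Hunspell's own parser; the function name `parse_dic` is made up for the example, and it assumes the default one-character flags.

```python
def parse_dic(text):
    """Parse *.dic text into (declared_count, {stem: [flags]})."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    declared_count = int(lines[0])      # first line: approximate number of entries
    entries = {}
    for line in lines[1:]:
        # Everything before the slash is the stem; everything after is the flags.
        stem, _, flag_field = line.partition("/")
        entries[stem] = list(flag_field)  # default mode: one character per flag
    return declared_count, entries


SAMPLE_DIC = """7
adda/ABC
ama/XYZ
ina/XYZ
anak/XYZ
puso/XYZ
magna/AGH
saan/KLM
"""

count, entries = parse_dic(SAMPLE_DIC)
print(count, entries["adda"])
```

An entry with no slash simply ends up with an empty list of flags.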

Although I have not reached the maximum number of single-character flags, I can imagine that the number of Iloko affixes, especially prefixes and reduplication (I will explain later), can reach that maximum. As a result, when I embarked on creating a spelling dictionary, I enabled two-character flags.
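The flag type is chosen in the *.aff file with the FLAG setting (“FLAG long” for two-character flags, “FLAG num” for comma-separated numbers). The sketch below shows how the flag field of an entry would be split under each setting. The function `split_flags` and the flag names “Aa” and “Bb” are hypothetical; they are not taken from my actual files.

```python
def split_flags(flag_field, mode="short"):
    """Split the text after the slash into individual flags.

    mode mirrors the FLAG setting of the *.aff file:
    "short" (default, one character), "long" (two characters),
    "num" (decimal numbers separated by commas).
    """
    if mode == "short":
        return list(flag_field)
    if mode == "long":
        if len(flag_field) % 2:
            raise ValueError("long flags must come in pairs of characters")
        return [flag_field[i:i + 2] for i in range(0, len(flag_field), 2)]
    if mode == "num":
        return flag_field.split(",")
    raise ValueError("unknown flag mode: " + mode)


print(split_flags("ABC"))             # three one-character flags
print(split_flags("AaBb", "long"))    # two two-character flags
print(split_flags("101,202", "num"))  # two numeric flags
```

The same string can mean different things under different settings, which is why the setting has to be declared before any entry is read.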

Issues to Consider

There are a few issues to consider as entries are added to the *.dic file. The first two have parallels in English, but the last one is specific to Iloko and possibly to other Philippine-type languages.

Categorization

In English each word belongs to one, two or more parts of speech or lexical categories, such as “nouns”, “verbs” and “prepositions”. Iloko roots and stems can be categorized in the same way, but they are more fluid. The English root tennis is a noun or can be used as an adjective, e.g. tennis shoes. However, unlike most nouns it cannot take the “-(e)s” plural *tennises, because in practice there is only one sport called “tennis”. Iloko has borrowed the word as tenis, and it can be classed as a noun as well, but it can be verbalized with the prefix ag-, e.g. agtenis “to play tennis”, an insight into the versatility of Iloko.

There are many such roots in Iloko which are nouns that can become verbs through affixation. Even words that would normally not have an “affix” may take them, e.g. wen “yes” versus wensa “maybe yes”. Technically “-sa” is an enclitic. Hunspell treats anything written after the stem or root as a suffix.

Great care must be taken to assign the right affix class to a root because its semantics must be considered. Just because a verbal affix can be used with one nominal root does not mean that it will work with all.

Differing Stems

In English, sing changes to sang in the past tense because of ancient Germanic ablaut or apophony. Ablaut can also be used to change the lexical category of the word, such as abode (noun) and abide (verb). Luckily for English speakers, sang does not have any further terminations, and the pair abode/abide are members of differing categories.

Iloko has roots that change as well. In Iloko, however, syncope, the loss of a unit of sound, is the most common change because of shifts in stress after affixation. In a few cases, there are two root forms within the same paradigm! For example, in pa-dakkel “to make big, enlarge” the root, dakkél, changes to dakl- because of suffixation.

pa-dakl-en (neutral)
p<in>a-dakkel (perfective)

The reverse of syncope, epenthesis, or the addition of vowels or consonants for euphony, is very rare. When the prefix, ipa-, is affixed to serrek “entrance, work”, the result is ipastrek “to have, allow to enter” where a “t” is inserted; ipaserrek, however, is a permissible form.

Different stems pose a problem. They can be listed along with the original root in the *.dic file, but each must be assigned the correct affixes. And, because sometimes the differing stems can occur in the same paradigm, the affix class must be split into sub-classes. Another approach is to “predict” the syllable that will be deleted using the rules in the *.aff file.

In my approach, I opted to list the reduced stem instead of over-burdening the *.aff file. As a compromise, when a paradigm uses both the full root and the reduced stem (or augmented stem, in the case of a stem with an epenthetic phoneme), each one is added as a separate entry with the affix classes that apply to it.
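To make the idea of separate entries concrete, here is a small sketch using the dakkel/dakl- pair from above. It is a simplification and not how Hunspell stores rules internally: each class is reduced to a (prefix, suffix) pair, the infixed form p<in>a- is treated as a plain prefix “pina”, and the flags “Pf” and “Pr” are hypothetical names invented for the example.

```python
# Hypothetical two-character flags, each mapped to (prefix, suffix) pairs.
AFFIX_CLASSES = {
    "Pf": [("pina", "")],   # perfective, built on the full root
    "Pr": [("pa", "en")],   # neutral form, built on the reduced stem
}

# The full root and the reduced stem are listed as separate entries,
# each with only the classes that apply to it.
DIC_ENTRIES = {
    "dakkel": ["Pf"],
    "dakl": ["Pr"],
}


def generate(entries, classes):
    """Return every affixed form licensed by the entries (bare stems excluded)."""
    words = set()
    for stem, flags in entries.items():
        for flag in flags:
            for prefix, suffix in classes[flag]:
                words.add(prefix + stem + suffix)
    return words


words = generate(DIC_ENTRIES, AFFIX_CLASSES)
print(sorted(words))
```

Because the class “Pr” is attached only to the reduced stem, a form such as *padakkelen is never produced.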

Reduplication

Reduplication is an integral process in Iloko and the morphologies of many of the Philippine languages. Partial reduplication, usually of the first syllable, is the most common means to produce a different form. In English, this occurs in only a few words and is not a very productive process, e.g. “zig-zag”, “ping-pong”, etc.

In the approach that I’ve chosen, roots that are entirely reduplicated (e.g. sakasaka “barefoot” < saka “foot”) are listed in the *.dic file mainly because they have a different meaning from the original root and do not follow any predictable patterns. Partially reduplicated stems, on the other hand, are not listed in the *.dic file. In another post, I will explain why I’ve not treated partially reduplicated stems as “stems”. Instead, they are treated as part of the affix – they are predominantly prefixes.

Affixes in Hunspell

The affixes or the bound morphemes of the target language reside in the Hunspell *.aff file in collections of rules called classes. In addition, the *.aff file contains language-specific settings and how affixes and stems are combined. But, the main focus here will be the classes.

Affix Classes

An affix class is a collection of related forms organized into rules which specify how they can combine with entries (roots or stems) found in the dictionary file (*.dic).

The first line of the class specifies its type, its flag or unique identifier, whether it can combine with other classes and the total number of rules in the class. The next line is the first rule of the class. Each line thereafter is a different rule, that is, a different form of the affix and how it is applied. A good example of this is the regular English plural, “-s”. Depending on the end of the word, the form is “-es” if the word ends in “s”, “x”, or “z”. Or, the form would be “-ies” if the word ends in “-y”. Otherwise, it’s just “-s”. Each of these would have its own rule.

Rules are formatted into columns separated by white space. The first column specifies one of the two types of affixes that Hunspell recognizes, “SFX” for “suffix” and “PFX” for “prefix”.

Column two of the rule is the flag or unique identifier. This is what is used with entries in the dictionary file. The flag can be any arbitrary upper- or lower-case letter (Hunspell distinguishes case), punctuation mark or symbol. A setting can enable two-letter flags or numbers.

Column three contains the letters to remove before the form is added. If nothing needs to be removed, it is just “0” (zero character). If the affix is a prefix, letters at the beginning of the stem or root are removed before adding the form; if it is a suffix, then letters are removed from the end.

Column four contains the form or allomorph of the affix as applied in the condition in column five. As in the English plural example this would be “-es”, “-ies” or “-s”.

Column five specifies, using a regular expression, the condition under which the form can be applied. Again, using the aforementioned example, “s”, “x” or “z” would be the condition for “-es”. Conditions where the form should not apply can also be supplied here.

Additional information can follow the fifth column in an optional sixth column, but it is only useful for debugging or adding comments. (Not shown in the example above)
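The example class that the next paragraph walks through is not reproduced in this text, so the sketch below reconstructs it from that description. The exact wording of each rule, and the “Y” in the header, are therefore assumptions. The code then splits each line into the columns described above; `Rule` and `parse_affix_class` are names made up for the illustration.

```python
from collections import namedtuple

Rule = namedtuple("Rule", "kind flag strip add condition")

# Reconstructed from the prose description; not copied from a real dictionary.
PLURAL_CLASS = """\
SFX S Y 4
SFX S y ies [^aeiou]y
SFX S 0 s [aeiou]y
SFX S 0 es [sxz]
SFX S 0 s [^sxyz]
"""


def parse_affix_class(text):
    """Split one affix class into its header fields and a list of rules."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    kind, flag, cross, count = lines[0]            # header: type, flag, Y/N, rule count
    rules = []
    for cols in lines[1:]:
        strip = "" if cols[2] == "0" else cols[2]  # "0" means nothing to remove
        add = "" if cols[3] == "0" else cols[3]
        condition = cols[4] if len(cols) > 4 else "."
        rules.append(Rule(cols[0], cols[1], strip, add, condition))
    if len(rules) != int(count):
        raise ValueError("header announces %s rules, found %d" % (count, len(rules)))
    return {"kind": kind, "flag": flag, "cross_product": cross == "Y", "rules": rules}


plural = parse_affix_class(PLURAL_CLASS)
for rule in plural["rules"]:
    print(rule)
```

Checking the rule count against the header is a cheap way to catch a rule that was added or deleted without updating the first line.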

In the example rule, the class (“S”) is specified by the first line. It is a very simplified class that applies the regular plural in English. The first rule deals with roots that end in “-y”; it excludes “vowel-y” combinations. The “y” in the third column specifies that “y” will be removed from the end of the stem before the form is added. The second rule says, “if the stem ends in a vowel and ‘-y’, just add ‘-s’”. Notice that no letters are removed (the “0”), as in the rest of the rules and unlike the first rule. The third rule shows how to deal with roots or stems that end in “-s”, “-x” or “-z”: add “-es”. And the fourth shows the default case, in other words, when the stem does not end in any of the letters in the square brackets. The caret (“^”) at the start of a bracketed list of letters means “NOT”.

[*.dic file]
3
pot/S
ax/S
party/S

The *.dic file stub above just shows how the affix is applied to a root or stem. According to the affix class, pots is valid, but *potes or *poties are invalid; axes is valid, but not *axs nor *axies; parties is valid, but not *partys nor *partyes.
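The claims above can be checked with a short simulation. This is only a sketch of the matching logic, not Hunspell itself: the rules are written as (strip, add, condition) triples reconstructed from the description of the class, and Python’s `re` module stands in for Hunspell’s condition matching.

```python
import re

# (strip, add, condition) triples for the plural class "S", reconstructed
# from the description of the rules; the exact rule text is an assumption.
PLURAL_RULES = [
    ("y", "ies", "[^aeiou]y"),  # consonant + y: drop "y", add "ies"
    ("", "s", "[aeiou]y"),      # vowel + y: just add "s"
    ("", "es", "[sxz]"),        # ends in s, x or z: add "es"
    ("", "s", "[^sxyz]"),       # default: add "s"
]

STEMS = ["pot", "ax", "party"]


def apply_suffix_rules(stem, rules):
    """Return the forms produced by every rule whose condition matches the stem."""
    forms = set()
    for strip, add, condition in rules:
        # A suffix condition is tested against the end of the stem.
        if re.search("(?:" + condition + ")$", stem) and stem.endswith(strip):
            base = stem[: len(stem) - len(strip)]
            forms.add(base + add)
    return forms


accepted = set(STEMS)
for stem in STEMS:
    accepted |= apply_suffix_rules(stem, PLURAL_RULES)

for word in ["pots", "potes", "axes", "axs", "parties", "partys"]:
    print(word, word in accepted)
```

Each stem matches exactly one rule, which is what keeps forms like *potes and *partys out of the accepted set.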

In the Hunspell dictionaries for English -- there were several for different areas and fields -- there were under 20 affix classes. For languages such as Spanish, there are a few more classes and a long list of rules. As for Hungarian, there were far more classes and hundreds of rules for some of the classes… Yes, hundreds! Hungarian, by the way, is the reason why this engine is called “Hunspell”: it was a fork of a previous engine, made to better handle Hungarian, which is known for its extensive agglutinating morphology.

As for my experiments with Iloko, however, the number of affix classes has run into the hundreds depending on the approach taken. In one iteration, the *.aff file reached 20+ MB. I have been experimenting and have drastically reduced the file size to under four MB with the latest iteration.

[Update: I have since changed my initial approach to keep the complexity of the affix classes and the rules down. You can read more here.]

[Update: I have further changed my approach and have reduced the “bloat”.]

Spell Checker for Iloko

Iloko has made its presence felt on the Internet in the form of blogs (mannurat.com), web sites (iluko.com), online dictionaries and Facebook (Ilocano.org). But the one space in this digital world where I have seen very little of it is software.

I’ve posted about Ultradefrag in the past and it is the only application that has a localized UI (User Interface) available in Iloko. So, there are some inroads and an example for others to follow. In addition to Iloko, the application also has a Waray-Waray option. I’ve written about helping in localizing Mozilla Firefox into Iloko, but for the time being that is on hold.

Recently, I was using Facebook in Google Chrome, typing my comment in Iloko. If you take a look at the screen capture, every word in my comment is underlined in red! Why? Chrome does not recognize the language, and Iloko cannot be selected in the Settings. The only word recognized is “Iloko”, which I added to the custom internal dictionary. Lo and behold! Filipino (A.K.A. “Tagalog”) is available, but not for spell checking. I checked to see if there was an Iloko extension, but none is available. I investigated further and found out that the spell checker used in Chrome is Hunspell. Sadly, a dictionary for Iloko is not available. So, the “red squigglies” will continue under words as I type. This irritation (and I imagine that it irks other Iloko speakers) prompted me to investigate how to create a spelling dictionary for Iloko and Hunspell.

Why Hunspell?

Hunspell is free and open-source software. It is not a stand-alone program but a set of libraries that can be incorporated into applications. Among them are:

OpenOffice.org – Free and open-source “office” suite. Similar to Microsoft Office. A Tagalog spell-checker extension can be found on their site.
LibreOffice – An offshoot of OpenOffice.org, but more “cutting edge”; the two are able to share dictionaries.
Google Chrome – A popular web browser.
Mozilla Firefox – Another popular web browser.
Mac OS X – The Mac Operating System.
SDL Trados – Software to help localize and translate software.

Creating a dictionary is relatively “easy”. The “dictionary” actually is composed of two files, an affix file (*.aff) that contains the affixes, and a dictionary file (*.dic) that contains a list of roots and stems. Each entry in the dictionary file references affix classes or “rules” in the *.aff file. In other words, Hunspell will “figure out” all the possible “words”. Words that it cannot determine are “incorrect”, so it is crucial to get the rules right. Each file is named according to the target language code, e.g. “ilo.aff” and “ilo.dic” for Iloko.

With the ease of creating the necessary files and the wide-spread use of Hunspell, it’s quite possible to say that after creating an Iloko dictionary, the dictionary can be used with many popular applications.

One thing to note is that Hunspell just checks spelling. It does not check syntax. So, I can type a string of words that have no association with one another and only their spellings will be checked.

Next, I’ll talk about each of the files, the *.aff file and the *.dic file, and some of the issues I’ve encountered while attempting to create a spelling dictionary.