Tuesday, June 16, 2015

How to Simulate Reduplication in Hunspell

Reduplication is a process where a part of or the whole root or stem is repeated. It is used for both inflection and derivation in Iloko.

In my approach, reduplicated whole roots or stems are listed as different entries in the *.dic file. First of all, creating a rule to predict the reduplicated form of a whole root or stem is not practical. Second, words derived from this type of reduplication can change lexical categories and meaning, i.e., derivation. For example, saka-saka (adjective) “barefoot” from saka (noun) “foot” is listed as an entry in the *.dic file and it is tagged with the affix classes that can be applied to it.

saka/xy
saka-saka/ab

Hunspell does not have a mechanism for reduplication, whole or partial. It is only capable of dealing with what it considers prefixes and suffixes. There had to be a way to reproduce reduplication using its framework of rules! Unfortunately, this limitation has deterred at least one person’s attempt at creating a spell-check dictionary for Tagalog as shown in the following thread: helping to implement a grammar checker...[sic].

Nevertheless, reduplication is possible if you “think outside the box”. To simulate the process, we must use the same approach used for infixation: create “compound prefixes”!

Partial word reduplication centers around the first syllable of the root (at least in Iloko and Tagalog). And, it is the syllable that we have to write the rules for. The first syllable can be any one of the following types: V, VC, CV and CVC. (V = vowel; C = consonant).

5 vowels
14 final consonants
25 initial consonants and consonant clusters
? inflectional forms (depends on affix)

All in all, if you do the math, there are 1,750 “possible” syllables, or 1,750 rules that can be created to simulate reduplication! Imagine multiplying that figure by the various inflectional verb forms and that figure can double to 3,500. And, that would be one class! It’s because of this number that I’ve devised automation to assist in creating affix classes.
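The arithmetic behind that figure can be checked in a few lines. This is only a sketch: it assumes the 1,750 comes from multiplying the three inventories listed above (onset × vowel × coda), and the factor of two for inflectional forms is an illustrative assumption taken from the word “double”.

```python
# Inventory sizes quoted in the list above.
VOWELS = 5
FINAL_CONSONANTS = 14
INITIAL_ONSETS = 25  # single consonants and consonant clusters

# One rule per closed syllable: onset x vowel x coda.
closed_syllables = INITIAL_ONSETS * VOWELS * FINAL_CONSONANTS

# Assumption: two inflectional forms per syllable, which is what
# "double" implies; the real factor depends on the affix.
FORMS_PER_SYLLABLE = 2
rules_in_class = closed_syllables * FORMS_PER_SYLLABLE

print(closed_syllables, rules_in_class)
```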

Another approach is to create another entry in the dictionary with the reduplicated part of the word.

Example:
1) saka
2) saksaka
3) sasaka
4) saka-saka
5) saksaka-saka
6) sasaka-saka

Number one would be the basic form, so any affixes that attach to the plain root are assigned to it. Number two can be used in certain verb forms or it can be used as the distributive plural. Number three can be used for certain verb forms that require only CV reduplication, for example. And, number four is a lexicalized entry that means “to go barefoot”. The other forms can then be used for forms based on “saka-saka”.

What is nice about this approach is that the affix file no longer has to “guess” reduplicated syllables or mutations that may occur because of phonological processes. The con is that the necessary stems need to be created and the appropriate affix flags have to be associated with the other forms which can tax someone who might be adding entries to the dictionary file.
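Creating those extra stems does not have to be done entirely by hand. The following is a minimal sketch, not the tool I actually use: the function names are hypothetical, it only knows the five orthographic vowels, and it treats every letter as one sound (so the digraph “ng” is not handled). Affix flags still have to be assigned to each form afterwards.

```python
VOWELS = "aeiou"

def first_syllable_parts(root):
    """Split off the onset, the vowel and the consonant that follows it."""
    i = 0
    while i < len(root) and root[i] not in VOWELS:
        i += 1
    onset = root[:i]
    vowel = root[i:i + 1]
    nxt = root[i + 1:i + 2]
    following = nxt if nxt and nxt not in VOWELS else ""
    return onset, vowel, following

def stem_variants(root):
    """Return the six stems from the list above, in the same order."""
    onset, vowel, following = first_syllable_parts(root)
    cv = onset + vowel               # e.g. "sa"
    cvc = onset + vowel + following  # e.g. "sak"
    full = root + "-" + root         # whole-root reduplication
    return [root, cvc + root, cv + root, full, cvc + full, cv + full]

for stem in stem_variants("saka"):
    print(stem)
```

Run on saka, this prints the same six forms as the numbered list, which can then be pasted into the *.dic file and tagged.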

Circumfixes in Hunspell

Circumfixes are a pair of affixes, usually a prefix and a suffix and in some instances in Iloko, an infix and suffix, that must co-occur. Hunspell has the capability of recognizing circumfixes. Simply assign a special flag, and when specifying rules, assign the circumfix flag to the rules of the prefix and assign both the flag of the prefix and the circumfix flag to the rules for the suffix. Simple, huh? Let’s take a look at an example: the location circumfix pag><an.

[*.aff file]
# Circumfix flag. This flag will be used to mark affixes as members of a circumfix
CIRCUMFIX X
PFX P   Y    1
PFX P   0    pag/X .
SFX S   Y     1
SFX S   0     an/PX .

[*.dic file]
1
adal/S

“X” is used to designate circumfix pairs throughout the *.aff file. The prefix, pag, is assigned the flag “P” and its single rule has a “continuation” flag, the flag after the “/” following the form of the prefix. The flag of the suffix of the pair is “S”. Its only rule is similarly marked, but it also carries the flag of the prefix. In the dictionary, only the suffix flag is used with the root adal “to learn”.

If we were to try this out on some “words”, we would see results similar to the following when using the specifications above.

pagadal [false]
adalan [false]
pagadalan [true]

Just using either the prefix or suffix alone results in invalid words. The only “correct” form is the third.
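The co-occurrence requirement can be modelled with a toy check. This is not how Hunspell works internally; it is only a sketch of the behaviour described above, with a hypothetical function name, and it assumes the bare root is also accepted because it is listed in the *.dic file.

```python
def accepts(word, roots, prefix="pag", suffix="an"):
    """Toy circumfix check: the two halves are only valid together."""
    if word in roots:
        return True  # assumption: the listed root is a word by itself
    if word.startswith(prefix) and word.endswith(suffix):
        core = word[len(prefix):len(word) - len(suffix)]
        return core in roots
    return False

ROOTS = {"adal"}
for candidate in ("pagadal", "adalan", "pagadalan"):
    print(candidate, accepts(candidate, ROOTS))
```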

So, the above three are the main over-arching considerations that I encountered while creating a Hunspell dictionary. And, as I’ve moved along and stepped back only to move forward again, I find further details that make the task challenging.

How to Simulate Infixation in Hunspell

Infixation is where the affix is “inserted” into the stem, usually between the first consonant (if there is one) and the vowel of the first syllable. In Iloko and other Philippine languages this type of affix is very productive and occurs in many paradigms of many of the lexical categories.

The only affix types that Hunspell recognizes, however, are “prefix” and “suffix”, in other words, affixing to the left and to the right of the stem. Nevertheless, rules can be written to simulate the process.

In Iloko it is rather simple: if the syllable begins with a consonant, insert the infix between it and the vowel, otherwise, treat it like a prefix.

Example
root: sarita – talk, speech
s<um>arita

root: andar – to run (of machines), to function, to operate
um- andar

The maximal syllable in Iloko is CVC, which makes infixation simple. But, with the adoption of Spanish and English loans, there are syllables that begin with two or more consonants. For example, prito (from the past participle of Spanish freir, lit. “fried”) is a commonly used word in Iloko and Tagalog. Iloko’s strategy for infixation is to insert before the vowel, i.e. prinito. Initial clusters thus become another consideration.

Hunspell rules have to be written in such a way that it appears as if the first consonant (if the first syllable has one) and the infix are a prefix.

PFX I Y 25
PFX I 0 um [aeiou]
PFX I b bum b
PFX I d dum d
PFX I g gum g
. . .
PFX I t tum t

The first rule is straightforward:

If the root begins with a vowel, treat the infix as a prefix and attach it to the left side of the root.

So, with uli “to ascend, go up” the result is umuli. But, the remaining rules show how to deal with roots and stems that begin with consonants.

The value in the third “column” is the set of characters to remove from the beginning of the root: “b” and “d” in the first two consonant rules. We want to remove it because we will replace the initial letter with what is in the fourth column, e.g. “bum”, the initial consonant with the infix in place. The fifth column specifies the condition under which we want the rule to apply. As expected, for the “bum” rule it is “b”.

root: takder – to stand

1) akder (remove the ‘t’)
2) tum (assemble the pseudo-prefix)
3) tumakder (add to root’s left-word edge, the beginning)

With this strategy, rules for simulating the process of infixation can be written and accounted for. The number of rules needed, however, is determined by the number of possible onsets in Iloko: 14 single consonants and 10 clusters, so 24 distinct rules for consonant-initial roots, plus the one rule for vowel-initial roots.
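Because every consonant rule has the same shape, the class is a good candidate for generation. The sketch below is illustrative only: the function names are hypothetical and the onset lists passed in are small samples, not the full Iloko inventory. One function writes the rule lines in the layout used above; the other applies the same logic to a root so the result can be checked.

```python
VOWELS = "aeiou"

def infix_class(flag, infix, onsets):
    """Build the PFX lines that fake an infix as onset + infix."""
    lines = [
        f"PFX {flag} Y {len(onsets) + 1}",
        f"PFX {flag} 0 {infix} [{VOWELS}]",  # vowel-initial roots
    ]
    lines += [f"PFX {flag} {c} {c}{infix} {c}" for c in onsets]
    return lines

def apply_infix(root, infix, onsets):
    """Strip the onset, then prepend onset + infix (the pseudo-prefix)."""
    if root[0] in VOWELS:
        return infix + root
    # Try longer onsets first so that "pr" wins over "p".
    for onset in sorted(onsets, key=len, reverse=True):
        if root.startswith(onset):
            return onset + infix + root[len(onset):]
    raise ValueError(f"no rule for the onset of {root!r}")

print("\n".join(infix_class("I", "um", ["b", "d", "g", "t"])))
print(apply_infix("takder", "um", ["b", "d", "g", "t"]))
```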

Thursday, July 3, 2014

A Change in Course

Background

Previously, I had in mind “predicting” how the first syllable of a root is reduplicated. It seemed simple enough, just figure out what can be an onset or onsets (Spanish and English loans), what the vowels are (“a”, “e”, “i”, “o” and “u”, orthographically) and what can be codas (“h” and clusters excluded). But, as I automated the process of creating the possible syllables in Iloko, I was astounded by the sheer number of rules each affix class would require.

Hunspell doesn’t have the mechanism to properly analyze reduplication. So, I devised a way in which it would be able to analyze reduplicated syllables: Fuse the prefix and the reduplicated syllable into one pseudo-prefix!

Example: agbasbasa “he or she is reading/learning”

In the example above, the first syllable (“bas”) of the root basa (“to read/learn”) is reduplicated and the verb prefix “ag-” is attached. Likewise, other verb prefixes follow the same pattern. (Iloko and Tagalog, for that matter, are prefix heavy.) This made it necessary to create a rule that had a form of the affix and the reduplicated target syllable that matched. Hunspell would analyze the words as “agbas-” (prefix) and “basa” (root/stem).

PFX X 0 agbas bas ag+[RED: bas]
. . .
PFX X 0 agwuw wuw ag+[RED: wuw]

Iloko syllables can be minimally V and maximally C₁C₂VC, in other words, anywhere between a vowel and a closed syllable with an onset of consonants. In the history of the language, however, CVC was the maximal syllable. It was only when Spanish words entered the lexicon that C₁C₂-type syllables became common. The C₂ is commonly /l/ or /r/.

First Approach

My first thought was to create an affix class of just reduplicated syllables. Any prefix could just reference this class and it would not be necessary to generate the same syllables for each and every prefix. Hunspell is capable of analyzing two prefixes or two suffixes, but not both. In addition, some affixes could only be used with a restrictive set of other affixes and circumfixes posed a problem because prefix X could only occur with suffix Y and with another set of suffixes. So, I opted to automate the process of creating rules for all the “possible” syllables combined with the affix.

Writing the automation was rather straightforward: Generate rules with the prefix and V, VC, CV, CCV, CVC and CCVC syllables. After just one affix class, there were well over a thousand rules! I scanned rule after rule and wondered if there were any roots that began with a syllable such as “yoy” or “bluw”. I haven’t performed an analysis of the structure of Iloko roots, but I doubt that any exist. The automation produced doubtful forms as a precaution, even if they eventually were not used.
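For readers who want to see the shape of such automation, here is a rough sketch. It is not my actual script: the function names are hypothetical and the inventories in the demo are tiny samples chosen for illustration. It enumerates the six syllable shapes and fuses each one with a prefix, using the same rule layout as the examples above.

```python
from itertools import product

def syllables(onsets, clusters, vowels, codas):
    """Yield every V, VC, CV, CCV, CVC and CCVC combination."""
    shapes = [
        (vowels,),                  # V
        (vowels, codas),            # VC
        (onsets, vowels),           # CV
        (clusters, vowels),         # CCV
        (onsets, vowels, codas),    # CVC
        (clusters, vowels, codas),  # CCVC
    ]
    for parts in shapes:
        for combo in product(*parts):
            yield "".join(combo)

def fused_prefix_rules(flag, prefix, sylls):
    """One pseudo-prefix rule per syllable: prefix + reduplicated syllable."""
    return [f"PFX {flag} 0 {prefix}{s} {s} {prefix}+[RED: {s}]" for s in sylls]

# Sample inventories only; the real lists are much longer.
demo = list(syllables(["b", "w"], ["bl"], ["a", "u"], ["s", "w"]))
rules = fused_prefix_rules("X", "ag", demo)
print(len(rules))
print(rules[demo.index("bas")])
```

With inventory sizes of 14 single onsets, 10 clusters, 5 vowels and 14 codas, the same six shapes would give 5 + 70 + 70 + 50 + 980 + 700 = 1,875 rules for a single prefix, which is in line with the “well over a thousand” I saw.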

Prefixes with Final Nasals

Prefixes that have final nasals, such as “mang-” and “pang-”, further complicate the process of predicting the form of the reduplicated syllable. Typically, the final nasal, “-ng”, of these prefixes assimilates to the point of articulation of the initial consonant of the root (if possible). Roots that start with vowels were exempt. With some roots, however, not only did the nasal assimilate but the onset would become said nasal resulting in a geminate.

pang- + dait > pan- + dait (nasal assimilation) > pannait (consonant becomes nasal).

Or, in some cases, the geminated consonant simplified.

pang- + sao > pan- + sao > pan- + nao > panao

Assimilation affected the first syllable in non-reduplicated forms. But, when reduplicated, the reduplicated syllable was affected, while the original syllable was unaffected, e.g. “pan(n)adait” (Rubino 2000). Whether gemination or deletion of the initial consonant occurred was unpredictable, so rules for both were added, again, as a precaution since it “could” be possible and increased the number of rules.
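Since the outcome could not be predicted, the automation had to emit every candidate. The sketch below shows the idea for non-reduplicated forms only. The function name is hypothetical, and the table mapping root-initial consonants to nasals is my assumption about place of articulation for illustration; it is not a complete description of Iloko.

```python
# Assumed mapping from the root's first consonant to the assimilated nasal.
NASAL_FOR = {"p": "m", "b": "m", "t": "n", "d": "n", "s": "n", "k": "ng", "g": "ng"}

def nasal_candidates(prefix, root):
    """Return the candidate surface forms for a prefix ending in -ng."""
    if not prefix.endswith("ng"):
        raise ValueError("expected a prefix ending in 'ng'")
    base = prefix[:-2]
    nasal = NASAL_FOR.get(root[0])
    if nasal is None:  # vowels and unlisted consonants: no change
        return {"plain": prefix + root}
    return {
        "assimilated": base + nasal + root,           # pan- + dait
        "geminate": base + nasal + nasal + root[1:],  # pannait
        "simplified": base + nasal + root[1:],        # geminate reduced
    }

print(nasal_candidates("pang", "dait"))
print(nasal_candidates("pang", "sao"))
```

Each candidate then needs its own rule, which is one reason the rule count kept growing.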

The few issues I faced complicated the automation. In one day I spent two hours trying to debug it. Yet, it was well worth it and it took less than a minute to produce an affix file. Imagine having to write so many rules without error and by hand!

Line Count: 80,000!

As I added more affix classes, I noticed that the number of lines steadily increased. With perhaps 60–70 of the verb affixes automated, the affix file had grown to 80,000 lines! The file itself was an astounding 7 MB with comments and 3 MB without. I compared other affix files for other languages and found that many were smaller. Hungarian, known for its agglutinative morphology, has a similar affix file with about 90,000 lines. But, with only 60–70 of the affix classes complete, I knew that the Iloko file would reach almost 150,000 lines and possibly 7 MB (without comments). When I tested the affix file against a few entries in the dictionary file, I noticed that there was lag. So, after more consideration, I’ve decided to take another approach.

Reasons for a Bloated Affix File

Initially, my reason for predicting the reduplicated syllable in the affix class was to reduce the number of roots that were needed and reduce the number of affix classes. Each root in the dictionary file can only take certain affixes. So, by reducing their number it was easier to assign the right affix to the proper root. But, I’ve found that certain issues had become more and more apparent as the affix file grew.

The classes grew astoundingly in the number of rules. It became difficult to properly check them or investigate if a rule did not “behave” as expected. Reduplication was difficult to debug.

With only a few entries in the dictionary file, the analyzer took longer to parse a small set of test words. I had a nagging feeling that overall performance would be affected. If this can happen with only a handful of *.dic file entries, can you imagine what would happen with several thousand? The rules for the “doubtful” syllables might be slowing analysis down.

After comparing other affix files for other languages, 3MB did not seem acceptable.

In the end, I decided to move the reduplication into the dictionary file. The reduplicated stem reduced the complexity and number of rules in the affix file, but it shifted the “responsibility” of explicitly specifying the reduplicated stem to the dictionary, thereby dividing the work.

Wednesday, October 23, 2013

Roots and Stems in Hunspell

As I wrote in another post, a Hunspell dictionary is the way to go in order to create a spell-check dictionary for Iloko. Not only is implementing one relatively “easy”, but there are a plethora of languages that have dictionaries available. Because the dictionaries are open, I can gain insight about how they created their rules and apply the same techniques to Iloko.

According to the Hunspell manual for creating affix files and dictionary files, the first line of a dictionary file is an approximate number of entries and the rest of the lines are entries of stems and/or roots followed by the flags of the affix classes that are applicable. Again, the name of each file is the code for the target language, for example, “ilo.aff” and “ilo.dic” for Iloko. More information about the affix file (*.aff) can be found in this entry. But, for now we will focus on the dictionary (*.dic) file and some of the issues and considerations in creating one.

Format

The format of the entries in the *.dic file is quite simple: entries and their attributes are written on one line.

Example:

7
adda/ABC
ama/XYZ
ina/XYZ
anak/XYZ
puso/XYZ
magna/AGH
saan/KLM

In the example above, there are seven entries and each entry rests on a single line. The valid affix classes that apply to the entry follow the form after a slash (“/”). By default each affix class flag is only one character long as in the example, e.g. “A”, “B” and “C” for the first entry. Two-character flags or numbers can also be used as affix class flags. Their use is signaled in the *.aff file.
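To make the format concrete, here is a small sketch of a reader for such a file. It is not part of Hunspell and the function names are hypothetical. The mode names mirror the “FLAG long” and “FLAG num” settings of the *.aff file, and the sketch assumes plain “stem/flags” lines with nothing after the flags.

```python
SAMPLE = """7
adda/ABC
ama/XYZ
ina/XYZ
anak/XYZ
puso/XYZ
magna/AGH
saan/KLM
"""

def split_flags(flags, mode="short"):
    """Split the text after the slash into individual flags."""
    if not flags:
        return []
    if mode == "long":  # two characters per flag
        return [flags[i:i + 2] for i in range(0, len(flags), 2)]
    if mode == "num":   # comma-separated numbers
        return flags.split(",")
    return list(flags)  # default: one character per flag

def parse_dic(text, mode="short"):
    """Return the declared entry count and a stem -> flags mapping."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0])
    entries = {}
    for line in lines[1:]:
        stem, _, flags = line.partition("/")
        entries[stem] = split_flags(flags, mode)
    return declared, entries

declared, entries = parse_dic(SAMPLE)
print(declared, len(entries), entries["adda"])
```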

Although I have not reached the maximum number of single-character flags, I can imagine that the number of Iloko affixes, especially prefixes and reduplication (I will explain later), can reach that maximum. As a result, in embarking on creating a spelling dictionary, I enabled two-character flags.

Issues to Consider

As entries are added to the *.dic file, there are a few issues to consider. The first two have parallels in English, but the last one is specific to Iloko and possibly to other Philippine-type languages.

Categorization

In English each word belongs to one, two or more parts of speech or lexical categories, such as “nouns”, “verbs” and “prepositions”. Iloko roots and stems can be categorized in the same way, but they are more fluid. The root tennis is a noun or can be used as an adjective, e.g. tennis shoes. However, unlike most nouns it cannot take the “-(e)s” plural *tennises, because in practice there is only one sport called “tennis”. Iloko has borrowed the word as tenis, and it can be classed as a noun as well, but it can be verbalized with the prefix ag-, e.g. agtenis “to play tennis”, an insight into the versatility of Iloko.

There are many such roots in Iloko which are nouns that can become verbs through affixation. Even words that would normally not have an “affix” may take them, e.g. wen “yes” versus wensa “maybe yes”. Technically “-sa” is an enclitic. Hunspell treats anything written after the stem or root as a suffix.

Great care must be taken to assign the right affix class to a root because its semantics must be considered. Just because one verbal affix can be used with one nominal root, does not mean that it will work with all.

Differing Stems

In English, the past tense of sing is sang because of ancient Germanic ablaut or apophony. Ablaut can also be used to change the lexical category of the word, such as abode (noun) and abide (verb). Luckily for English speakers, sang does not have any further terminations, and the pair abode/abide are members of differing categories.

Iloko has roots that change as well. In Iloko, however, syncope, loss of a unit of sound, is the most common because of shifts in stress after affixation. In a few cases, there are two root forms within the same paradigm! For example, in pa-dakkel “to make big, enlarge” the root, dakkél, changes to dakl- because of suffixation.

pa-dakl-en (neutral)
p<in>a-dakkel (perfective)

The reverse of syncope, epenthesis, or the addition of vowels or consonants for euphony, is very rare. When the prefix, ipa-, is affixed to serrek “entrance, work”, the result is ipastrek “to have, allow to enter” where a “t” is inserted; ipaserrek, however, is a permissible form.

Different stems pose a problem. They can be listed along with the original root in the *.dic file, but each must be assigned the correct affixes. And, because sometimes the differing stems can occur in the same paradigm, the affix class must be split into sub-classes. Another approach is to “predict” the syllable that will be deleted using the rules in the *.aff file.

In my approach, I opted to list the reduced stem instead of over-burdening the *.aff file. But, as a compromise, where affix classes use both the full root and the reduced stem (or augmented stem, in the case of a stem with an epenthetic phoneme), the alternate stem is added as a separate entry with the appropriate affix classes that apply.

Reduplication

Reduplication is an integral process in Iloko and the morphologies of many of the Philippine languages. Partial reduplication, usually of the first syllable, is the most common means to produce a different form. In English, this only occurs in very few words and is not part of a very productive process, e.g. “zig-zag”, “ping-pong”, etc.

In the approach that I’ve chosen, roots that are entirely reduplicated (e.g. sakasaka “barefoot” < saka “foot”) are listed in the *.dic file mainly because they have a differing meaning than the original root and because they do not follow any predictable patterns. Partially reduplicated stems, on the other hand, are not listed in the *.dic file. In another post, I will explain why I’ve not treated partially reduplicated stems as “stems”. Instead, they are treated as part of the affix – they are predominantly prefixes.

Affixes in Hunspell

The affixes or the bound morphemes of the target language reside in the Hunspell *.aff file in collections of rules called classes. In addition, the *.aff file contains language-specific settings and how affixes and stems are combined. But, the main focus here will be the classes.

Affix Classes

Affix classes are a collection of related forms organized into rules which specify how they can combine with entries (roots or stems) found in the dictionary file (*.dic).

The first line of the class specifies its type, its flag or unique identifier, whether it can combine with other classes and the total number of rules in the class. The next line is the first rule of the class. Each following line thereafter is a different rule and a different form of the affix and how it is applied. A good example of this is the regular English plural, “-s”. Depending on the end of the word, the form is “-es” if the word ends in “s”, “x”, or “z”. Or, the form would be “-ies” if the word ends in “-y”. Otherwise, it’s just “-s”. Each of these would have its own rule.

Rules are formatted into columns separated by white space. The first column specifies one of the two types of affixes that Hunspell recognizes, “SFX” for “suffix” and “PFX” for “prefix”.

Column two of the rule is the flag or unique identifier. This is what is used with entries in the dictionary file. The flag can be any arbitrary upper- or lower-case letter (Hunspell distinguishes case), punctuation mark or symbol. A setting can enable two-letter flags or numbers.

Column three contains the letters to remove before the form is added. If nothing needs to be removed, it is just “0” (zero character). If the affix is a prefix, letters at the beginning of the stem or root are removed before adding the form; if it is a suffix, then letters are removed from the end.

Column four contains the form or allomorph of the affix as applied in the condition in column five. As in the English plural example this would be “-es”, “-ies” or “-s”.

Column five specifies, using a regular expression, the condition where the form can be applied. Again, using the aforementioned example, “s”, “x” or “z” would be the conditions. Conditions where the form should not apply can also be supplied here.

Additional information can follow the fifth column in an optional sixth column, but it is only useful for debugging or adding comments. (Not shown in the example above)

In the example rule, the class (“S”) is specified by the first line. It is a very simplified class that applies the regular plural in English. The first rule deals with roots that end in “-y”; it excludes “vowel-y” combinations. The “y” in the third column specifies that “y” will be removed from the end of the stem before the form is added. The second rule says, “if the stem ends in a vowel and ‘-y’, just add ‘-s’”. Notice that, unlike the first rule, no letters are removed (the “0”); the same is true of the remaining rules. The third rule shows how to deal with roots or stems that end in “-s”, “-x” or “-z”: add “-es”. And, the fourth shows the default case, in other words, when the stem does not end in any of the letters in the square brackets. The caret (“^”) at the start of a bracketed list of letters means “NOT”.

[*.dic file]
1
pot/S
ax/S
party/S

The *.dic file stub above just shows how the affix is applied to a root or stem. According to the affix class, pots is valid, but *potes or *poties are invalid; axes is valid, but not *axs nor *axies; parties is valid, but not *partys nor *partyes.
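The behaviour just described can be tried out with a toy rule applier. Two caveats: the four-rule class embedded below is a reconstruction from the prose description of class “S”, not a copy of any real English affix file, and the code is a simplified model with hypothetical function names, not Hunspell itself.

```python
import re

# Reconstructed from the description of class "S"; an assumption.
AFF = """\
SFX S Y 4
SFX S y ies [^aeiou]y
SFX S 0 s [aeiou]y
SFX S 0 es [sxz]
SFX S 0 s [^sxyz]
"""

def parse_rules(aff_text, flag):
    """Collect (strip, add, condition) for each rule line of a suffix class."""
    rules = []
    for line in aff_text.splitlines():
        fields = line.split()
        # The class header has four fields; rule lines have at least five.
        if len(fields) >= 5 and fields[0] == "SFX" and fields[1] == flag:
            rules.append((fields[2], fields[3], fields[4]))
    return rules

def expand(stem, rules):
    """Return the stem plus every form the rules license for it."""
    forms = {stem}
    for strip, add, cond in rules:
        if re.search(cond + "$", stem):  # condition anchors at the end
            base = stem if strip == "0" else stem[:-len(strip)]
            forms.add(base + add)
    return forms

RULES = parse_rules(AFF, "S")
for stem in ("pot", "ax", "party"):
    print(stem, sorted(expand(stem, RULES)))
```

Any form that is not in the expanded set for some listed stem, such as *potes or *partys, would be flagged as misspelled.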

In the Hunspell dictionaries for English -- there were several for different areas and fields -- there were under 20 affix classes. For languages, such as Spanish, there are a few more classes and a long list of rules. As for Hungarian, there were far more classes and hundreds of rules for some of the classes… Yes, hundreds! Hungarian, by the way, is the reason why this engine is called “Hunspell” because it was a fork of a previous engine to better handle Hungarian which is known for its extensive agglutinating morphology.

As for my experimentations for Iloko, however, the number of affix classes has numbered into the hundreds depending on the approach taken. In one iteration, the *.aff file reached 20+ MB. I have been experimenting and have drastically reduced the file size to under four MB with the latest iteration.

[Update: I have since changed my initial approach to keep the complexity of the affix classes and the rules down. You can read more here.]

[Update: I have further changed my approach and have reduced the “bloat”.]

Spell Checker for Iloko

Iloko has made its presence on the Internet in the form of blogs (mannurat.com), web sites (iluko.com), online dictionaries and Facebook (Ilocano.org). But, the one space in this digital world where I have seen very little of it is software.

I’ve posted about Ultradefrag in the past and it is the only application that has a localized UI (User Interface) available in Iloko. So, there are some inroads and an example for others to follow. In addition to Iloko, the application also has a Waray-Waray option. I’ve written about helping in localizing Mozilla Firefox into Iloko, but for the time being that is on hold.

Recently, I was using Facebook in Google Chrome typing my comment in Iloko. If you take a look at the screen capture, every word in my comment is underlined in red! Why? Chrome does not recognize the language and in the Settings Iloko cannot be selected. The only word recognized is “Iloko” which I added to the custom internal dictionary. Lo and behold! Filipino (A.K.A. “Tagalog”) is available, but not for spell checking. I checked to see if there was an Iloko extension, but none is available. I investigated further. I found out that the spell check used in Chrome is Hunspell. Sadly, a dictionary for Iloko is not available. So, the “red squigglies” will continue under words as I type. This irritation (and I imagine that it irks other Iloko speakers) prompted me to investigate how to create a spelling dictionary for Iloko and Hunspell.

Why Hunspell?

Hunspell is free and open-source software. It is not a stand-alone program but a set of libraries that can be incorporated into applications. Among them are:

OpenOffice.org – Free and open-source “office” suite. Similar to Microsoft Office. A Tagalog spell-checker extension can be found on their site.
LibreOffice – An offshoot of OpenOffice.org, but more “cutting edge” so they are able to share dictionaries.
Google Chrome – A popular web browser.
Mozilla Firefox – Another popular web browser.
Mac OS X – The Mac Operating System.
SDL Trados – Software to help localize and translate software.

Creating a dictionary is relatively “easy”. The “dictionary” actually is composed of two files, an affix file (*.aff) that contains the affixes, and a dictionary file (*.dic) that contains a list of roots and stems. Each entry in the dictionary file references affix classes or “rules” in the *.aff file. In other words, Hunspell will “figure out” all the possible “words”. Words that it cannot determine are “incorrect”, so it is crucial to get the rules right. Each file is named according to the target language code, e.g. “ilo.aff” and “ilo.dic” for Iloko.

With the ease of creating the necessary files and the widespread use of Hunspell, it’s quite possible to say that after creating an Iloko dictionary, the dictionary can be used with many popular applications.

One thing to note is that Hunspell just checks spelling. It does not check syntax. So, I can type a string of words that have no association with one another and only their spellings will be checked.

Next, I’ll talk about each of the files, the *.aff file and the *.dic file, and some of the issues I’ve encountered while attempting to create a spelling dictionary.