Thursday, July 3, 2014

A Change in Course

Background

Previously, I had in mind “predicting” how the first syllable of a root is reduplicated. It seemed simple enough: just figure out what can be an onset or onset cluster (the clusters come from Spanish and English loans), what the vowels are (“a”, “e”, “i”, “o” and “u”, orthographically) and what can be codas (“h” and clusters excluded). But, as I automated the process of creating the possible syllables in Iloko, I was astounded by the sheer number of rules each affix class would require.

Hunspell doesn’t have the mechanism to properly analyze reduplication. So, I devised a way in which it would be able to analyze reduplicated syllables: Fuse the prefix and the reduplicated syllable into one pseudo-prefix!

Example: agbasbasa “he or she is reading/learning”

In the example above, the first syllable (“bas”) of the root basa (“to read/learn”) is reduplicated and the verb prefix “ag-” is attached. Other verb prefixes follow the same pattern. (Iloko and Tagalog, for that matter, are prefix heavy.) This made it necessary to create, for each prefix, one rule per possible reduplicated syllable: the rule adds the fused form and applies only to roots that begin with the matching syllable. Hunspell would analyze the word as “agbas-” (prefix) and “basa” (root/stem).

PFX X 0 agbas bas ag+[RED: bas]
. . .
PFX X 0 agwuw wuw ag+[RED: wuw]

Iloko syllables can be minimally V and maximally C1C2VC, in other words, anywhere between a lone vowel and a closed syllable with a two-consonant onset. Historically, however, CVC was the maximal syllable. Only when Spanish words entered the lexicon did C1C2-type onsets become common. The C2 is commonly /l/ or /r/.

First Approach

My first thought was to create an affix class of just reduplicated syllables. Any prefix could reference this class, and it would not be necessary to generate the same syllables for each and every prefix. However, Hunspell is capable of analyzing two prefixes or two suffixes, but not both. In addition, some affixes could only be used with a restricted set of other affixes, and circumfixes posed a problem because prefix X could only occur with suffix Y and with another set of suffixes. So, I opted to automate the process of creating rules for all the “possible” syllables combined with the affix.

Writing the automation was rather straightforward: Generate rules with the prefix and V, VC, CV, CCV, CVC and CCVC syllables. After just one affix class, there were well over a thousand rules! I scanned rule after rule and wondered if there were any roots that began with a syllable such as “yoy” or “bluw”. I haven’t performed an analysis of the structure of Iloko roots, but I doubt that any exist. The automation produced these doubtful forms as a precaution, even if they would never end up being used.
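The generation step described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the consonant inventory, the restriction of C2 to “l” and “r”, and the function names are all assumptions made for the example. The rule layout copies the “PFX X 0 agbas bas ag+[RED: bas]” lines shown earlier.

```python
from itertools import product

VOWELS = ["a", "e", "i", "o", "u"]
# Assumed consonant inventory, for illustration only; "ng" is treated as one unit.
CONSONANTS = ["b", "d", "g", "h", "k", "l", "m", "n", "ng",
              "p", "r", "s", "t", "w", "y"]
# Assumption: the second consonant of an onset cluster is only "l" or "r".
CLUSTER_SECOND = ["l", "r"]
# "h" is excluded from codas, as stated in the post.
CODAS = [c for c in CONSONANTS if c != "h"]


def syllables():
    """Yield every V, VC, CV, CCV, CVC and CCVC syllable, plausible or not."""
    clusters = [c1 + c2 for c1 in CONSONANTS for c2 in CLUSTER_SECOND if c1 != c2]
    onsets = [""] + CONSONANTS + clusters
    codas = [""] + CODAS
    for onset, vowel, coda in product(onsets, VOWELS, codas):
        yield onset + vowel + coda


def pfx_rules(flag, prefix):
    """Build one Hunspell PFX class that fuses `prefix` with each syllable."""
    rules = [
        f"PFX {flag} 0 {prefix}{syl} {syl} {prefix}+[RED: {syl}]"
        for syl in syllables()
    ]
    # Header line: flag, cross-product marker, number of rules that follow.
    return [f"PFX {flag} Y {len(rules)}"] + rules


if __name__ == "__main__":
    for line in pfx_rules("X", "ag")[:5]:
        print(line)
```

Even with this small assumed inventory, a single prefix yields well over a thousand rules, including doubtful ones such as “agyoy” and “agbluw”, because nothing in the generator knows which syllables actually begin an Iloko root.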

Prefixes with Final Nasals

Prefixes that have final nasals, such as “mang-” and “pang-”, further complicated the process of predicting the form of the reduplicated syllable. Typically, the final nasal, “-ng”, of these prefixes assimilates to the point of articulation of the initial consonant of the root (if possible). Roots that start with vowels are exempt. With some roots, however, not only does the nasal assimilate, but the onset itself becomes that nasal, resulting in a geminate.

pang- + dait > pan- + dait (nasal assimilation) > pannait (consonant becomes nasal).

Or, in some cases, the geminated consonant simplified.

pang- + sao > pan- + sao > pan- + nao > panao

Assimilation affects the first syllable in non-reduplicated forms. But, in reduplicated forms, it is the reduplicated syllable that is affected, while the original syllable is left untouched, e.g. “pan(n)adait” (Rubino 2000). Whether gemination or deletion of the initial consonant occurs was unpredictable, so rules for both were added, again as a precaution since either “could” be possible. This increased the number of rules even further.
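To make the candidate forms concrete, here is a small sketch of how they could be enumerated. It is only an illustration under stated assumptions: the assimilation table is my own guess beyond the “d” and “s” cases shown above, and the function names are hypothetical. The reduplicated syllable is passed in explicitly rather than predicted.

```python
# Assumed place-assimilation table for the final "-ng".
# Only d -> n and s -> n are attested in the examples above; the rest are guesses.
ASSIMILATED_NASAL = {"b": "m", "p": "m", "d": "n", "t": "n", "s": "n"}


def nasal_variants(prefix, root):
    """Return candidate surface forms for an "-ng" prefix attached to `root`."""
    if not prefix.endswith("ng"):
        raise ValueError("expected a prefix ending in 'ng'")
    first = root[0]
    if first not in ASSIMILATED_NASAL:
        # Vowel-initial roots (and consonants not in the table) are left alone.
        return [prefix + root]
    base = prefix[:-2]
    nasal = ASSIMILATED_NASAL[first]
    rest = root[1:]
    return [
        base + nasal + root,          # assimilation only
        base + nasal + nasal + rest,  # onset becomes the nasal (geminate)
        base + nasal + rest,          # geminate simplified
    ]


def reduplicated_variants(prefix, syllable, root):
    """Apply the changes to the reduplicated syllable; the root stays intact."""
    return [form + root for form in nasal_variants(prefix, syllable)]


if __name__ == "__main__":
    print(nasal_variants("pang", "dait"))
    print(nasal_variants("pang", "sao"))
    print(reduplicated_variants("pang", "da", "dait"))
```

For “pang-” plus “dait” this yields “pandait”, “pannait” and “panait”, and for the reduplicated case “pandadait”, “pannadait” and “panadait”. Since the right choice cannot be predicted per root, each candidate becomes its own rule, which is exactly where the extra rules come from.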

These few issues complicated the automation. On one day I spent two hours trying to debug it. Yet, it was well worth it, since it took less than a minute to produce an affix file. Imagine having to write so many rules by hand and without error!

Line Count: 80,000!

As I added more affix classes, I noticed that the number of lines steadily increased. With perhaps 60–70 of the verb affixes automated, the affix file had grown to 80,000 lines! The file itself was an astounding 7 MB with comments and 3 MB without. I compared it with affix files for other languages and found that many were smaller. Hungarian, known for its agglutinative morphology, has a similar affix file with about 90,000 lines. But, with only 60–70 of the affix classes complete, I knew that the Iloko file would reach almost 150,000 lines and possibly 7 MB (without comments). When I tested the affix file against a few entries in the dictionary file, I noticed that there was lag. So, after more consideration, I’ve decided to take another approach.

Reasons for a Bloated Affix File

Initially, my reason for predicting the reduplicated syllable in the affix class was to reduce the number of roots that were needed and to reduce the number of affix classes. Each root in the dictionary file can only take certain affixes. So, by reducing their number, it was easier to assign the right affix to the proper root. But certain issues became more and more apparent as the affix file grew.

The classes grew astoundingly in the number of rules. It became difficult to properly check them or investigate if a rule did not “behave” as expected. Reduplication was difficult to debug.

With only a few entries in the dictionary file, the analyzer took longer to parse a small set of test words. I had a nagging feeling that overall performance would be affected. If this can happen with only a handful of *.dic file entries, can you imagine what would happen with several thousand? The rules for the “doubtful” syllables might be slowing analysis down.

After comparing other affix files for other languages, 3MB did not seem acceptable.

In the end, I decided to move the reduplication into the dictionary file. Listing the reduplicated stem there reduces the complexity and number of rules in the affix file, but it shifts the “responsibility” of explicitly specifying the reduplicated stem to the dictionary, thereby dividing the work between the two files.
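As a rough idea of what that could look like, the sketch below proposes a reduplicated stem for each root and emits both as dictionary entries, so that a plain “ag-” prefix rule would be enough to analyze “agbasbasa”. Everything here is an assumption made for illustration: the consonant classes in the pattern, the flag “X”, and the helper names. Proposed stems would still need to be checked by hand.

```python
import re

# Assumed orthographic classes; "ng" counts as one consonant, "h" is not a coda.
_ONSET = r"(?:ng|[bdghklmnprstwy])?[lr]?"
_CODA = r"(?:ng|[bdgklmnprstwy])"
# Initial (C)(C)V(C) stretch of the root, e.g. "bas" in "basa".
_FIRST_SYLLABLE = re.compile(rf"^{_ONSET}[aeiou]{_CODA}?")


def reduplicated_stem(root):
    """Copy the initial (C)(C)V(C) stretch of `root` in front of it."""
    match = _FIRST_SYLLABLE.match(root)
    if match is None:
        raise ValueError(f"no initial syllable found in {root!r}")
    return match.group(0) + root


def dic_lines(roots, flag="X"):
    """Return .dic lines: an entry count, then each root and its reduplicated stem."""
    entries = []
    for root in roots:
        entries.append(f"{root}/{flag}")
        entries.append(f"{reduplicated_stem(root)}/{flag}")
    return [str(len(entries))] + entries


if __name__ == "__main__":
    print("\n".join(dic_lines(["basa", "dait"])))
```

The trade-off is visible in the output: the dictionary gets roughly twice as many entries, but the affix file no longer needs one rule per prefix and syllable combination.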