Wednesday, December 13, 2017

Spell Checker - On Reduplication

Introduction

Reduplication is the process of repeating all or part of a root or stem to signal some grammatical function.

Reduplication in Iloko is very productive and can be applied to various word categories. It marks the following:

  • Plurality in nouns and in verbs (plural arguments in focus)
  • Limitation – “only one”
  • Progressive action in verbs
  • Intensification in adjectives and adverbs
  • Repetition in verbs

Doing the Math

Much of the reduplication in Iloko affects the beginning of the root or stem, involving the first consonant-vowel or consonant-vowel-consonant.
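As a rough illustration of those two patterns, here is a minimal Python sketch that copies the first consonant-vowel (CV) or consonant-vowel-consonant (CVC) of a root and attaches it to the front. The function names are hypothetical, and the sketch makes simplifying assumptions that are not from the discussion above: it treats every letter as one sound (so the digraph "ng" is not handled), it only recognizes a, e, i, o, u as vowels, and it skips vowel-initial roots entirely.

```python
VOWELS = set("aeiou")  # assumption: plain five-vowel orthography


def reduplicants(root):
    """Return the CV and CVC copies found at the start of a consonant-initial root."""
    root = root.lower()
    # The root must start with consonant + vowel, otherwise nothing is returned.
    if len(root) < 2 or root[0] in VOWELS or root[1] not in VOWELS:
        return {}
    copies = {"CV": root[:2]}
    # A CVC copy exists only if the third letter is a consonant.
    if len(root) >= 3 and root[2] not in VOWELS:
        copies["CVC"] = root[:3]
    return copies


def reduplicate(root, pattern):
    """Prefix the requested copy ('CV' or 'CVC') to the root, or return None."""
    copy = reduplicants(root).get(pattern)
    return None if copy is None else copy + root
```

With the sample root "balay", reduplicate("balay", "CVC") returns "balbalay" and reduplicate("balay", "CV") returns "babalay". These are mechanical outputs of the sketch; which pattern a given word actually takes depends on the grammar, not on this code.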

Hunspell rules need to be written to match each of these combinations. As a result, the rule count runs to several hundred. (No kidding!) More often than not, reduplication also combines with an affix. The base number is then multiplied by the different forms the affix takes to show further inflection, which pushes the count up by a few thousand more. On top of that, there are alternate forms, some of them the result of phonological processes (e.g., progressive assimilation).
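To make the arithmetic concrete, the following Python sketch counts the possible CV and CVC combinations and multiplies them by a number of affix forms. Everything numeric here is an assumption for illustration only: the 13-letter consonant list ignores the "ng" digraph and any other sounds, and the figure of 6 affix forms is hypothetical, not a count taken from the real affix file.

```python
from itertools import product

# Assumed letter inventory, for illustration only.
CONSONANTS = list("bdgklmnprstwy")  # 13 letters
VOWELS = list("aeiou")              # 5 letters


def reduplication_patterns():
    """Enumerate every possible CV and CVC string over the assumed inventory."""
    cv = ["".join(p) for p in product(CONSONANTS, VOWELS)]
    cvc = ["".join(p) for p in product(CONSONANTS, VOWELS, CONSONANTS)]
    return cv, cvc


def combined_rule_count(n_patterns, n_affix_forms):
    """One rule per (pattern, affix form) pair, so the counts multiply."""
    return n_patterns * n_affix_forms


cv, cvc = reduplication_patterns()
base = len(cv) + len(cvc)
print(len(cv), len(cvc), base)       # 65 845 910
print(combined_rule_count(base, 6))  # 5460, with 6 hypothetical affix forms
```

Under these assumptions the base set alone is 910 patterns, and a single affix with 6 forms already needs 5,460 rules. Repeating that for every affix that requires reduplication is what makes the file grow so quickly.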

The affix file now becomes a great pain to manage because it takes too long to load. Even a simple text editor is not built to handle a file of that size.

There are a few hundred affixes in Iloko and many require reduplication. Soon the Iloko affix file had ballooned to a whopping 30 MB. Compared to the affix file for Hungarian (7 MB), a language known for its agglutinative morphology, this seems unreasonable. It also has a direct impact on load time, something that end users will notice (and complain about).

Reducing the Size

If there was a problem with just one rule, it was difficult to root out the cause. Text editors would hang while loading or saving the file.

The number one goal became reducing the file size and improving manageability.

I had to rethink the entire approach and experiment.

Redundancy

The first problem with my initial approaches was redundancy. The same set of reduplication rules was repeated over and over. It wasn’t necessary to create a rule for each combination of a prefix form and a reduplicated syllable.

Why can’t one affix class be responsible for just reduplicated syllables? Why can’t another be responsible for inflectional morphology?

To solve this, I devised a divide-and-conquer approach.

I created an affix class for the reduplicated syllables and had it reference the associated class that contained the prefix forms, which, in turn, referenced the proper set of suffixes.
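The saving from this split can be sketched in Python. The code below emits rule lines shaped like Hunspell's documented prefix rules (a header line "PFX flag cross-product count", then "PFX flag strip affix/continuation condition"), once as a single combined class and once as two chained classes. It is only a sketch under stated assumptions: the flags X, R and P, the sample prefix forms, and the sample syllables are all hypothetical, and how the real affix file chains its classes (and which Hunspell options that requires) is not shown in this post.

```python
def combined_rules(prefix_forms, syllables, flag="X"):
    """Redundant layout: one rule per (prefix form, reduplicated syllable) pair."""
    lines = [f"PFX {flag} Y {len(prefix_forms) * len(syllables)}"]
    for p in prefix_forms:
        for s in syllables:
            # Add prefix + copied syllable when the word starts with that syllable.
            lines.append(f"PFX {flag} 0 {p}{s} {s}")
    return lines


def split_rules(prefix_forms, syllables, redup_flag="R", prefix_flag="P"):
    """Split layout: a reduplication class that points at a separate prefix class."""
    lines = [f"PFX {redup_flag} Y {len(syllables)}"]
    # Each reduplication rule carries the prefix class as a continuation flag.
    lines += [f"PFX {redup_flag} 0 {s}/{prefix_flag} {s}" for s in syllables]
    lines.append(f"PFX {prefix_flag} Y {len(prefix_forms)}")
    lines += [f"PFX {prefix_flag} 0 {p} ." for p in prefix_forms]
    return lines


def rule_counts(n_prefix_forms, n_syllables):
    """Return (combined, split) rule counts, ignoring header lines."""
    return n_prefix_forms * n_syllables, n_prefix_forms + n_syllables


if __name__ == "__main__":
    print("\n".join(split_rules(["ag", "nag"], ["ba", "bal"])))
    print(rule_counts(20, 910))  # hypothetical sizes
```

The point is the difference between multiplying and adding: with a hypothetical 20 prefix forms and 910 syllable patterns, rule_counts(20, 910) gives 18,200 rules for the combined layout against 930 for the split one.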

Granted, the new affix file still has several affix classes for reduplication, and they share similar rules, if not exactly the same ones. But the prefixes or suffixes they refer to differ. I tried to stamp out redundancy altogether, but I’m satisfied with this compromise. The reduplication classes have just a couple thousand rules, which is easier to manage. If there is a problem, it’s usually in the prefix that they refer to.

The next issue to address was redundancy in circumfixes.

Circumfixes Create Bloat

As stated many times, Hunspell only recognizes prefixes and suffixes. That said, it does have a mechanism to recognize a pair of prefixes and suffixes together as one unit to create a circumfix by use of an operational flag. Iloko has a number of circumfixes, so this is a welcome feature.

Regardless, the feature allows only one prefix to be paired with one suffix. So, dividing the responsibilities for the inflected prefix rules and the reduplication rules between two sets was not possible. Again, what was needed was a set of rules numbering in the thousands, one for each probable permutation.

I didn’t gain much space or peace of mind with circumfixes.

It’s a Prefix… Sort of

In the end, I decided to forgo the circumfix feature and treat circumfixes as if they were prefixes.

First, I implemented the divide-and-conquer approach: one set of rules (one affix class) was responsible for reduplication and other classes for the inflected prefix forms. The rule count went from several thousand to just under two thousand. This dealt with the left side of the root.

Second, separate entries for –EN or –AN or both were created in the dictionary file. Special suffix rules handled the various pronoun and enclitic forms. As a result, what had been repeated needlessly was reduced to a few affix classes that could be shared among the prefixes. The only caveat is that the prefixes MUST be used with the right dictionary entry; they already reference the right suffix classes.

Slimmed and Trimmed

I must say that I’m happy with this approach.

What had been a text file, A TEXT FILE, of 20+ megabytes has been reduced to one of just under four, at 3.6 MB. It’s also easier to edit and to suss out problems. Dictionary load times are shorter too, so loading is something end users should no longer notice, which is a good thing.