Thursday, May 17, 2018

Iloko Spell Check: Status

It has been a very long time since I posted about the Iloko spell check.

Hunspell

As a refresher, the spell check is composed of rules for use with Hunspell, a spell-checker and morphological analyzer library (Wikipedia, GitHub), to verify the spelling of words for a specific language or language variant. There are many languages that already have spell-check dictionaries available. Hunspell is integrated with many software titles, some of which you may be using already. One that comes to mind is Mozilla Firefox. (I already have an Add-On to use with Firefox in the works. But, that is for another post.) Two others that come to mind are LibreOffice and OpenOffice.org. Both are productivity suites for word processing, database management, slideshows, etc.

Files

In order for Hunspell to verify words, two files are needed. These are plain-text files and can be read/written with any text editor. Once you understand how the file is formatted, it is easy to understand how everything relates.

The first of the files is the “Dictionary” file (ilo.dic). It is the simpler of the two files and contains a list of “entries”, roots or stems, that the spell-checker can recognize to create correctly spelt words. Each is associated with “flags” or tags that specify attributes or operations that can be performed on the entry. For example, a flag can specify that the entry is only valid if capitalized. So, “Philippines” is valid, but “philippines” is not. (Even in my application, the latter is underlined with the ubiquitous “red, squiggly line” indicating that it is incorrect.) Or, the entry can be flagged as one that the spell-checker should recognize but should not suggest. These are usually “non-standard” words or offensive words. But, the majority of the flags refer to “affixes” that can be applied to create valid words. These affixes are specified in the other file.

The other file is the “Affix” file (ilo.aff). As the name suggests, it contains information about the affixes used in the language. It also includes other basic information about the target language, such as the language (or variant) name and which script the language is written in. But, the bulk is “affix classes” or groups of rules that specify how affixes can combine with entries in the dictionary file. Examples of affixes in English are ‹-ed› for the past tense or the ‹re-› prefix that shows repetition. The affixes can be as simple as the two just mentioned, but something like the ‹-s› plural requires a few rules to specify how it is applied. So, there are rules to turn ‹y› to ‹ie› or add ‹e› after a word that ends in ‹ch›, ‹s›, ‹x› or ‹z› before adding ‹s›.
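To make the idea of an affix rule concrete, here is a minimal Python sketch of the ‹-s› plural rules just described. It is only an illustration: the names SuffixRule, PLURAL_RULES and apply_first_matching are made up for this post, and trying rules in order with "first match wins" is a simplification of mine. A real affix class would normally use mutually exclusive conditions instead.

```python
import re
from typing import NamedTuple


class SuffixRule(NamedTuple):
    strip: str      # characters removed from the end of the word
    add: str        # characters appended after stripping
    condition: str  # regular expression the end of the word must match


# Illustrative rules for the English "-s" plural, tried in order.
PLURAL_RULES = [
    SuffixRule("y", "ies", r"[^aeiou]y$"),  # city -> cities
    SuffixRule("", "es", r"(ch|s|x|z)$"),   # church -> churches
    SuffixRule("", "s", r".$"),             # cat -> cats, day -> days
]


def apply_first_matching(word: str, rules: list[SuffixRule]) -> str:
    """Apply the first rule whose condition matches the word."""
    for rule in rules:
        if re.search(rule.condition, word):
            stem = word[: len(word) - len(rule.strip)] if rule.strip else word
            return stem + rule.add
    return word
```

Each rule carries the same three pieces of information as a suffix rule in the affix file: what to strip, what to add, and the condition under which the rule applies.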

Iloko

After a few attempts, I believe that I’ve struck upon a system to organize the affixes in the affix file for Iloko. Simply put, it uses a divide-and-conquer methodology.

Reduplication, a process where part or all of a root or stem is repeated, is a major consideration and contributes to the size of the file. Since Hunspell does not have a built-in mechanism for it, rules targeting specific syllables are needed. In my latest scheme, sets of reduplicated syllables are treated as just another kind of affix class, or “duplifixes”: repeated parts of a stem that carry meaning. And, the affixes “proper” constitute their own affix classes. (More about this in another post.)

Currently, the affix file is rather extensive and still a bit larger than even Hungarian’s (77KB), a language notorious for its agglutinative morphology. Iloko’s stands at five MB and takes time to load. The culprit is reduplication, as mentioned previously. It is very productive in Iloko and signals various grammatical categories (e.g., pluralization, aspect, etc.), and as such, separate rules are needed to match reduplicated syllables. Many of the affix classes have as many as 1,200 rules. Up for consideration is that some of the affixes could be eliminated and the words they create (or predict) be separate entries in the dictionary file. Nevertheless, the current scheme appears to be working after some testing. And, as I continue testing, anything that is awry is diagnosed and fixed. But, so far I haven’t found any further issues with the affixes or with how their rules are specified.

The dictionary file is slowly growing as I add entries and associate the applicable affixes. But, this is a meticulous task because not all roots or stems can take all affix classes. I am using Carl Rubino’s Ilokano Dictionary since it is root-based and takes away any guessing. At the same time, I have to rely on my own recollection (good or bad) to assign affix classes. Currently, there are about 3,000 entries. They include the following closed classes:

  • Functional words (conjunctions, adpositions, the ligature)
  • Determinatives (e.g., “this”, “those”, “the”)
  • Pronouns (e.g., “I”, “them”)
  • Interjections (e.g., “Susmaryosep!”, “Ukinnamon!”)
  • Numbers (including Spanish numbers) and quantifiers (e.g., “all”, “some”, “kilo”)
  • Time-related words (days of the week, months, general time: “day”, “night”)

Other classes included:

  • Irregular formations (mostly verbs and quasi-verbs)
  • Kinship terms with irregular plurals
  • Countries and regions

Other roots and stems are being added as I have time.

My original plan for adding entries to the dictionary file was to find the most commonly used roots in Iloko literature. Instead of creating a corpus (a collection of texts) myself, I decided to approach Glosbe, which has an Iloko corpus. I sent them an email explaining my aim over a month ago. I have yet to receive a response to my inquiry. In the meantime, I am investigating statistical models and Iloko blogs whose content I can use in my analysis to focus and refine my efforts rather than entering everything in Rubino’s dictionary.

Goals:

  • 1st public beta - 30-50% recognition
  • 2nd public beta - 50%+ recognition
  • 3rd public beta - 60%+ recognition
  • nth public beta - increasing recognition
  • Version 1.0 - 80%+

With those numbers in mind, I’ve planned on having the first beta by the middle of June, 2018. It will be in the form of an Add-On for Mozilla Firefox, an extension for LibreOffice (6.1.x) and a dictionary for DSpellCheck, a plug-in for Notepad++. As of right now, the files are posted to GitHub (https://github.com/joemaza/spellcheck/tree/master/dictionaries/ilo) if you would like to view them.

Wednesday, December 13, 2017

Spell Checker - On Reduplication

Introduction

Reduplication is a process of repeating all or part of a root or a stem and signals some sort of grammatical function.

Reduplication in Iloko is very productive and can be applied to various word categories. It marks the following:

  • Plurality in nouns and in verbs (plural arguments in focus)
  • Limitation – “only one”
  • Progressive action in verbs
  • Intensification in adjectives and adverbs
  • Repetition in verbs

Doing the Math

Much of the reduplication in Iloko affects the beginning of the root or stem, involving the first consonant-vowel or consonant-vowel-consonant.

Hunspell rules need to be written to match these combinations. As a result, the rules tend to number in the several hundreds. (No kidding!) More often than not, reduplication also combines with an affix. This base number is then multiplied by the different forms of the affix to show further inflection, which increases the number into the thousands. Then, there are alternate forms because of phonological processes (e.g., progressive assimilation).

The affix file now becomes a great pain to manage because it takes too long to load. Even a simple text editor is not programmed to handle files of such a size.

There are a few hundred affixes in Iloko and many require reduplication. Soon the size of the Iloko affix file ballooned to a whopping 30 MB. Compared to the affix file for Hungarian (7 MB), a language known for its agglutinative morphology, this seems unreasonable and has a direct impact on the load time, something that end users will notice (and complain about).

Reducing the Size

If there were a problem with just one rule, it was difficult to root out the cause. Text editors would hang after loading or saving the file.

The number one goal became reducing the size and improving manageability.

I had to rethink the entire approach and experiment.

Redundancy

The first problem with my initial approaches was redundancy. The same set of reduplication rules was repeated. It wasn’t necessary to create rules for each combination of a prefix form and a reduplicated syllable.

Why can’t one affix class be responsible for just reduplicated syllables? Why can’t another be responsible for inflectional morphology?

To solve this, I devised a divide-and-conquer approach.

I created an affix class for the reduplicated syllables and had them reference the associated class that contained the prefix forms, which in turn, referenced the proper set of suffixes.
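The saving from this split is mostly arithmetic, which the following Python sketch tries to show. The function names and the example figures are hypothetical and only there for illustration; they are not counts taken from the actual affix file.

```python
def fused_rule_count(reduplicants: int, prefix_forms: int) -> int:
    """Every prefix form is fused with every reduplicated syllable."""
    return reduplicants * prefix_forms


def split_rule_count(reduplicants: int, prefix_forms: int) -> int:
    """One class lists the syllables once; another lists the prefix forms."""
    return reduplicants + prefix_forms


# Hypothetical figures: 1,000 reduplicated syllables and 4 forms of one prefix.
EXAMPLE_FUSED = fused_rule_count(1000, 4)  # 4000 rules in a single class
EXAMPLE_SPLIT = split_rule_count(1000, 4)  # 1004 rules across two classes
```

The fused layout grows with the product of the two numbers, while the split layout grows with their sum, which is why separating the responsibilities shrinks the file so much.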

Granted, there are still several affix classes for reduplication in the new affix file, and they share similar rules, if not exactly the same ones. But, the prefixes or suffixes they refer to differ. I tried to stamp out redundancy altogether. But, I’m satisfied with this compromise. The reduplication classes have just a couple thousand rules, which is easier to manage. If there is a problem, it’s usually the prefix that they refer to.

The next issue to address dealt with redundancy with circumfixes.

Circumfixes Create Bloat

As stated many times, Hunspell only recognizes prefixes and suffixes. That said, it does have a mechanism to recognize a pair of prefixes and suffixes together as one unit to create a circumfix by use of an operational flag. Iloko has a number of circumfixes, so this is a welcome feature.

Regardless, the feature would allow only one prefix and one suffix pairing. So, dividing the responsibilities of inflected prefix rules and reduplication rules between two sets was not possible. Again, what was needed was a set of rules numbering in the thousands for each probable permutation.

I didn’t gain much space or peace of mind with circumfixes.

It’s a Prefix… Sort of

In the end, I decided to forego the circumfix feature and treat circumfixes as if they were prefixes.

First, I implemented the divide-and-conquer approach: one set of rules, or affix class, was responsible for reduplication and other classes for the inflected prefix forms. The rule count went from several thousand to just under two thousand. This dealt with the left side of the root.

Second, separate entries for –EN or –AN or both were created in the dictionary file. Special suffix rules handled the various pronoun and enclitic forms. As a result, what had been repeated needlessly was reduced to a few affix classes that could be shared among the prefixes. The only caveat is that the prefixes MUST be used with the right dictionary entry; they already reference the right suffix classes.

Slimmed and Trimmed

I must say that I’m happy with this approach.

What had been a text file, A TEXT FILE, of several megabytes (20+) has been reduced to one of just under four at 3.6MB. It’s also more manageable to edit and to suss problems out. In addition, dictionary load times are shorter, too, something that end users should not notice which is a good thing.

Monday, November 13, 2017

That’s Not Attested!

A while ago, when I was pondering the possible combinations of the infix <INN> of Iloko, I thought of two possible ways that it could inflect. I could not determine which was "correct", so I asked for the opinions of Iloko speakers on a Facebook group. I was told that one was possible because of Iloko phonotactics, yet was "unattested" and that other areas where Iloko is spoken might not use the form. There were other suggestions on words to use, but I am of the mind that when I ask, the focus should stay on what I presented; I wasn’t seeking alternatives.

It was acknowledged that the forms did not “violate” the phonotactics of Iloko and that, although a bit contrived, they were nevertheless well-formed.

I prefaced that the response would affect how I crafted the soon-to-be spelling dictionary. Existing literature can be used when testing the spellchecker to ensure that it is working correctly. But, the main purpose of a spelling dictionary is to check for words that are in the process of BEING written. The purpose is NOT to check the spelling of something that has ALREADY been written.

Although the responder meant well, he missed the point entirely. It is this idea of disallowing what is possible in favor of the restrictive confines of what has been that I have had to work with. It can be rather frustrating, and it is hard to find someone who can strike a balance between the two as a resource. It is also a cultural critique: What is already established is “right”; deviations or innovations are undesirable.

Tuesday, June 16, 2015

How to Simulate Reduplication in Hunspell

Reduplication is a process where a part of or the whole root or stem is repeated. It is used for both inflection and derivation in Iloko.

In my approach, reduplicated whole roots or stems are listed as different entries in the *.dic file. First of all, creating a rule to predict the reduplicated form of a whole root or stem is not practical. Second, words derived from this type of reduplication can change lexical categories and meaning, i.e., derivation. For example, saka-saka (adjective) “barefoot” from saka (noun) “foot” is listed as an entry in the *.dic file and it is tagged with the affix classes that can be applied to it.

saka/xy
saka-saka/ab

Hunspell does not have a mechanism for reduplication, whole or partial. It is only capable of dealing with what it considers prefixes and suffixes. There had to be a way to reproduce reduplication using its framework of rules! Unfortunately, this limitation has deterred at least one person’s attempt at creating a spell-check dictionary for Tagalog, as shown in the following thread: helping to implement a grammar checker...[sic].

Nevertheless, reduplication is possible if you “think outside the box”. To simulate the process, we must use the same approach used for infixation: create “compound prefixes”!

Partial word reduplication centers around the first syllable of the root (at least in Iloko and Tagalog). And, it is the syllable that we have to write the rules for. The first syllable can be any one of the following types: V, VC, CV and CVC. (V = vowel; C = consonant).

5 vowels
14 final consonants
25 initial consonants and consonant clusters
? inflectional forms (depends on affix)

All in all, if you do the math (25 × 5 × 14), there are 1,750 “possible” syllables, or 1,750 rules that can be created to simulate reduplication! Imagine multiplying that figure by the various inflectional verb forms and that figure can double to 3,500. And, that would be one class! It’s because of this number that I’ve devised automation to assist in creating affix classes.
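The arithmetic can be checked with a few lines of Python. This is just a sketch built on the three inventory sizes listed above. The function name is made up, and note that the 1,750 figure corresponds to the CVC type alone; the smaller V, VC and CV types are shown for comparison.

```python
VOWELS = 5
FINAL_CONSONANTS = 14
INITIAL_ONSETS = 25  # initial consonants and consonant clusters


def syllable_counts(onsets: int, vowels: int, codas: int) -> dict[str, int]:
    """Number of possible first syllables for each syllable type."""
    return {
        "V": vowels,
        "VC": vowels * codas,
        "CV": onsets * vowels,
        "CVC": onsets * vowels * codas,
    }


COUNTS = syllable_counts(INITIAL_ONSETS, VOWELS, FINAL_CONSONANTS)
CVC_RULES = COUNTS["CVC"]          # 25 * 5 * 14 = 1750
DOUBLED = CVC_RULES * 2            # two inflectional forms: 3500
ALL_TYPES = sum(COUNTS.values())   # 5 + 70 + 125 + 1750 = 1950
```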

Another approach is to create another entry in the dictionary with the reduplicated part of the word.

Example:
1) saka
2) saksaka
3) sasaka
4) saka-saka
5) saksaka-saka
6) sasaka-saka


Number one would be the basic form, so any affixes can be applied to it. Number two can be used in certain verb forms or it can be used as the distributive plural. Number three can be used for certain verb forms that require only CV reduplication, for example. And, number four is a lexicalized entry that means “to go barefoot”. The other forms can then be used for forms based on “saka-saka”.

What is nice about this approach is that the affix file no longer has to “guess” reduplicated syllables or mutations that may occur because of phonological processes. The con is that the necessary stems need to be created and the appropriate affix flags have to be associated with the other forms which can tax someone who might be adding entries to the dictionary file.
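Creating the extra stems by hand is the tedious part, so here is a rough Python sketch of how the CV and CVC forms in the example could be produced. It is an assumption-laden illustration, not the automation I actually use: the function name is hypothetical, it expects lowercase roots, treats “ng” as one consonant, and ignores any phonological mutations.

```python
import re

# Onset (zero or more consonants), the first vowel, and one following consonant.
_FIRST_SYLLABLE = re.compile(r"^((?:ng|[^aeiou])*)([aeiou])(ng|[^aeiou])?")


def reduplicated_stems(root: str) -> dict[str, str]:
    """Return the CV- and CVC-reduplicated stems of a root."""
    match = _FIRST_SYLLABLE.match(root)
    if match is None:
        raise ValueError(f"no vowel found in {root!r}")
    onset, vowel = match.group(1), match.group(2)
    following = match.group(3) or ""
    return {
        "CV": onset + vowel + root,
        "CVC": onset + vowel + following + root,
    }
```

For the root saka this gives sasaka and saksaka, and for saka-saka it gives sasaka-saka and saksaka-saka, matching entries two, three, five and six above. The affix flags for each new stem would still have to be assigned by hand.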

Circumfixes in Hunspell

Circumfixes are a pair of affixes, usually a prefix and a suffix and in some instances in Iloko, an infix and suffix, that must co-occur. Hunspell has the capability of recognizing circumfixes. Simply assign a special flag, and when specifying rules, assign the circumfix flag to the rules of the prefix and assign both the flag of the prefix and the circumfix flag to the rules for the suffix. Simple, huh? Let’s take a look at an example: the location circumfix pag><an.

[*.aff file]
# Circumfix flag. This flag will be used to mark affixes as members of a circumfix
CIRCUMFIX X

PFX P   Y    1
PFX P   0    pag/X .

SFX S   Y     1
SFX S   0     an/PX .

[*.dic file]
1
adal/S

“X” is used to designate circumfix pairs throughout the *.aff file. The prefix, pag, is assigned the flag “P” and its single rule has a “continuation” flag, the flag after the “/” following the form of the prefix. The flag of the suffix of the pair is “S”. Its only rule is similarly marked, but it also has the flag of the prefix. In the dictionary, only the suffix flag is used with the root adal “to learn”.

If we were to try this out on some “words”, we would see results similar to the following when using the specifications above.

pagadal [false]
adalan [false]
pagadalan [true]

Just using either the prefix or suffix alone results in invalid words. The only “correct” form is the third.
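The behavior in the test above can be imitated with a small Python sketch. This is not how Hunspell works internally; it only mirrors the outcome for the single pag><an pair and the single root adal, and the names PREFIX, SUFFIX, ROOTS and accepts are made up for the illustration.

```python
PREFIX = "pag"
SUFFIX = "an"
ROOTS = {"adal"}


def accepts(word: str) -> bool:
    """Accept a bare root, or a root wrapped in BOTH halves of the circumfix."""
    if word in ROOTS:
        return True
    if not (word.startswith(PREFIX) and word.endswith(SUFFIX)):
        return False
    root = word[len(PREFIX): len(word) - len(SUFFIX)]
    return root in ROOTS
```

With this sketch, pagadal and adalan are rejected and pagadalan is accepted, just like the three test words.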

So, the above three are the main over-arching considerations that I encountered while creating a Hunspell dictionary. And, as I’ve moved along and stepped back only to move forward again, I find further details that make the task challenging.

How to Simulate Infixation in Hunspell

Infixation is where the affix is “inserted” into the stem, usually between the first consonant (if there is one) and the vowel of the first syllable. In Iloko and other Philippine languages, this type of affix is very productive and occurs in many paradigms of many of the lexical categories.

The only affix types that Hunspell recognizes, however, are “prefix” and “suffix”, in other words, affixing to the left and to the right of the stem. Nevertheless, rules can be written to simulate the process.

In Iloko it is rather simple: if the syllable begins with a consonant, insert the infix between it and the vowel, otherwise, treat it like a prefix.

Example
root: sarita – talk, speech
s<um>arita

root: andar – to run (of machines), to function, to operate
um- andar

The maximal syllable in Iloko is CVC, which makes infixation simple. But, with the adoption of Spanish and English loans, there are syllables that begin with two or more consonants. For example, prito (from the past participle of Spanish freir, lit. “fried”) is a commonly used word in Iloko and Tagalog. Iloko’s strategy for infixation is to insert before the vowel, i.e. prinito. Initial clusters thus become another consideration.
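The placement rule itself is short enough to sketch in Python. The function name is hypothetical and the sketch assumes lowercase roots written with the five vowel letters; it only shows where the infix lands, not which roots may take it.

```python
import re

# Everything before the first vowel letter counts as the onset (possibly empty).
_ONSET = re.compile(r"^[^aeiou]*")


def infix(root: str, affix: str) -> str:
    """Insert the affix after the initial onset, or prefix it if there is none."""
    onset = _ONSET.match(root).group(0)
    return onset + affix + root[len(onset):]
```

So sarita with um becomes sumarita, andar becomes umandar, and prito with in becomes prinito, since the whole cluster “pr” stays in front of the infix.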

Hunspell rules have to be written in such a way that the first consonant (if the first syllable has one) and the infix together appear to be a prefix.

PFX I  Y  25
PFX I  0  um   [aeiou]
PFX I  b  bum  b
PFX I  d  dum  d
PFX I  g  gum  g
. . .
PFX I  t  tum  t

The first rule is straightforward:

If the root begins with a vowel, treat the infix as a prefix and attach it to the left side of the root.

So, with uli “to ascend, go up” the result is umuli. But, the remaining rules show how to deal with roots and stems that begin with consonants.

The value in the third “column” is the set of characters to remove from the beginning of the root: “b” in the first consonant rule and “d” in the next. We want to remove it because we will replace the initial letter with what is in the fourth column, e.g. “bum”, the initial consonant with the infix in place. The fifth column specifies the condition under which we want the rule to apply. As expected, for the first consonant rule it is “b”.

Infixation

root: takder – to stand

1) akder (remove the ‘t’)
2) tum (assemble the pseudo-prefix)
3) tumakder (add to root’s left-word edge, the beginning)

With this strategy, rules for simulating the process of infixation can be written and accounted for. The number of rules needed, however, is determined by the number of possible onsets in Iloko: 14 single consonants and 10 clusters, so 24 distinct consonant rules, plus the one vowel rule for the 25 declared in the class header.
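Since every consonant rule has the same shape, the class can be generated rather than typed. The following Python sketch is a hypothetical generator, not my actual automation, and the onset list passed to it is only the handful of consonants shown in the example above rather than the full inventory.

```python
def infix_prefix_rules(flag: str, infix: str, onsets: list[str]) -> list[str]:
    """Build the lines of a PFX class that simulates an infix."""
    # Vowel-initial roots: the infix behaves like an ordinary prefix.
    rules = [f"PFX {flag} 0 {infix} [aeiou]"]
    # Consonant-initial roots: strip the onset, add onset + infix.
    rules += [f"PFX {flag} {onset} {onset}{infix} {onset}" for onset in onsets]
    header = f"PFX {flag} Y {len(rules)}"
    return [header] + rules


SAMPLE = infix_prefix_rules("I", "um", ["b", "d", "g", "t"])
```

The header count is computed from the rules, so it stays in step as onsets are added. With the full set of 24 onsets it would read 25, as in the class above.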

Thursday, July 3, 2014

A Change in Course

Background

Previously, I had in mind “predicting” how the first syllable of a root is reduplicated. It seemed simple enough: just figure out what can be an onset or onsets (Spanish and English loans), what the vowels are (“a”, “e”, “i”, “o” and “u”, orthographically) and what can be codas (“h” and clusters excluded). But, as I automated the process of creating the possible syllables in Iloko, I was astounded by the sheer number of rules each affix class would require.

Hunspell doesn’t have the mechanism to properly analyze reduplication. So, I devised a way in which it would be able to analyze reduplicated syllables: Fuse the prefix and the reduplicated syllable into one pseudo-prefix!

Example: agbasbasa – he or she is reading/learning

In the example above, the first syllable (“bas”) of the root basa (“to read/learn”) is reduplicated and the verb prefix “ag-“ is attached.  Likewise, other verb prefixes follow the same pattern. (Iloko and Tagalog, for that matter, are prefix heavy.) This made it necessary to create a rule that had a form of the affix and the reduplicated target syllable that matched. Hunspell would analyze the words as “agbas-“ (prefix) and “basa” (root/stem).

PFX X 0 agbas bas ag+[RED: bas]
. . .
PFX X 0 agwuw wuw ag+[RED: wuw]

Iloko syllables can be minimally V and maximally C1C2VC, in other words, anywhere between a single vowel and a closed syllable with a two-consonant onset. In the history of the language, however, CVC was the maximal syllable. It was only when Spanish words entered the lexicon that C1C2-type syllables became common. The C2 is commonly /l/ or /r/.

First Approach

My first thought was to create an affix class of just reduplicated syllables. Any prefix could just reference this class and it would not be necessary to generate the same syllables for each and every prefix. Hunspell is capable of analyzing two prefixes or two suffixes, but not both. In addition, some affixes could only be used with a restrictive set of other affixes and circumfixes posed a problem because prefix X could only occur with suffix Y and with another set of suffixes. So, I opted to automate the process of creating rules for all the “possible” syllables combined with the affix.

Writing the automation was rather straightforward: Generate rules with the prefix and V, VC, CV, CCV, CVC and CCVC syllables. After just one affix class, there were well over a thousand rules! I scanned rule after rule and wondered if there were any roots that began with a syllable such as “yoy” or “bluw”. I haven’t performed an analysis of the structure of Iloko roots, but I doubt that any exist. The automation produced doubtful forms as a precaution, even if they eventually were not used.
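A stripped-down version of that kind of automation might look like the Python sketch below. It is a reconstruction for illustration only: the function name and the tiny inventories are hypothetical, and empty strings stand in for a missing onset or coda so that V, VC, CV and CVC syllables all come out of one loop.

```python
from itertools import product


def fused_rules(flag: str, prefix: str, onsets: list[str],
                vowels: list[str], codas: list[str]) -> list[str]:
    """One PFX rule per syllable: prefix + reduplicated syllable, fused."""
    syllables = [o + v + c for o, v, c in product(onsets, vowels, codas)]
    lines = [f"PFX {flag} Y {len(syllables)}"]
    # Strip nothing, add prefix + syllable, only before roots starting with it.
    lines += [f"PFX {flag} 0 {prefix}{s} {s}" for s in syllables]
    return lines


# Tiny, made-up inventories; "" means "no onset" or "no coda".
SAMPLE = fused_rules("X", "ag", ["", "b", "w"], ["a", "u"], ["", "s", "w"])
```

Even these toy inventories (3 onsets × 2 vowels × 3 codas) already yield 18 rules, including doubtful ones such as agwuw, which shows how quickly a full inventory passes a thousand.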

Prefixes with Final Nasals

Prefixes that have final nasals, such as “mang-” and “pang-”, further complicate the process of predicting the form of the reduplicated syllable. Typically, the final nasal, “-ng”, of these prefixes assimilates to the point of articulation of the initial consonant of the root (if possible). Roots that start with vowels are exempt. With some roots, however, not only does the nasal assimilate but the onset becomes said nasal, resulting in a geminate.

pang- + dait > pan- + dait (nasal assimilation) > pannait (consonant becomes nasal).

Or, in some cases, the geminated consonant simplified.

pang- + sao > pan- + sao > pan- + nao > panao

Assimilation affected the first syllable in non-reduplicated forms. But, when reduplicated, the reduplicated syllable was affected, while the original syllable was unaffected, e.g. “pan(n)adait” (Rubino 2000). Whether gemination or deletion of the initial consonant occurred was unpredictable, so rules for both were added, again, as a precaution since it “could” be possible and increased the number of rules.
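The non-reduplicated derivations above can be sketched in Python as follows. Treat it as a rough illustration only: the function name is made up, the consonant-to-nasal table is my simplified assumption about points of articulation, and the reduplicated forms are not modeled. Because the outcome is unpredictable, all candidates are returned.

```python
# Assumed mapping from a root's first consonant to the assimilated nasal.
NASAL_FOR = {"p": "m", "b": "m", "t": "n", "d": "n", "s": "n", "k": "ng", "g": "ng"}


def nasal_prefix_forms(prefix: str, root: str) -> list[str]:
    """Candidate forms for an -ng prefix: assimilated, geminate, simplified."""
    if not prefix.endswith("ng"):
        raise ValueError("expected a prefix ending in 'ng'")
    nasal = NASAL_FOR.get(root[0])
    if nasal is None:
        # Vowel-initial (or unlisted) roots: leave the prefix unchanged.
        return [prefix + root]
    base = prefix[:-2]
    rest = root[1:]
    return [
        base + nasal + root,          # nasal assimilation only
        base + nasal + nasal + rest,  # onset becomes the nasal (geminate)
        base + nasal + rest,          # geminate simplified
    ]
```

For pang- + dait this returns pandait, pannait and panait, and for pang- + sao it returns pansao, pannao and panao, covering both derivations shown above.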

The few issues I faced complicated the automation. On one day, I spent two hours trying to debug it. Yet, it was well worth it, and it took less than a minute to produce an affix file. Imagine having to write so many rules without error and by hand!

Line Count: 80,000!

As I added more affix classes, I noticed that the number of lines steadily increased. With perhaps 60–70 of the verb affixes automated, the affix file had grown to 80,000 lines! The file itself was an astounding 7 MB with comments and 3 MB without. I compared affix files for other languages and found that many were smaller. Hungarian, known for its agglutinative morphology, has a similar affix file with about 90,000 lines. But, with only 60–70 of the affix classes complete, I knew that the Iloko file would reach almost 150,000 lines and possibly 7MB (without comments). When I tested the affix file against a few entries in the dictionary file, I noticed that there was lag. So, after more consideration, I’ve decided to take another approach.

Reasons for a Bloated Affix File

Initially, my reason for predicting the reduplicated syllable in the affix class was to reduce the number of roots that were needed and reduce the number of affix classes. Each root in the dictionary file can only take certain affixes. So, by reducing their number it was easier to assign the right affix to the proper root. But, I’ve found that certain issues became more and more apparent as the affix file grew.

The classes grew astoundingly in the number of rules. It became difficult to properly check them or investigate if a rule did not “behave” as expected. Reduplication was difficult to debug.

With only a few entries in the dictionary file, the analyzer took longer to parse a small set of test words. I had a nagging feeling that overall performance would be affected. If this can happen with only a handful of *.dic file entries, can you imagine what would happen with several thousand? The rules for the “doubtful” syllables might be slowing analysis down.

After comparing other affix files for other languages, 3MB did not seem acceptable.

In the end, I decided to move the reduplication into the dictionary file. Listing the reduplicated stem there reduced the complexity and number of rules in the affix file, but it shifted the “responsibility” of explicitly specifying the reduplicated stem to the dictionary, thereby dividing the work.