Tuesday, June 16, 2015

How to Simulate Reduplication in Hunspell

Reduplication is a process where part or all of a root or stem is repeated. It is used for both inflection and derivation in Iloko.

In my approach, fully reduplicated roots or stems are listed as separate entries in the *.dic file. First, creating a rule to predict the reduplicated form of a whole root or stem is not practical. Second, words derived from this type of reduplication can change lexical category and meaning, i.e., the process is derivational. For example, saka-saka (adjective) “barefoot” from saka (noun) “foot” is listed as its own entry in the *.dic file and is tagged with the affix classes that can be applied to it.

saka/xy
saka-saka/ab

Hunspell does not have a mechanism for reduplication, whole or partial. It is only capable of dealing with what it considers prefixes and suffixes. There had to be a way to reproduce reduplication using its framework of rules! Unfortunately, this limitation has deterred at least one person’s attempt at creating a spell-check dictionary for Tagalog, as shown in the following thread: helping to implement a grammar checker...[sic].

Nevertheless, reduplication is possible if you “think outside the box”. To simulate the process, we must use the same approach used for infixation: create “compound prefixes”!

Partial reduplication centers on the first syllable of the root (at least in Iloko and Tagalog), and that is the syllable we have to write the rules for. The first syllable can be any one of the following types: V, VC, CV and CVC (V = vowel; C = consonant).

5 vowels
14 final consonants
25 initial consonants and consonant clusters
? inflectional forms (depends on affix)

All in all, if you do the math (5 × 14 × 25), there are 1,750 “possible” syllables, or 1,750 rules that can be created to simulate reduplication! Imagine multiplying that figure by the various inflectional verb forms, and it can double to 3,500. And that would be one class! It’s because of this number that I’ve devised automation to assist in creating affix classes.
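The count, and the kind of automation I mean, might be sketched roughly like this in Python. It is only an illustration under assumptions: the function name `reduplication_rules`, the flag `R` and the tiny inventories are placeholders I made up for the example, and only CVC syllables are enumerated. Each generated rule strips nothing and prepends a copy of the first syllable when the stem begins with that syllable.

```python
from itertools import product

# Counts quoted in the text: vowels x final consonants x initial onsets.
print(5 * 14 * 25)

def reduplication_rules(flag, onsets, vowels, codas):
    """Build Hunspell PFX lines that prepend a copy of a CVC first syllable."""
    syllables = ["".join(parts) for parts in product(onsets, vowels, codas)]
    lines = [f"PFX {flag} Y {len(syllables)}"]
    for syl in syllables:
        # Strip nothing ("0"), add the syllable, apply only if the stem starts with it.
        lines.append(f"PFX {flag} 0 {syl} {syl}")
    return lines

# Toy inventories, assumed for illustration only (not the real Iloko inventory).
rules = reduplication_rules("R", ["s", "t"], ["a", "i"], ["k", "n"])
print("\n".join(rules))
```

With these toy inventories, the stem saka would be picked up by the `sak` line, which yields saksaka.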

Another approach is to create another entry in the dictionary with the reduplicated part of the word.

Example:
1) saka
2) saksaka
3) sasaka
4) saka-saka
5) saksaka-saka
6) sasaka-saka


Number one would be the basic form, so any affixes that attach to the bare root are flagged on it. Number two can be used in certain verb forms or as the distributive plural. Number three can be used for certain verb forms that require only CV reduplication, for example. And number four is a lexicalized entry that means “to go barefoot”. The other forms can then be used for forms based on “saka-saka”.

What is nice about this approach is that the affix file no longer has to “guess” reduplicated syllables or mutations that may occur because of phonological processes. The con is that the necessary stems need to be created and the appropriate affix flags have to be associated with each of the other forms, which can tax someone who is adding entries to the dictionary file.
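Creating those extra stems by hand is the tedious part, so here is a minimal Python sketch of how the six forms listed for saka could be produced. The helper name `stem_variants` is hypothetical, and the sketch assumes plain copying of the first CV or CVC with no phonological mutation; it does not assign affix flags.

```python
VOWELS = "aeiou"

def stem_variants(root):
    """Return the root plus its CVC, CV and full reduplicated forms."""
    # Position of the first vowel; this sketch assumes the root contains one.
    first_vowel = next(i for i, ch in enumerate(root) if ch in VOWELS)
    cv = root[: first_vowel + 1]    # e.g. "sa"
    cvc = root[: first_vowel + 2]   # e.g. "sak"
    full = f"{root}-{root}"         # e.g. "saka-saka"
    return [root, cvc + root, cv + root, full, cvc + full, cv + full]

variants = stem_variants("saka")
print(variants)
```

Vowel-initial roots and any mutations would still need to be checked by hand before the output goes into the *.dic file.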

Circumfixes in Hunspell

Circumfixes are pairs of affixes, usually a prefix and a suffix (and in some instances in Iloko, an infix and a suffix), that must co-occur. Hunspell is capable of recognizing circumfixes. Simply declare a special circumfix flag; then, when specifying rules, assign the circumfix flag to the rules of the prefix, and assign both the flag of the prefix and the circumfix flag to the rules of the suffix. Simple, huh? Let’s take a look at an example: the location circumfix pag><an.

[*.aff file]
# Circumfix flag. This flag will be used to mark affixes as members of a circumfix
CIRCUMFIX X

PFX P   Y    1
PFX P   0    pag/X .

SFX S   Y     1
SFX S   0     an/PX .

[*.dic file]
1
adal/S

“X” is used to designate circumfix pairs throughout the *.aff file. The prefix, pag, is assigned the flag “P”, and its single rule has a “continuation” flag, the flag after the “/” following the form of the prefix. The flag of the suffix of the pair is “S”. Its only rule is similarly marked, but it also carries the flag of the prefix. In the dictionary, only the suffix flag is used with the root adal “to learn”.

If we were to try this out on some “words”, we would see results similar to the following when using the specifications above.

pagadal [false]
adalan [false]
pagadalan [true]

Just using either the prefix or suffix alone results in invalid words. The only “correct” form is the third.
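The behavior can be mimicked with a toy model in Python. This is not how Hunspell works internally; it is only a sketch of the co-occurrence requirement, and the function name `accepts` and its parameters are made up for the illustration.

```python
def accepts(word, roots, prefix="pag", suffix="an"):
    """Toy model: accept a bare root, or a root wearing BOTH circumfix halves."""
    if word in roots:
        return True
    if word.startswith(prefix) and word.endswith(suffix):
        stem = word[len(prefix):-len(suffix)]
        return stem in roots
    return False

roots = {"adal"}
for word in ["adal", "pagadal", "adalan", "pagadalan"]:
    print(word, accepts(word, roots))
```

Only the bare root and the form with both halves come back as accepted, which mirrors the three results listed above.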

So, the above three are the main over-arching considerations that I encountered while creating a Hunspell dictionary. And, as I’ve moved along and stepped back only to move forward again, I find further details that make the task challenging.

How to Simulate Infixation in Hunspell

Infixation is where the affix is “inserted” into the stem, usually between the first consonant (if there is one) and the vowel of the first syllable. In Iloko and other Philippine languages, this type of affix is very productive and occurs in many paradigms across many of the lexical categories.

The only affix types that Hunspell recognizes, however, are “prefix” and “suffix”, in other words, affixing to the left and to the right of the stem. Nevertheless, rules can be written to simulate the process.

In Iloko it is rather simple: if the syllable begins with a consonant, insert the infix between it and the vowel, otherwise, treat it like a prefix.

Example
root: sarita – talk, speech
s<um>arita

root: andar – to run (of machines), to function, to operate
um-andar

The maximal syllable in Iloko is CVC, which makes infixation simple. But with the adoption of Spanish and English loans, there are syllables that begin with two or more consonants. For example, prito (from the past participle of Spanish freír, lit. “fried”) is a commonly used word in Iloko and Tagalog. Iloko’s strategy for infixation is to insert before the vowel, i.e., prinito. Initial clusters thus become another consideration.
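Stated as code, the linguistic rule itself (before any Hunspell trickery) might look like this minimal Python sketch. The function name `infix` is hypothetical, and the sketch assumes the five plain vowels are enough to find the insertion point.

```python
VOWELS = "aeiou"

def infix(root, affix):
    """Insert `affix` before the first vowel; vowel-initial roots get it as a prefix."""
    for i, ch in enumerate(root):
        if ch in VOWELS:
            return root[:i] + affix + root[i:]
    raise ValueError(f"no vowel found in {root!r}")

print(infix("sarita", "um"))
print(infix("andar", "um"))
print(infix("prito", "in"))
```

Because the insertion point is the first vowel rather than the second letter, the same function covers single consonants, clusters and vowel-initial roots.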

Hunspell rules have to be written in such a way that the first consonant (if the first syllable has one) and the infix together appear to be a prefix.

PFX I  Y  25
PFX I  0  um   [aeiou]
PFX I  b  bum  b
PFX I  d  dum  d
PFX I  g  gum  g
. . .
PFX I  t  tum  t

The first rule is straightforward:

If the root begins with a vowel, treat the infix as a prefix and attach it to the left side of the root.

So, with uli “to ascend, go up” the result is umuli. But, the remaining rules show how to deal with roots and stems that begin with consonants.

The value in the third “column” gives the characters to remove from the beginning of the root: “b” and “d” in the first two consonant rules. We want to remove the initial letter because we will replace it with what is in the fourth column, e.g. “bum”, which is the initial consonant with the infix already in place. The fifth column specifies the condition under which we want the rule to apply. As expected, for the “bum” rule it is “b”.

Infixation

root: takder – to stand

1) akder (remove the ‘t’)
2) tum (assemble the pseudo-prefix)
3) tumakder (add to root’s left-word edge, the beginning)

With this strategy, rules simulating the process of infixation can be written and accounted for. The number of rules needed, however, is determined by the number of possible onsets in Iloko: 14 single consonants and 10 clusters, so 24 distinct consonant rules, plus the one vowel-initial rule, for the 25 declared in the header.
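Writing two dozen near-identical lines by hand invites typos, so a rule generator is an obvious helper. The Python below is a rough sketch under assumptions: `infix_rules` and `apply_rule` are names I made up, the onset list is a short sample rather than the full Iloko inventory, and `apply_rule` only imitates the strip/add/condition columns with a regular expression (it is not Hunspell’s own matcher).

```python
import re

def infix_rules(flag, affix, onsets):
    """Build PFX lines that fuse each onset with the infix as a pseudo-prefix."""
    lines = [f"PFX {flag} Y {len(onsets) + 1}",
             f"PFX {flag} 0 {affix} [aeiou]"]
    for onset in onsets:
        # Strip the onset, add onset+infix, apply only to stems starting with the onset.
        lines.append(f"PFX {flag} {onset} {onset}{affix} {onset}")
    return lines

def apply_rule(line, word):
    """Imitate one rule line: check the condition, strip, then add."""
    _, _, strip, add, condition = line.split()
    if not re.match(condition, word):
        return None
    stem = word if strip == "0" else word[len(strip):]
    return add + stem

# Sample onsets only (assumed); the real list would hold every onset.
rules = infix_rules("I", "um", ["b", "d", "g", "t"])
print("\n".join(rules))
print(apply_rule(rules[-1], "takder"))
print(apply_rule(rules[1], "uli"))
```

The last two calls walk through the same steps as the takder example: strip the “t”, assemble “tum”, and attach it to what is left of the root.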