As I wrote in another post, a Hunspell dictionary is the way to go for creating a spell-check dictionary for Iloko. Not only is implementing one relatively “easy”, but there is also a plethora of languages with dictionaries available. Because the dictionaries are open, I can gain insight into how their authors created their rules and apply the same techniques to Iloko.
According to the Hunspell manual for creating affix files and dictionary files, the first line of a dictionary file is an approximate number of entries, and the remaining lines are entries of stems and/or roots, each followed by the flags of the affix classes that apply to it. Again, the name of each file is the code for the target language, for example, “ilo.aff” and “ilo.dic” for Iloko. More information about the affix file (*.aff) can be found in this entry. For now, we will focus on the dictionary (*.dic) file and some of the issues and considerations in creating one.
Format
The format of the entries in the *.dic file is quite simple: each entry and its attributes are written on one line.
Example:
7
adda/ABC
ama/XYZ
ina/XYZ
anak/XYZ
puso/XYZ
magna/AGH
saan/KLM
In the example above, there are seven entries, and each entry sits on a single line. The affix classes that are valid for an entry follow it after a slash (“/”). By default, each affix class flag is only one character long, as in the example, e.g. “A”, “B” and “C” for the first entry. Two-character flags or numbers can also be used as affix class flags; their use is signaled in the *.aff file.
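To make the layout concrete, here is a minimal Python sketch that reads the example above. The function name parse_dic is my own invention, and the sketch assumes the default one-character flags; it ignores anything else a real *.dic file might contain.

```python
def parse_dic(text):
    """Parse the text of a .dic file into (declared_count, entries)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    declared_count = int(lines[0])  # first line: approximate number of entries
    entries = []
    for line in lines[1:]:
        # Everything before the first slash is the stem; the rest is the flags.
        stem, _, flags = line.partition("/")
        # Assumption: default flag style, so each character is one flag.
        entries.append((stem, list(flags)))
    return declared_count, entries


SAMPLE = """7
adda/ABC
ama/XYZ
ina/XYZ
anak/XYZ
puso/XYZ
magna/AGH
saan/KLM
"""

count, entries = parse_dic(SAMPLE)
print(count, len(entries))  # 7 7
print(entries[0])           # ('adda', ['A', 'B', 'C'])
```

An entry with no slash simply comes back with an empty list of flags.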
Although I have not reached the maximum number of single-character flags, I can imagine that the number of Iloko affixes, especially prefixes and reduplication (I will explain later), could reach that maximum. As a result, when I embarked on creating a spelling dictionary, I enabled two-character flags.
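In Hunspell the flag style is chosen with the FLAG option in the *.aff file: “FLAG long” selects two-character flags and “FLAG num” selects comma-separated numbers. The sketch below shows how the flag field of an entry would be split under each style. The function name split_flags and the sample flags are made up for illustration and are not my actual affix classes.

```python
def split_flags(flag_field, mode="short"):
    """Split the text after the slash into individual affix class flags."""
    if mode == "short":
        # Default style: every character is a flag.
        return list(flag_field)
    if mode == "long":
        # "FLAG long": flags are read two characters at a time.
        if len(flag_field) % 2:
            raise ValueError("two-character flags need an even-length field")
        return [flag_field[i:i + 2] for i in range(0, len(flag_field), 2)]
    if mode == "num":
        # "FLAG num": decimal numbers separated by commas.
        return [part for part in flag_field.split(",") if part]
    raise ValueError(f"unknown flag mode: {mode}")


print(split_flags("ABC"))               # ['A', 'B', 'C']
print(split_flags("AaBbCc", "long"))    # ['Aa', 'Bb', 'Cc']
print(split_flags("1,25,300", "num"))   # ['1', '25', '300']
```

With two-character flags, the same field length carries half as many flags, but the pool of distinct flags is far larger.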
Issues to Consider
As entries are added to the *.dic file, there are a few issues to consider. The first two have parallels in English, but the last one is specific to Iloko and possibly to other Philippine-type languages.
Categorization
In English, each word belongs to one or more parts of speech or lexical categories, such as “nouns”, “verbs” and “prepositions”. Iloko roots and stems can be categorized in the same way, but they are more fluid. The English root tennis is a noun and can also be used as an adjective, e.g. tennis shoes. However, unlike most nouns, it cannot take the “-(e)s” plural (*tennises), because in practice there is only one sport called “tennis”. Iloko has borrowed the word as tenis, and it can be classed as a noun as well, but it can also be verbalized with the prefix ag-, e.g. agtenis “to play tennis”, an insight into the versatility of Iloko.
There are many such roots in Iloko which are nouns that can become verbs through affixation. Even words that would normally not have an “affix” may take one, e.g. wen “yes” versus wensa “maybe yes”. Technically “-sa” is an enclitic, but Hunspell treats anything written after the stem or root as a suffix.
Great care must be taken to assign the right affix class to a root, because its semantics must be considered. Just because one verbal affix can be used with one nominal root does not mean that it will work with all of them.
Differing Stems
In English, sing changes to sang in the past tense because of ancient Germanic ablaut, or apophony. Ablaut can also change the lexical category of a word, as in abode (noun) and abide (verb). Luckily for English speakers, sang does not take any further endings, and the pair abode/abide are members of differing categories.
Iloko has roots that change as well. In Iloko, however, syncope, the loss of a unit of sound, is the most common change, because stress shifts after affixation. In a few cases, there are two root forms within the same paradigm! For example, in pa-dakkel “to make big, enlarge” the root, dakkél, changes to dakl- because of suffixation.
pa-dakl-en (neutral)
p<in>a-dakkel (perfective)
The reverse of syncope, epenthesis, or the addition of vowels or consonants for euphony, is very rare. When the prefix ipa- is affixed to serrek “entrance, work”, the result is ipastrek “to have or allow to enter”, where a “t” is inserted; ipaserrek, however, is also a permissible form.
Different stems pose a problem. They can be listed along with the original root in the *.dic file, but each must be assigned the correct affixes. Because differing stems can sometimes occur in the same paradigm, the affix class must be split into sub-classes. Another approach is to “predict” the syllable that will be deleted using the rules in the *.aff file.
In my approach, I opted to list the reduced stem instead of over-burdening the *.aff file. As a compromise, when a paradigm uses both the full root and the reduced stem (or the augmented stem, in the case of a stem with an epenthetic phoneme), each stem is added as a separate entry with the affix classes that apply to it.
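The idea of separate entries can be sketched in Python with the dakkel example. Everything here is a simplification for illustration: the flags “Pf” and “Nt” are hypothetical, and the table of prefix/suffix pairs is not Hunspell's *.aff syntax, only a stand-in for it.

```python
# Hypothetical affix classes: flag -> list of (prefix, suffix) pairs.
# The flag names and this table layout are invented for illustration.
AFFIX_CLASSES = {
    "Pf": [("pina", "")],   # perfective: p<in>a- on the full root
    "Nt": [("pa", "en")],   # neutral: pa- ... -en on the reduced stem
}

# Two separate entries, one per stem shape, each with only its own class.
ENTRIES = [
    ("dakkel", ["Pf"]),
    ("dakl", ["Nt"]),
]


def generate_forms(entries, affix_classes):
    """Return every word form licensed by the entries and their classes."""
    forms = set()
    for stem, flags in entries:
        for flag in flags:
            for prefix, suffix in affix_classes[flag]:
                forms.add(prefix + stem + suffix)
    return forms


forms = generate_forms(ENTRIES, AFFIX_CLASSES)
print(sorted(forms))  # ['padaklen', 'pinadakkel']
```

Because each stem carries only the sub-class that fits it, the unwanted combination padakkelen is never generated. One caveat: in a real Hunspell dictionary a bare entry such as dakl would itself be accepted as a word unless it is marked otherwise (Hunspell has a NEEDAFFIX option for this purpose).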
Reduplication
Reduplication is an integral process in Iloko and in the morphologies of many Philippine languages. Partial reduplication, usually of the first syllable, is the most common means of producing a different form. In English, reduplication occurs in very few words and is not a very productive process, e.g. “zig-zag”, “ping-pong”, etc.
In the approach that I’ve chosen, roots that are entirely reduplicated (e.g. sakasaka “barefoot” < saka “foot”) are listed in the *.dic file, mainly because they have a different meaning from the original root and because they do not follow any predictable pattern. Partially reduplicated stems, on the other hand, are not listed in the *.dic file. In another post, I will explain why I’ve not treated partially reduplicated stems as “stems”. Instead, they are treated as part of the affix; they are predominantly prefixes.
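As a rough illustration of what “reduplication as a prefix” could look like, the sketch below copies the first consonant-vowel-consonant of a root and writes it as a Hunspell prefix rule line (PFX flag strip add condition). This is only one possible reading, under several assumptions: the flag “Rd” is hypothetical, only consonant-initial roots are handled, digraphs such as “ng” are ignored, and the outputs are mechanical and not checked against a lexicon. The actual rules belong to the other post.

```python
VOWELS = set("aeiou")


def cvc_reduplicant(root):
    """Return the first consonant-vowel-consonant of root, or None."""
    if len(root) < 3:
        return None
    c1, v, c2 = root[0], root[1], root[2]
    if c1 in VOWELS or v not in VOWELS or c2 in VOWELS:
        return None  # simplification: other root shapes are not handled
    return root[:3]


def pfx_rule(flag, root):
    """Build one PFX rule line that prepends the copied syllable."""
    red = cvc_reduplicant(root)
    if red is None:
        return None
    # Strip nothing ("0"), add the copy, and apply the rule only to
    # words that begin with the same three letters (the condition).
    return f"PFX {flag} 0 {red} {red}"


print(pfx_rule("Rd", "dakkel"))  # PFX Rd 0 dak dak
print(pfx_rule("Rd", "saka"))    # PFX Rd 0 sak sak
print(pfx_rule("Rd", "ama"))     # None
```

In an actual *.aff file these rule lines would sit under a PFX header line that names the flag and gives the number of rules, and one rule would be needed for each distinct opening sequence.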