Thursday, May 17, 2018

Iloko Spell Check: Status

It has been a very long time since I posted about the Iloko spell check.

Hunspell

As a refresher, the spell check is a set of rules for Hunspell, a spell-checker and morphological-analyzer library (Wikipedia, GitHub), which uses them to verify the spelling of words in a specific language or language variant. Spell-check dictionaries are already available for many languages. Hunspell is integrated into many software titles, some of which you may already be using. One that comes to mind is Mozilla Firefox. (I already have an Add-On for Firefox in the works, but that is for another post.) Two others are LibreOffice and OpenOffice.org, both productivity suites for word processing, database management, slideshows, etc.

Files

In order for Hunspell to verify words, two files are needed. These are plain-text files and can be read and written with any text editor. Once you understand how the files are formatted, it is easy to see how everything relates.

The first of the files is the “Dictionary” file (ilo.dic). It is the simpler of the two and contains a list of “entries”, roots or stems, that the spell-checker uses to recognize or build correctly spelt words. Each entry can be associated with “flags”, or tags, that specify attributes of the entry or operations that can be performed on it. For example, a flag can specify that the entry is only valid if capitalized. So, “Philippines” is valid, but “philippines” is not. (Even in my application, the latter is underlined with the ubiquitous “red, squiggly line” indicating that it is incorrect.) Or, the entry can be flagged as one that the spell-checker should recognize but never suggest. These are usually “non-standard” or offensive words. But the majority of the flags refer to “affixes” that can be applied to create valid words. These affixes are specified in the other file.
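To make the layout concrete, here is a minimal Python sketch that reads dictionary-style text into a lookup table. It assumes the usual Hunspell layout (an approximate entry count on the first line, then one "word/FLAGS" entry per line) and one-character flags. The sample entries and the flag letters A and B are made up for illustration and are not taken from ilo.dic.

```python
def parse_dic(text):
    """Parse Hunspell-style dictionary text into {word: set_of_flags}."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries = {}
    # The first line of a .dic file holds an approximate entry count; skip it.
    for line in lines[1:]:
        # Everything after the first "/" is the flag string (may be empty).
        word, _, flags = line.partition("/")
        entries[word] = set(flags)  # assumption: one character per flag
    return entries


# Hypothetical sample; the flags A and B stand in for affix classes.
SAMPLE_DIC = """\
3
balay/AB
Philippines
ken
"""

entries = parse_dic(SAMPLE_DIC)
print(sorted(entries["balay"]))
print(entries["ken"])
```

The sketch ignores details such as escaped slashes, multi-character flag types and morphological fields, so it is only a reading aid, not a full parser.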

The other file is the “Affix” file (ilo.aff). As the name suggests, it contains information about the affixes used in the language. It also includes other basic information about the target language, such as the language (or variant) name and which script the language is written in. But the bulk consists of “affix classes”, groups of rules that specify how affixes combine with entries in the dictionary file. English examples are the ‹-ed› suffix for the past tense and the ‹re-› prefix that shows repetition. Affixes can be as simple as those two, but something like the ‹-s› plural requires a few rules to specify how it is applied. So, there are rules to turn ‹y› into ‹ie›, or to add ‹e› after a word that ends in ‹ch›, ‹s›, ‹x› or ‹z›, before adding ‹s›.
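The English plural example can be sketched in Python as a small rule table. Each rule imitates the three working parts of a Hunspell suffix rule: what to strip from the end of the word, what to add, and a condition the word must satisfy. This is an illustration under simplifying assumptions, not how Hunspell is implemented: the first matching rule wins, the names PLURAL_RULES and pluralize are my own, and ‹sh› is added to the condition for completeness.

```python
import re

# (strip, add, condition): remove "strip" from the end of the word, append
# "add", but only if the word matches the "condition" regular expression.
PLURAL_RULES = [
    ("y", "ies", r"[^aeiou]y$"),     # consonant + y: city -> cities
    ("", "s", r"[aeiou]y$"),         # vowel + y: day -> days
    ("", "es", r"(ch|sh|s|x|z)$"),   # church -> churches, box -> boxes
    ("", "s", r"."),                 # fallback: cat -> cats
]


def pluralize(word):
    """Apply the first plural rule whose condition matches the word."""
    for strip, add, condition in PLURAL_RULES:
        if re.search(condition, word):
            stem = word[: len(word) - len(strip)] if strip else word
            return stem + add
    return word  # only reached for an empty string


for w in ["city", "day", "church", "box", "cat"]:
    print(w, "->", pluralize(w))
```

Even this toy version needs four rules for a single affix, which is why an affix class is a group of rules rather than one line.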

Iloko

After a few attempts, I believe that I’ve struck upon a system to organize the affixes in the affix file for Iloko. Simply put, it uses a divide-and-conquer methodology.

Reduplication, a process where part or all of a root or stem is repeated, is a major consideration and contributes to the size of the file. Since Hunspell does not have a built-in mechanism for it, rules targeting specific syllables are needed. In my latest scheme, sets of reduplicated syllables are treated as ordinary affix classes, or “duplifixes”: repeated parts of a stem that carry meaning. The affixes “proper” constitute their own affix classes. (More about this in another post.)
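To show why syllable-specific rules add up quickly, here is a rough Python sketch that writes one Hunspell-style prefix rule per consonant-vowel-consonant syllable. It is not the scheme used in ilo.aff: the flag R, the tiny sound inventory and the function names are assumptions for illustration, and only one reduplication pattern (copying the first CVC of the stem) is covered.

```python
from itertools import product


def cvc_prefix_rules(flag, consonants, vowels):
    """Build Hunspell-style PFX lines, one per possible CVC syllable."""
    syllables = ["".join(p) for p in product(consonants, vowels, consonants)]
    # Header line: PFX <flag> <cross-product Y/N> <number of rules>
    header = f"PFX {flag} Y {len(syllables)}"
    # "0" strips nothing. The syllable is both the prefix that gets added
    # and the condition that the start of the stem must match.
    body = [f"PFX {flag} 0 {s} {s}" for s in syllables]
    return [header] + body


def reduplicate_cvc(stem, consonants, vowels):
    """Copy the first CVC of a stem to its front; None if it has no CVC."""
    if (len(stem) >= 3 and stem[0] in consonants
            and stem[1] in vowels and stem[2] in consonants):
        return stem[:3] + stem
    return None


# Toy inventory (hypothetical, far smaller than a real language's).
TOY_CONSONANTS = "bdkl"
TOY_VOWELS = "ai"

rules = cvc_prefix_rules("R", TOY_CONSONANTS, TOY_VOWELS)
print(len(rules) - 1)  # 4 * 2 * 4 = 32 rules
print(rules[1])        # PFX R 0 bab bab
print(reduplicate_cvc("kalding", TOY_CONSONANTS, TOY_VOWELS))  # kalkalding
```

The rule count is the product of the inventory sizes, so every extra consonant or vowel multiplies the size of the class.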

Currently, the affix file is rather extensive and considerably larger than even Hungarian’s (77 KB), a language notorious for its agglutinative morphology. Iloko’s stands at five MB and takes time to load. The culprit is reduplication, as mentioned previously. It is very productive in Iloko and signals various grammatical categories (e.g., pluralization, aspect, etc.), and as such, separate rules are needed to match reduplicated syllables. Many of the affix classes have as many as 1,200 rules. One option under consideration is to eliminate some of the affixes and list the words they create (or predict) as separate entries in the dictionary file. Nevertheless, the current scheme appears to be working after some testing. As I continue testing, anything that is awry is diagnosed and fixed. So far, I haven’t found any further issues with the affixes or with how their rules are specified.

The dictionary file is slowly growing as I add entries and associate the applicable affixes. But, this is a meticulous task because not all roots or stems can take all affix classes. I am using Carl Rubino’s Ilokano Dictionary since it is root-based and takes away any guessing. At the same time, I have to rely on my own recollection (good or bad) to assign affix classes. Currently, there are about 3,000 entries. They include the following closed classes:

  • Functional words (conjunctions, adpositions, the ligature)
  • Determinatives (e.g., “this”, “those”, “the”)
  • Pronouns (e.g., “I”, “them”)
  • Interjections (e.g., “Susmaryosep!”, “Ukinnamon!”)
  • Numbers (including Spanish numbers) and quantifiers (e.g., “all”, “some”, “kilo”)
  • Time-related words (days of the week, months, general time: “day”, “night”)

Other classes included:

  • Irregular formations (mostly verbs and quasi-verbs)
  • Kinship terms with irregular plurals
  • Countries and regions

Other roots and stems are being added as I have time.

My original plan for adding entries to the dictionary file was to find the most commonly used roots in Iloko literature. Instead of creating a corpus (a collection of texts) myself, I decided to approach Glosbe, which has an Iloko corpus. I sent them an email describing my aim over a month ago, and I have yet to receive a response. In the meantime, I am investigating statistical models and Iloko blogs whose content I can use in my analysis, to focus and refine my efforts rather than entering everything in Rubino’s dictionary.
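As a first step toward that kind of analysis, a frequency count over collected text could look like the Python sketch below. It counts surface forms only. Mapping each form back to its root is the harder part and is not shown. The function name and the sample tokens are placeholders, not real corpus data.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lower-cased word tokens (letters, with optional inner hyphens)."""
    tokens = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    return Counter(tokens)


# Placeholder tokens standing in for text gathered from blogs or a corpus.
SAMPLE_TEXT = "Balay ken balay, ken ubing."

freqs = word_frequencies(SAMPLE_TEXT)
for word, count in freqs.most_common():
    print(word, count)
```

Sorting entries by such counts would let the most frequent words be added to the dictionary file first.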

Goals:

  • 1st public beta - 30-50% recognition
  • 2nd public beta - 50%+ recognition
  • 3rd public beta - 60%+ recognition
  • nth public beta - increasing recognition
  • Version 1.0 - 80%+
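One way to measure progress against these goals is sketched below in Python. It assumes that “recognition” means the share of word tokens in a sample text that the checker accepts. The post does not define it further, so treat that as my working definition. The is_known callback is a stand-in for whatever actually checks a word, here just a small made-up set.

```python
def recognition_rate(tokens, is_known):
    """Return the fraction of tokens for which is_known(token) is true."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    recognized = sum(1 for token in tokens if is_known(token))
    return recognized / len(tokens)


# Hypothetical stand-ins for the dictionary and a tokenized sample text.
known_words = {"balay", "ken"}
sample_tokens = ["balay", "ken", "ubing", "balay"]

rate = recognition_rate(sample_tokens, known_words.__contains__)
print(f"{rate:.0%}")  # 75%
```

Counting tokens rather than distinct words rewards covering the most frequent words first, which matches the frequency-driven plan above.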

With those numbers in mind, I’ve planned on having the first beta by the middle of June 2018. It will take the form of an Add-On for Mozilla Firefox, an extension for LibreOffice (6.1.x) and a dictionary for DSpellCheck, a plug-in for Notepad++. As of right now, the files are posted to GitHub (https://github.com/joemaza/spellcheck/tree/master/dictionaries/ilo) if you would like to view them.
