[Pkg-ime-devel] RFS: scim-waitzar, libwaitzar (re-submission) Attn: Paul Wise
sorlok_reaves at yahoo.com
Wed Jan 21 07:09:08 UTC 2009
> So which of these are used for creating the Myanmar.model?
None of them, actually. Creating Myanmar.model takes a few steps:
1) Copy all Burmese words from Myanmar_List_v2.txt into Myanmar.model
2) For each word, create and store a reverse-look-up in Myanmar.model
(The next few steps are optional)
3) For a given corpus, scan each word and count its frequency. Then, compute bigram and trigram frequencies. (I currently use a Java script for this).
4) Prune out uni/bi/trigrams which are considered "useless" (a matter of opinion; again, I use a Java script to help me). Store the remaining uni/bi/trigrams in Myanmar.model.
(The next steps are quality assurance)
5) Go over the model by hand, checking for errors and out-of-order encoding.
6) Use the KaNaung code to convert each word into our three output encodings. Visually check that these all look the same.
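To make steps 3 and 4 a bit more concrete, here is a rough sketch of the counting and pruning idea. It is not the Java helper script mentioned above, just an illustration in Python. It assumes the corpus has already been segmented into a list of words, and the function names, the placeholder tokens, and the `min_count` threshold are all made up for this example; the real pruning rule is a matter of opinion, as noted in step 4.

```python
from collections import Counter


def count_ngrams(words, max_n=3):
    """Count every run of 1..max_n consecutive words in a segmented corpus.

    Returns a dict mapping n to a Counter keyed by n-word tuples.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n over the word list.
        for i in range(len(words) - n + 1):
            counts[n][tuple(words[i:i + n])] += 1
    return counts


def prune(counts, min_count=2):
    """Drop n-grams seen fewer than min_count times.

    The threshold is a hypothetical stand-in for whatever "useless" means
    for a given model.
    """
    return {
        n: Counter({gram: c for gram, c in ctr.items() if c >= min_count})
        for n, ctr in counts.items()
    }


# Placeholder romanized tokens standing in for segmented Burmese words.
corpus = ["ko", "ko", "la", "ko", "ko", "la", "pyi"]
counts = count_ngrams(corpus)
kept = prune(counts, min_count=2)
```

With this toy corpus, `kept[3]` holds only the trigram `("ko", "ko", "la")`, since every other trigram occurs once. Writing the surviving entries into Myanmar.model is left out, because the model file format is not described here.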
Two unattractive properties of this process:
1) It requires a lot of manual intervention (for QA, which I feel is important).
2) The Java scripts I use were written before Unicode 5.1 came out, so I used the Zawgyi-One encoding internally. This encoding is non-standard, and requires a great deal of knowledge to use properly. (This is one of the main reasons I am not comfortable releasing my Java helper scripts: I don't feel right promoting the use of a broken, non-standard encoding.)
I suppose a long-term goal would be to release a set of Unicode 5.1, waitzar-specific trigram generators. However, this is really just a pipe dream for now: it would be a huge amount of work, with very little benefit.
PS: When I say "a Java script", I mean "a script written in the Java programming language".