-- Say Johnny, what would you like to get on your birthday? A book maybe?|
-- Naaah, don't wanna. I've already got one.
There is Windows executable (2005-07-13 version) only for now, however Linux version is available -- if you are interested just drop me a note. Please keep in mind it is still a PRELIMINARY version. Uk requires Gtk+ and Gtkmm runtimes.
After downloading, run the application, select© any text and click on "Correct" button. When you paste text back into your selection it should be corrected.
Uk is Polish spell-checker narrowed to errors related to dyslexia -- i.e. it corrects "ż-rz", "h-ch", "ó-u" mistakes.
It is automatic which means that it does not bother you with annoying questions and on the other hand it does not give you the chance to fix the "correction".
It is context-sensitive which means that it should not mistake words like "morze" and "może". There are (surprisingly) a lot of such pairs in Polish, just to show few of them:
|mażę -- marzę||żądzą -- rządzą||kuł -- kół||waży -- warzy||waży -- warzy||Hur -- chór|
|może -- morze||wieży -- wierzy||drug -- dróg||rurze -- róże||lud -- lód||każe -- karze|
|półki -- pułki||rud -- ród||brudno -- Bródno||Bóg -- Bug||bród -- brud|
It learns the correct spelling (see the next section) and therefore Uk still does not recognize a lot of such pairs, it also cannot recognize much of the obvious mistakes but such sentences:
✖ tak rzeby rzołnieże bóżyli hmóry bżękiem nug i huralnie rurznili muzgi rzóhwami ostrug chien i hrobotuw ✖
are rather easy to correct:
tak żeby żołnierze burzyli chmury brzękiem nóg i chóralnie różnili mózgi żuchwami ostróg hien i chrobotów.
Note however that both
✖ a nórz pojadę morze nad może jórz ✖
✖ a nórz pojadę może nad może jórz ✖
just leave the first "może/morze" untouched -- Uk cannot decide which one is better and in such case it does not do anything. The equal estimation of spelling is due to small context dictionary (below you can see the result of the latter one):
a nuż pojadę może nad morze już.
I plan to use Uk as a plugin for Usenet programs -- for writing users to ensure better quality of postings and for readers to protect their eyes :-) from bizarre pigeon-Polish which is hard (and painful) to read but also which is in long run a threat to one (at least mine) spelling skills.
Well, the core stuff is preparing the dictionary and at first it looks fairly easy to do
-- besides I had a bad month or two plus too much stress plus curse
of some kind had to be involved :-)) (that is why I started it in spring and in summer
it is still quasi-demo) I was obviously misled with the victory-almost feeling.
The dictionary is not just a simple list of words, it is based on real text so after getting the Polish corpus I extract only words. Then I can build n-grams such long as it makes the difference for prefixes and suffixes to distinguish words with opposite spelling -- for prefixes it could be "d#mo[rz]" vs "ć#mo[ż]" ("#" is word separator and the key characters are marked with square brackets).
It is the first challenge since such amount of data organized in nice structures very fast exhausts
memory in my computer so it has to be well tuned. Ok, we get n-grams but it does not really help
a lot -- the n-grams file is about 50% of initial corpus!
Now it is time for "compression" (see below) -- the result is about 1.2MB big. I think it suits all modern and a bit older computers.
There are however still unresolved problems -- ignoring words with errors in the first place (there are often put as an example of bad writing like "do not write »rzółw«", there are also names which are mispelled intentionally like "Formacja Nieżywych Schabuff") and learning of new words (the corpus I have is not sufficient for every possible word in Polish). I cannot do anything with the latter (I can only excuse myself telling you to think about Uk as a baby) but the first one is quite interesting.
It has nothing to do with normal compression -- it is just pruning the data which is unnecessary. For words like "mo[rz]e" and "mo[ż]e" you have to know the context but there are many words for which the context can be empty. And more -- you can just take a few letters and program still would work fine. For example -- "fakt[ó]w" and "fakt[u]r"; you do not have to store whole words, here only the first letter from the suffix makes the difference. This way you get just "[ó]w" and "[u]r".
It is tempting (and indeed I fell for it) to consider the pruning
from pure algorithmic perspective. One could easily notice that
optimal pruning (i.e. resulting in smallest possible n-grams data) has
complexity O(N!) where N is the number of words (there is no point in pruning n-grams
consisting of word+context since they already need more data). Then
the suboptimal algorithm is worth considering -- for example ordering
pruning depending where bigger savings are -- for "ch/h" one could
gain more by first pruning prefixes not suffixes.
Such attitude neglects the fact we are dealing with live data -- in Polish due to inflected forms you have much more suffixes than prefixes: "żu[ch]wa", "żu[ch]wę", "żu[ch]wie", "żu[ch]wami", etc. so it is more appropriate to cut suffixes first. You could argue "don't you know all those inflected forms and if you do is it not clear that any method should show ~1:14 saving in favor of suffixes?". Yes and no :-). Yes, I do know inflections but program does not. The sources are not perfect -- they are both small and messy. All the program see is (for example) "żu[ch]wie" and "żu[ch]wę", there is no way (in such simply program) you can properly match this against "żu[ch]wa" if you cut prefixes instead of suffixes.
So the algorithm is pretty straightforward -- it cuts suffixes (O(N)), then reverses n-grams (O(N)) and cuts prefixes (O(N2)).
Considering algorithmic side I would say Uk is well settled now, of course there are
several annoyances to fix I am aware of like extending dictionary to handling ASCII-7bit alphabet
(very popular on Usenet). Smart context recognition would be nice too -- when you type "hur"
Uk changes it to "chór" because the case here has higher priority. However it is not clear it is correct,
changing the case is just 1 letter. This shortcoming it is even more clear for "ben hur".
Once you start writing any program the "to do" list becomes endless -- it is common rule and I do not want to be trapped in minor fixes. Now it is a good time to build companion program to automatically learn Polish morphology. Filtering the corpus and accepting only well-formed words is another interesting task.