It looks pretty much like an ordinary dictionary -- each entry consists of one-to-many word translations. The source word is just a look-up key, while the target word is either a plain word (mostly for nouns) or a word plus some extra information (mostly for prepositions and verbs).
A target word can be described by its requirements on the previous or next words (not in the linear sentence but in the already parsed tree). This way you can distinguish, for example, the meaning of make-do from make-force. You can also add some extra information which enriches the translation where words are missing (comparing English to Polish).
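A minimal sketch of how such an entry could be modeled, assuming a requirement is nothing more than "some sub-word in the tree carries a given tag". The names `Target`, `needs_child_tag`, `DICTIONARY` and `candidates`, the tag names, and the two Polish glosses are illustrative assumptions, not Margot's actual format.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Target:
    """One candidate translation with an optional requirement on the tree context."""
    text: str
    needs_child_tag: Optional[str] = None  # hypothetical: tag a sub-word must carry

    def fits(self, child_tags: Set[str]) -> bool:
        # A target without a requirement fits everywhere.
        return self.needs_child_tag is None or self.needs_child_tag in child_tags


# The source word is only the look-up key; each key maps to one or more targets.
DICTIONARY = {
    "make": [
        Target("zmuszać", needs_child_tag="VERB"),  # make-force: "make him go"
        Target("robić"),                            # make-do: "make a cake"
    ],
}


def candidates(word: str, child_tags: Set[str]):
    """Return the target texts whose requirements are met by the sub-words' tags."""
    return [t.text for t in DICTIONARY.get(word, []) if t.fits(child_tags)]
```

With this layout the more specific target is listed first, so the make-force reading wins whenever a verb hangs below "make" in the tree.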
There is also a collocation table, but for now it holds very little data (lack of time), and there are some minor tables to make life easier for me.
First of all, I am currently focused on the parsing subengine, so all the other parts are more or less poorly done. Now, step by step...
The input line is divided into sentences; each sentence is handled separately.
Each word in the sentence is looked up in the dictionary -- if it is not found, Margot tries to guess the POS-tag for it. If that fails, the word is assumed to be a noun.
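The look-up with its two fallbacks can be sketched as follows. The text does not say how Margot guesses a tag, so the suffix heuristics here are purely an assumption, as are the names `LEXICON`, `SUFFIX_GUESSES` and `tags_for`.

```python
# Tiny stand-in lexicon: word -> possible POS-tags.
LEXICON = {"the": ["DET"], "fox": ["NOUN"], "jumps": ["VERB", "NOUN"]}

# Hypothetical suffix heuristics used only when the word is not in the lexicon.
SUFFIX_GUESSES = [("ly", "ADV"), ("ing", "VERB"), ("ous", "ADJ")]


def tags_for(word: str):
    """Return the possible tags for a word: dictionary first, then a guess, then NOUN."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    for suffix, tag in SUFFIX_GUESSES:
        if word.endswith(suffix):
            return [tag]
    return ["NOUN"]  # guessing failed: assume it is a noun
```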
For each word all possible POS-tags are assigned, and then a process of elimination tries to cut down the unlikely tags. It is a classic rule-based system. I would like to use something like Eric Brill's transformation-based learning (TBL), but again I have no time for that. I know there is a lot of potential in it (and it is really interesting stuff too, because I am eager to test GA; plain statistical POS-taggers don't appeal to me).
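A rough sketch of elimination by hand-written rules, assuming each rule looks at one position and returns a possibly smaller tag set. The single rule shown and the names `drop_verb_after_determiner`, `RULES` and `eliminate` are made up for illustration; the real rule set is not described in the text.

```python
def drop_verb_after_determiner(tags, i):
    """Hypothetical rule: a word right after an unambiguous determiner is not a verb."""
    if i > 0 and tags[i - 1] == {"DET"}:
        return tags[i] - {"VERB"}
    return tags[i]


RULES = [drop_verb_after_determiner]


def eliminate(tags):
    """Apply every rule at every position until nothing changes any more."""
    tags = [set(t) for t in tags]
    changed = True
    while changed:
        changed = False
        for i in range(len(tags)):
            for rule in RULES:
                reduced = rule(tags, i)
                # Never remove the last remaining tag of a word.
                if reduced and reduced != tags[i]:
                    tags[i] = reduced
                    changed = True
    return tags
```

Words that no rule can settle simply keep several tags, which is what makes the enumeration step necessary.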
The current sentence with its tags is now enumerated (some words still have more than one tag). Each enumeration is weighted by its likelihood (calculated from the tag probabilities) -- the most likely goes first; enumerations are independent of each other.
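The enumeration step can be sketched like this, assuming the likelihood of an enumeration is simply the product of the per-word tag probabilities (the text only says it is "calculated from" them). The function name `enumerate_readings` and the input shape are assumptions.

```python
from itertools import product
from math import prod


def enumerate_readings(tagged):
    """tagged: list of (word, {tag: probability}) pairs.

    Returns (likelihood, tag_sequence) pairs, most likely first.
    """
    options = [list(tags.items()) for _, tags in tagged]
    readings = []
    for combo in product(*options):  # one tag per word, all combinations
        likelihood = prod(p for _, p in combo)
        readings.append((likelihood, tuple(tag for tag, _ in combo)))
    readings.sort(key=lambda r: r[0], reverse=True)
    return readings
```

The number of enumerations grows as the product of the tag counts per word, which is one reason the elimination step before it matters.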
I will use the term "sentence" for each enumeration, since it is as a matter of fact a sentence. OK, the parsing step. The linear sentence is replaced by a tree -- of course some parts can be preserved. Replacement occurs only if some words depend on each other; for example, instead of the linear "brown fox" I use "fox" with the sub-word "brown". This method is described in almost any NLP textbook (however, there are differences).
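The "brown fox" example can be sketched with a single attachment rule, assuming adjectives directly before a noun become its sub-words and everything else stays linear. `Node` and `attach_adjectives` are hypothetical names, and a real parser would of course need many more rules.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A word with the sub-words that depend on it."""
    word: str
    tag: str
    children: List["Node"] = field(default_factory=list)


def attach_adjectives(tagged):
    """Turn a linear list of (word, tag) pairs into a partly tree-shaped list of nodes."""
    result, pending = [], []
    for word, tag in tagged:
        if tag == "ADJ":
            pending.append(Node(word, tag))          # wait for a noun to attach to
        elif tag == "NOUN":
            result.append(Node(word, tag, pending))  # adjectives become sub-words
            pending = []
        else:
            result.extend(pending)                   # no noun followed: keep them linear
            pending = []
            result.append(Node(word, tag))
    result.extend(pending)
    return result
```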
After that comes the English-Polish alteration of the tree -- some subtrees are illegal in Polish, so I convert them, sometimes with tag renaming (the most obvious case is "John -> was -> told" -- in Polish "tell" cannot be put in the passive voice in this manner).
From now on I assume I have a correct parse tree and I just have to choose the best translation I can get. This means that if the only translation is obviously wrong I still keep it, but I add penalty points for it. Now, based on the dictionary entries, I try to choose the target words which fit best.
All total scores are compared and the best enumeration is chosen -- no matter how bad things look. Now I simply translate word for word, adding some features missing in English -- like gender -- and... that's it.
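The penalty-and-pick logic of the last two steps can be sketched as below. The text does not say how the total score is built, so this assumes it is just the sum of penalty points, with a lower total being better; `MISFIT_PENALTY`, `choose_target`, `score` and `best_enumeration` are hypothetical names and values.

```python
MISFIT_PENALTY = 10  # hypothetical cost of keeping a translation that does not fit


def choose_target(options):
    """options: list of (target_text, fits) pairs for one word.

    Returns (text, penalty): the first fitting target, or the first target
    with a penalty when nothing fits.
    """
    for text, fits in options:
        if fits:
            return text, 0
    return options[0][0], MISFIT_PENALTY  # keep a translation, but penalize it


def score(sentence):
    """sentence: one enumeration, given as a list of per-word option lists."""
    chosen = [choose_target(options) for options in sentence]
    return [text for text, _ in chosen], sum(penalty for _, penalty in chosen)


def best_enumeration(enumerations):
    """Pick the enumeration with the lowest total penalty, however bad it is."""
    scored = [score(sentence) for sentence in enumerations]
    return min(scored, key=lambda s: s[1])
```

Because `min` returns the first minimum, ties go to the earlier enumeration, which is the more likely one if the list is ordered by likelihood.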
The most promising method for me is SMT (see some papers by Kevin Knight), but I cannot use it because of a lack of time and resources.
A classic rule-based engine still has a lot of potential, but... not with manually crafted rules. I am wondering about something self-developing, using GA, where there are no clear rules and almost everything (I am referring here to tagging+parsing) is done on the fly -- I try to analyze how I (a human) read text in English, how I disambiguate tags, how I trace back through sentences. It is like searching for the optimal tree in the solution space (of course while reading books you don't use those terms), and I guess I could use some techniques I developed while working on TSP. I'll try it out... someday.