Machine Translation: A Brief Discussion

Machine translation has been a goal of computer science for almost as long as artificial intelligence has. Both seem like perfect tasks for computers. Take a sentence in the source language, look up each word and find its equivalent in the target language. Then rearrange the found words into the order the target language expects. Thus we have machine translation. Easy!

As it turns out, it is not easy. Computer scientists have been working on this problem for 60 years and it still does not work properly. Language is devilishly complex and constantly changing, and it conveys subtle meanings and emotions that computers just do not understand. I asked a translator friend of mine to go through the steps he uses when translating a document.

1) “Firstly,” he said, “if you are not familiar with the subject matter of the source material, you do not have a hope.” Basically, even if you speak two languages fluently, that does not make you a good translator. You have to have a good knowledge of the subject the text is about. Obvious really. If you know nothing about nuclear reactors, how can you translate a safety manual for one?

2) “You have to read the source document several times to make sure you understand all of the terms in it.” This can be very time-consuming, especially with technical material, since you have to look up each new term to get a firm understanding of what it means.

3) “Once you know all the terms in a document, only then can the translation begin. Each sentence is read and its meaning understood, then a sentence is written that conveys the same meaning as the source.” Not the equivalent words, the equivalent MEANING.

When you think about this process in computer terms it means the following:

Computers must have a comprehensive dictionary of words and terms from general language and all specialist fields. This is a tall order by itself; getting translations for every specialist subject could easily prove impossible.

Computers must be able to understand a sentence on a deep level. The computer must know the difference between a reactor (nuclear), a reaction (from a person), a reaction (chemical) and a reactor (chemical). I’m betting that a large percentage of you out there have no idea what a “reactor” is when used in chemistry. If you as a person do not have a complete grasp of all the meanings of various words, it becomes easy to see why computers have such a hard time.
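To make the ambiguity problem concrete, here is a minimal sketch of a shallow way a program might guess which "reactor" is meant: count how many cue words for each sense appear in the sentence. The sense list, the cue words and the function name pick_sense are all invented for illustration. This is keyword matching, not understanding, and it is not how any particular system works.

```python
# Hypothetical sense inventory: each sense lists cue words that hint at it.
SENSES = {
    "reactor": {
        "nuclear reactor": {"nuclear", "fuel", "uranium", "core", "radiation"},
        "chemical reactor": {"chemical", "catalyst", "polymer", "vessel", "solvent"},
    },
}


def pick_sense(word, sentence):
    """Return the sense whose cue words overlap most with the sentence, or None."""
    context = set(sentence.lower().replace(".", "").split())
    best_sense, best_overlap = None, 0
    for sense, cues in SENSES.get(word, {}).items():
        overlap = len(cues & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(pick_sense("reactor", "The catalyst is added to the reactor vessel."))
print(pick_sense("reactor", "The reactor was shut down."))
```

The first call picks the chemical sense because two cue words match. The second sentence contains no cue words at all, so the function returns None, which is the situation where a human translator falls back on knowing the subject.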

Computers must be able to generate all the sentences a person can. Think about that one for a minute. Assuming English has up to 1 million words (including plurals, place names etc.), then given a typical sentence of 10 words, there are hundreds of billions (or more) of possible sentences that can be made. If a computer is going to be able to translate effectively it must be able to generate this number of sentences.
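As a rough upper bound on that figure, suppose any of the 1 million words could appear in each of the 10 positions. The short calculation below, a sketch with an illustrative function name, counts those word sequences. Most of them are not grammatical sentences, so this is a ceiling and not a count of real sentences.

```python
def count_sequences(vocabulary_size, sentence_length):
    """Number of word sequences if every position may hold any word."""
    return vocabulary_size ** sentence_length


total = count_sequences(1_000_000, 10)  # figures taken from the text
print(f"about 10^{len(str(total)) - 1} word sequences")  # about 10^60 word sequences
```

Even if only a tiny fraction of these sequences make sense, listing them one by one is out of the question, so a system has to build sentences rather than look them up.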

You should by now have an idea why nobody has produced a system that works. Over the years there have been two main avenues of research: statistical and rule-based translation. Both have seen progress over the years, and both methods are used in most production systems today. The statistical system is the easiest one to test out. Its method is simple. Look up the source words as individual words and as groups. Find the translation of each word or group. Rearrange the translated words based on their likelihood of following each other, using statistics culled from studying word order in the target language.

This works quite well for simple sentences, but as soon as you try it with longer sentences the output quickly becomes incomprehensible. Why? The statistical approach is fundamentally flawed. Word order is not controlled by likelihood; it is controlled by the meaning of each word. So to accurately predict which word can come next, you have to know what the previous words actually mean. This basic fact means that all statistical systems will NEVER work.
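The statistical recipe above can be sketched in a few lines. This is a toy under stated assumptions: the phrase table and the word-pair counts are invented, only single words are looked up (not groups), and every possible ordering is tried, which is only feasible for very short sentences. Real statistical systems are far more elaborate.

```python
from itertools import permutations

# Hypothetical toy data: word translations and target-language word-pair counts.
PHRASE_TABLE = {"le": "the", "chat": "cat", "noir": "black"}
BIGRAM_COUNTS = {
    ("<s>", "the"): 5,
    ("the", "black"): 3,
    ("black", "cat"): 4,
    ("the", "cat"): 2,
    ("cat", "</s>"): 5,
}


def score(words):
    """Multiply (count + 1) over adjacent word pairs; unseen pairs contribute 1."""
    padded = ["<s>", *words, "</s>"]
    total = 1
    for pair in zip(padded, padded[1:]):
        total *= BIGRAM_COUNTS.get(pair, 0) + 1
    return total


def translate(source_sentence):
    """Translate word by word, then keep the ordering with the highest score."""
    translated = [PHRASE_TABLE.get(w, w) for w in source_sentence.lower().split()]
    return " ".join(max(permutations(translated), key=score))


print(translate("le chat noir"))  # the black cat
```

Note that the program never knows what a cat is. It prefers "the black cat" over "the cat black" only because the first ordering has higher pair counts.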

The other area is rule-based translation. This assigns a word type (noun, pronoun etc.) to each of the input words, then translates the words. Then, using the word-order rules of the target language, it rearranges the translated words to give an output sentence. This is a more precise approach but it still has a number of problems. It is very difficult to know the exact position of any given word unless you know all the other words used in the sentence.

This is a massive task since, as I explained before, there are billions of potential sentences, and having rules that fit all of them is unlikely. And what do you do with unknown words? If you come across a word unknown to the system, deciding what type of word it is becomes tricky and will almost certainly lead to incorrect word order.
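Here is a minimal sketch of the rule-based steps just described: tag each word with a type, translate it, then reorder using a target-language rule. The lexicon, the "UNK" tag for unknown words and the single reordering rule (an adjective that follows a noun is moved in front of it) are assumptions made up for this example, not the rules of any real system.

```python
# Hypothetical toy lexicon: source word -> (word type, target word).
LEXICON = {
    "le": ("DET", "the"),
    "chat": ("NOUN", "cat"),
    "noir": ("ADJ", "black"),
    "dort": ("VERB", "sleeps"),
}


def tag_and_translate(sentence):
    """Assign a word type to each source word and translate it; unknown words get UNK."""
    return [LEXICON.get(w, ("UNK", w)) for w in sentence.lower().split()]


def reorder(tagged):
    """Apply one word-order rule: swap NOUN + ADJ into ADJ + NOUN."""
    result = list(tagged)
    i = 0
    while i < len(result) - 1:
        if result[i][0] == "NOUN" and result[i + 1][0] == "ADJ":
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2
        else:
            i += 1
    return result


def translate(sentence):
    return " ".join(word for _, word in reorder(tag_and_translate(sentence)))


print(translate("le chat noir dort"))  # the black cat sleeps
print(translate("le chat zorg"))       # the cat zorg
```

The second call shows the unknown-word problem. The made-up word "zorg" has no type, so no rule fires and it stays where it was, whether or not that is the right place for it.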

Most of the current systems actually use a mix of both of these approaches and whilst they can produce results, they will never give the results we all want. Why is that? It is because the human brain does not work this way. Word order is not dictated by statistics or by word type. Word order is dictated by word meaning and context.

The only way that computer translation is going to work is if it understands what each word is and how it fits in the world.

In short, machine translation will only work when computers understand the world.

We have now wandered into the world of artificial intelligence, and it is true that the two fields are deeply entwined. As I said when I started, machine translation and artificial intelligence were among the original goals of computer science. As it turns out, they are the same thing.

I am aware of course that it is easy to talk about AI and MT. The proof, as they say, is in the pudding. I am working on a number of lines of research in both MT and AI and will be uploading software over the next few months that shows my progress so far.

Thanks

Daniel Burke.

Daniel Burke is an IT consultant and AI researcher.

http://www.danielburke.com

Daniel.burke@yahoo.com
