Abstract

In this paper, we describe the current state of the art in Statistical Machine Translation (SMT) and reflect on how SMT handles meaning. Statistical Machine Translation is a corpus-based approach to MT: it derives the knowledge required to generate new translations from corpora. General-purpose SMT systems do not use any formal semantic representation. Instead, they directly extract translationally equivalent words or word sequences – expressions with the same meaning – from bilingual parallel corpora. All statistical translation models are based on the idea of word alignment, i.e., the automatic linking of corresponding words in parallel texts. The first-generation SMT systems were word-based. From a linguistic point of view, the major problem with word-based systems is that the meaning of a word is often ambiguous and is determined by its context. Current state-of-the-art SMT systems try to capture local contextual dependencies by using phrases instead of words as units of translation. To solve more complex ambiguity problems (where a broader text scope or even domain information is needed), a Word Sense Disambiguation (WSD) module is integrated into the Machine Translation environment.
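Word alignment of the kind described above is typically learned with expectation–maximization over a parallel corpus. As a rough illustration only (the paper does not specify a particular model; the toy corpus and all names below are hypothetical), the following sketch estimates IBM Model 1-style word-translation probabilities t(f|e) and uses them to link a source word to its most likely target word:

```python
from collections import defaultdict

def train_ibm_model_1(corpus, iterations=10):
    """Estimate word-translation probabilities t(f|e) with EM
    (a simplified IBM Model 1). `corpus` is a list of
    (source_words, target_words) sentence pairs."""
    # Uniform initialisation over all co-occurring word pairs
    t = defaultdict(lambda: 1e-6)
    f_vocab = {f for fs, _ in corpus for f in fs}
    uniform = 1.0 / len(f_vocab)
    for fs, es in corpus:
        for f in fs:
            for e in es:
                t[(f, e)] = uniform
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # marginal counts for each e
        for fs, es in corpus:
            for f in fs:
                # Distribute f's probability mass over the target words
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / z
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise so that sum_f t(f|e) = 1
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Toy English–German parallel corpus (hypothetical)
corpus = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
    (["a", "book"], ["ein", "buch"]),
]
t = train_ibm_model_1(corpus)
# "book" aligns most strongly with "buch"
best = max(["das", "buch", "ein", "haus"], key=lambda e: t[("book", e)])
print(best)  # → buch
```

Even on this tiny corpus, EM resolves the ambiguity: "buch" co-occurs with both "the" and "a", but only "book" appears in both of its sentence pairs, so the probability mass concentrates on the correct pair.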

Highlights

  • Statistical Machine Translation (SMT) is one of the best-performing corpus-based approaches to natural language processing (NLP)

  • Although phrase-based SMT systems perform significantly better than word-based systems, they still face many problems

  • Significant improvements on standard MT evaluation metrics have been reported after integrating a disambiguation module into a phrase-based SMT system

Summary

Introduction

Statistical Machine Translation (SMT) is one of the best-performing corpus-based approaches to natural language processing (NLP). The idea of linking the meaning of a word to its context has a long history that starts with the distributional theory of meaning, which links the meaning of a word to its distribution and further states that two words are distributionally similar if they appear in similar contexts. This theory of meaning goes back to Harris’ Distributional Hypothesis (Harris 1968), suggesting a direct link between distributional similarity and semantic similarity: two words that tend to occur in similar contexts tend to have similar meanings. This idea is exploited by lexicographers today, who use corpus evidence for creating dictionaries. The remainder of the paper shows how different generations of Machine Translation systems have tackled the major problems MT is confronted with.
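The Distributional Hypothesis can be made concrete by representing each word as a vector of the words it co-occurs with and comparing those vectors. A minimal sketch (the toy corpus and the window size of 2 are illustrative assumptions, not from the paper):

```python
from collections import Counter
from math import sqrt

def context_vectors(sentences, window=2):
    """Count the words appearing within `window` positions of each
    word, giving a bag-of-words context vector per word."""
    vectors = {}
    for sent in sentences:
        for i, w in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            vectors.setdefault(w, Counter()).update(ctx)
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Toy corpus: "coffee" and "tea" occur in similar contexts
sents = [
    ["she", "drinks", "hot", "coffee", "every", "morning"],
    ["he", "drinks", "hot", "tea", "every", "morning"],
    ["the", "car", "needs", "new", "tyres", "today"],
]
vecs = context_vectors(sents)
print(cosine(vecs["coffee"], vecs["tea"]) > cosine(vecs["coffee"], vecs["car"]))  # → True
```

Here "coffee" and "tea" share the contexts {drinks, hot, every, morning} and so come out as distributionally (and, per Harris, semantically) similar, while "coffee" and "car" share no contexts at all.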

Machine Translation
Statistical Machine Translation
Word Alignment
Phrase-based statistical Machine Translation
Word Sense Disambiguation in Statistical Machine Translation
Word Sense Disambiguation
Word Sense Disambiguation approaches
Findings
Conclusion