Abstract
Most modern statistical machine translation systems are based on linear statistical models. One extremely effective method for estimating the model parameters is minimum error rate training (MERT), which is an efficient form of line optimisation adapted to the highly nonlinear objective functions used in machine translation. We describe a polynomial-time generalisation of line optimisation that computes the error surface over a plane embedded in parameter space. The description of this algorithm relies on convex geometry, which is the mathematics of polytopes and their faces. Using this geometric representation of MERT we investigate whether the optimisation of linear models is tractable in general. Previous work on finding optimal solutions in MERT (Galley and Quirk, 2011) established a worst-case complexity that was exponential in the number of sentences; in contrast, we show that the exponential dependence in the worst-case complexity is mainly in the number of features. Although our work is framed with respect to MERT, the convex geometric description is also applicable to other error-based training methods for linear models. We believe our analysis has important ramifications because it suggests that the current trend of building statistical machine translation systems with very large numbers of sparse features is inherently not robust.
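As a rough illustration of the line-optimisation step described above, the sketch below (our own toy construction, with invented hypothesis features and error counts rather than anything from the paper) computes the exact error surface of a single sentence along a line w0 + gamma * d in parameter space: each hypothesis contributes a line in gamma, the upper envelope of those lines says which hypothesis the decoder would pick on each interval, and the interval with the lowest error yields the new gamma.

```python
# Minimal sketch of one MERT line search for a single sentence (illustrative only;
# hypothesis features, errors, and the direction d below are made-up examples).
import numpy as np

def upper_envelope(slopes, intercepts):
    """Return [(gamma_from, hyp_index), ...]: on each interval of the line parameter
    gamma, the hypothesis with the highest score intercept + gamma * slope."""
    # for each distinct slope keep only the hypothesis with the largest intercept
    best_for_slope = {}
    for i, s in enumerate(slopes):
        if s not in best_for_slope or intercepts[i] > intercepts[best_for_slope[s]]:
            best_for_slope[s] = i
    order = sorted(best_for_slope.values(), key=lambda i: slopes[i])
    env = []  # list of (gamma at which hypothesis i starts to win, i)
    for i in order:
        while env:
            g_prev, j = env[-1]
            # gamma at which hypothesis i overtakes hypothesis j
            g = (intercepts[j] - intercepts[i]) / (slopes[i] - slopes[j])
            if g <= g_prev:
                env.pop()          # j never wins on a non-empty interval
            else:
                env.append((g, i))
                break
        if not env:
            env.append((float("-inf"), i))
    return env

def line_search(w0, d, H, errors):
    """H is an (n_hyps x D) feature matrix, errors[i] the error count of hypothesis i.
    Returns a gamma minimising the error along the line w0 + gamma * d."""
    intercepts, slopes = H @ w0, H @ d
    env = upper_envelope(list(slopes), list(intercepts))
    best_gamma, best_err = None, None
    for k, (g, i) in enumerate(env):
        g_next = env[k + 1][0] if k + 1 < len(env) else None
        if g == float("-inf"):
            mid = (g_next - 1.0) if g_next is not None else 0.0
        elif g_next is None:
            mid = g + 1.0
        else:
            mid = 0.5 * (g + g_next)
        if best_err is None or errors[i] < best_err:
            best_gamma, best_err = mid, errors[i]
    return best_gamma

# toy example: 3 hypotheses, 2 features
H = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
errors = [2, 0, 1]
print(line_search(np.array([0.1, 0.1]), np.array([1.0, -1.0]), H, errors))  # -> -1.0
```

Full MERT accumulates such per-sentence envelopes over the whole tuning set and over many search directions; the sketch only shows the single-sentence, single-direction step.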
Highlights
The linear model of Statistical Machine Translation (SMT) (Och and Ney, 2002) casts translation as a search for translation hypotheses under a linear combination of weighted features: a source language sentence f is translated as

e(f; w) = argmax_e { w h(e, f) }    (1)

where translation scores are a linear combination of the D × 1 feature vector h(e, f) ∈ R^D under the 1 × D model parameter vector w (a toy decoding sketch for this rule is given after these highlights). Convex geometry (Ziegler, 1995) is the mathematics of such linear equations, presented as the study of convex polytopes
Using this geometric representation of minimum error rate training (MERT) we investigate whether the optimisation of linear models is tractable in general
We use convex geometry to show that the behaviour of training methods such as MERT (Och, 2003; Macherey et al., 2008), MIRA (Crammer et al., 2006), Pairwise Ranking Optimisation (PRO) (Hopkins and May, 2011), and others converges as the feature dimension becomes high
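The decoding rule in Eq. (1) can be made concrete with a small sketch; the candidate translations, feature values, and weights below are invented for illustration and are not taken from the paper.

```python
# Toy illustration of Eq. (1): pick the hypothesis whose feature vector scores
# highest under the weight vector w (all names and numbers are assumptions).
import numpy as np

def decode(hypotheses, w):
    """hypotheses: list of (translation_string, D-dim feature vector h(e, f)).
    Returns the translation maximising the linear score w . h(e, f)."""
    return max(hypotheses, key=lambda eh: float(np.dot(w, eh[1])))[0]

# toy candidate list for one source sentence, with D = 3 features
# (e.g. translation model, language model, length penalty)
candidates = [
    ("the house is small", np.array([-2.1, -1.3, 4.0])),
    ("the house is little", np.array([-2.5, -0.9, 4.0])),
    ("small the house",     np.array([-1.8, -3.0, 3.0])),
]
w = np.array([0.6, 1.0, 0.1])
print(decode(candidates, w))   # -> "the house is little"
```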
Summary
As in Eq. (1), translation scores are a linear combination of the D × 1 feature vector h(e, f) ∈ R^D under the 1 × D model parameter vector w. We use convex geometry to show that the behaviour of training methods such as MERT (Och, 2003; Macherey et al., 2008), MIRA (Crammer et al., 2006), PRO (Hopkins and May, 2011), and others converges as the feature dimension becomes high. In particular, we analyse how robustness decreases in linear models as the feature dimension increases. In the process of building this geometric representation of linear models we discuss algorithms such as the Minkowski sum algorithm (Fukuda, 2004) and projected MERT (Section 4.2) that could be useful for designing new and more robust training algorithms for SMT and other natural language processing problems
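As a concrete illustration of the Minkowski sum operation referred to above, the sketch below builds the sum of two convex polygons by taking the convex hull of all pairwise vertex sums. This is the naive construction rather than Fukuda's (2004) algorithm, and the two polygons are made-up examples.

```python
# Naive Minkowski sum of two convex polygons (illustrative sketch only).
import numpy as np

def convex_hull(points):
    """Andrew's monotone-chain convex hull of a set of 2-D points."""
    pts = sorted(map(tuple, points))
    def build(chain_pts):
        chain = []
        for p in chain_pts:
            # pop while the last two points and p do not make a counter-clockwise turn
            while len(chain) >= 2:
                (x1, y1), (x2, y2) = chain[-2], chain[-1]
                if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) <= 0:
                    chain.pop()
                else:
                    break
            chain.append(p)
        return chain
    lower, upper = build(pts), build(pts[::-1])
    return np.array(lower[:-1] + upper[:-1])

def minkowski_sum(P, Q):
    """Vertices of the Minkowski sum P + Q = {p + q : p in P, q in Q} of convex P, Q."""
    sums = np.array([p + q for p in P for q in Q])
    return convex_hull(sums)

# toy example: a unit square plus a triangle
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
triangle = np.array([[0, 0], [2, 0], [0, 2]], dtype=float)
print(minkowski_sum(square, triangle))
```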