Abstract
Most modern statistical machine translation systems are based on linear statistical models. One extremely effective method for estimating the model parameters is minimum error rate training (MERT), which is an efficient form of line optimisation adapted to the highly nonlinear objective functions used in machine translation. We describe a polynomial-time generalisation of line optimisation that computes the error surface over a plane embedded in parameter space. The description of this algorithm relies on convex geometry, which is the mathematics of polytopes and their faces. Using this geometric representation of MERT we investigate whether the optimisation of linear models is tractable in general. Previous work on finding optimal solutions in MERT (Galley and Quirk, 2011) established a worst-case complexity that was exponential in the number of sentences; in contrast, we show that the exponential dependence in the worst-case complexity is mainly in the number of features. Although our work is framed with respect to MERT, the convex geometric description is also applicable to other error-based training methods for linear models. We believe our analysis has important ramifications because it suggests that the current trend of building statistical machine translation systems with very large numbers of sparse features is inherently not robust.
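As a rough illustration of the line-optimisation step described above, the sketch below (our own toy construction, with invented hypothesis features and error counts rather than anything from the paper) computes the exact error surface of a single sentence along a line w0 + gamma * d in parameter space: each hypothesis contributes a line in gamma, the upper envelope of those lines says which hypothesis the decoder would pick on each interval, and the interval with the lowest error yields the new gamma.

```python
# Minimal sketch of one MERT line search for a single sentence (illustrative only;
# hypothesis features, errors, and the direction d below are made-up examples).
import numpy as np

def upper_envelope(slopes, intercepts):
    """Return [(gamma_from, hyp_index), ...]: on each interval of the line parameter
    gamma, the hypothesis with the highest score intercept + gamma * slope."""
    # for each distinct slope keep only the hypothesis with the largest intercept
    best_for_slope = {}
    for i, s in enumerate(slopes):
        if s not in best_for_slope or intercepts[i] > intercepts[best_for_slope[s]]:
            best_for_slope[s] = i
    order = sorted(best_for_slope.values(), key=lambda i: slopes[i])
    env = []  # list of (gamma at which hypothesis i starts to win, i)
    for i in order:
        while env:
            g_prev, j = env[-1]
            # gamma at which hypothesis i overtakes hypothesis j
            g = (intercepts[j] - intercepts[i]) / (slopes[i] - slopes[j])
            if g <= g_prev:
                env.pop()          # j never wins on a non-empty interval
            else:
                env.append((g, i))
                break
        if not env:
            env.append((float("-inf"), i))
    return env

def line_search(w0, d, H, errors):
    """H is an (n_hyps x D) feature matrix, errors[i] the error count of hypothesis i.
    Returns a gamma minimising the error along the line w0 + gamma * d."""
    intercepts, slopes = H @ w0, H @ d
    env = upper_envelope(list(slopes), list(intercepts))
    best_gamma, best_err = None, None
    for k, (g, i) in enumerate(env):
        g_next = env[k + 1][0] if k + 1 < len(env) else None
        if g == float("-inf"):
            mid = (g_next - 1.0) if g_next is not None else 0.0
        elif g_next is None:
            mid = g + 1.0
        else:
            mid = 0.5 * (g + g_next)
        if best_err is None or errors[i] < best_err:
            best_gamma, best_err = mid, errors[i]
    return best_gamma

# toy example: 3 hypotheses, 2 features
H = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
errors = [2, 0, 1]
print(line_search(np.array([0.1, 0.1]), np.array([1.0, -1.0]), H, errors))  # -> -1.0
```

Full MERT accumulates such per-sentence envelopes over the whole tuning set and over many search directions; the sketch only shows the single-sentence, single-direction step.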
Highlights
The linear model of Statistical Machine Translation (SMT) (Och and Ney, 2002) casts translation as a search for translation hypotheses under a linear combination of weighted features: a source language sentence f is translated as

e(f; w) = argmax_e { w h(e, f) }    (1)

where translation scores are a linear combination of the D × 1 feature vector h(e, f) ∈ R^D under the 1 × D model parameter vector w (a toy decoding sketch for this rule is given after these highlights). Convex geometry (Ziegler, 1995) is the mathematics of such linear equations, presented as the study of convex polytopes
Using this geometric representation of minimum error rate training (MERT) we investigate whether the optimisation of linear models is tractable in general
We use convex geometry to show that the behaviour of training methods such as MERT (Och, 2003; Macherey et al., 2008), MIRA (Crammer et al., 2006), Pairwise Ranking Optimisation (PRO) (Hopkins and May, 2011), and others converges as the feature dimension becomes high
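The decoding rule in Eq. (1) can be made concrete with a small sketch; the candidate translations, feature values, and weights below are invented for illustration and are not taken from the paper.

```python
# Toy illustration of Eq. (1): pick the hypothesis whose feature vector scores
# highest under the weight vector w (all names and numbers are assumptions).
import numpy as np

def decode(hypotheses, w):
    """hypotheses: list of (translation_string, D-dim feature vector h(e, f)).
    Returns the translation maximising the linear score w . h(e, f)."""
    return max(hypotheses, key=lambda eh: float(np.dot(w, eh[1])))[0]

# toy candidate list for one source sentence, with D = 3 features
# (e.g. translation model, language model, length penalty)
candidates = [
    ("the house is small", np.array([-2.1, -1.3, 4.0])),
    ("the house is little", np.array([-2.5, -0.9, 4.0])),
    ("small the house",     np.array([-1.8, -3.0, 3.0])),
]
w = np.array([0.6, 1.0, 0.1])
print(decode(candidates, w))   # -> "the house is little"
```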
Summary
As in Eq. (1), translation scores are a linear combination of the D × 1 feature vector h(e, f) ∈ R^D under the 1 × D model parameter vector w. We use convex geometry to show that the behaviour of training methods such as MERT (Och, 2003; Macherey et al., 2008), MIRA (Crammer et al., 2006), PRO (Hopkins and May, 2011), and others converges as the feature dimension becomes high. In particular, we analyse how robustness decreases in linear models as the feature dimension increases. In the process of building this geometric representation of linear models we discuss algorithms such as the Minkowski sum algorithm (Fukuda, 2004) and projected MERT (Section 4.2) that could be useful for designing new and more robust training algorithms for SMT and other natural language processing problems
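As a concrete illustration of the Minkowski sum operation referred to above, the sketch below builds the sum of two convex polygons by taking the convex hull of all pairwise vertex sums. This is the naive construction rather than Fukuda's (2004) algorithm, and the two polygons are made-up examples.

```python
# Naive Minkowski sum of two convex polygons (illustrative sketch only).
import numpy as np

def convex_hull(points):
    """Andrew's monotone-chain convex hull of a set of 2-D points."""
    pts = sorted(map(tuple, points))
    def build(chain_pts):
        chain = []
        for p in chain_pts:
            # pop while the last two points and p do not make a counter-clockwise turn
            while len(chain) >= 2:
                (x1, y1), (x2, y2) = chain[-2], chain[-1]
                if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) <= 0:
                    chain.pop()
                else:
                    break
            chain.append(p)
        return chain
    lower, upper = build(pts), build(pts[::-1])
    return np.array(lower[:-1] + upper[:-1])

def minkowski_sum(P, Q):
    """Vertices of the Minkowski sum P + Q = {p + q : p in P, q in Q} of convex P, Q."""
    sums = np.array([p + q for p in P for q in Q])
    return convex_hull(sums)

# toy example: a unit square plus a triangle
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
triangle = np.array([[0, 0], [2, 0], [0, 2]], dtype=float)
print(minkowski_sum(square, triangle))
```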