Abstract

In this paper we investigate extending the default feature set of the Moses statistical machine translation (SMT) system with shallow linguistic information from source and target phrases. Although a typical SMT system uses a phrase table with 5 default features, most systems are extensible and support any number of additional features. We hypothesize that linguistic information extracted from the source and target phrases can improve overall translation quality, i.e. make the system more robust and reduce the number of incorrect word choices, punctuation mistakes, and other errors SMT systems are prone to. First, we build a baseline SMT system. Then we extract shallow linguistic features directly from the source and target phrases of the baseline system's phrase table. The features are precomputed and stored in the phrase table, so they can be regarded as stateless dense features. We develop and examine 19 features that incorporate information from source and target phrases, drawing on features commonly used in monolingual and parallel data filtering. The features we investigate include source and target phrase lengths; word, number, and punctuation symbol counts; and word frequencies estimated from large monolingual corpora. For each feature, we build and evaluate a separate SMT system. We conduct a series of experiments on the English-Russian language pair and obtain statistically significant improvements of up to 0.4 BLEU over the baseline configuration.
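To make the idea of precomputed phrase-table features concrete, the following is a minimal sketch, not the authors' implementation. It assumes the plain-text Moses phrase-table layout in which fields are separated by ` ||| ` and the third field holds the space-separated scores. The function names `shallow_features` and `extend_line`, the three illustrative features (token count, tokens containing a digit, punctuation-only tokens), and the sample scores are all assumptions made for illustration. The abstract does not say how feature values are scaled or transformed before being written, so the raw counts below are purely illustrative.

```python
import string

# Field separator assumed for the plain-text Moses phrase table.
SEP = " ||| "


def shallow_features(phrase):
    """Return [token count, tokens containing a digit, punctuation-only tokens].

    These three features are illustrative stand-ins; the paper's exact
    19 feature definitions are not given in the abstract.
    """
    tokens = phrase.split()
    n_num = sum(1 for t in tokens if any(ch.isdigit() for ch in t))
    n_punct = sum(
        1 for t in tokens if all(ch in string.punctuation for ch in t)
    )
    return [len(tokens), n_num, n_punct]


def extend_line(line):
    """Append source and target shallow features to the score field."""
    fields = line.rstrip("\n").split(SEP)
    src, tgt, scores = fields[0], fields[1], fields[2]
    extra = shallow_features(src) + shallow_features(tgt)
    fields[2] = scores.strip() + " " + " ".join(str(v) for v in extra)
    return SEP.join(fields)


if __name__ == "__main__":
    # Hypothetical phrase-table entry with made-up scores.
    sample = "the house , 2 ||| дом , 2 ||| 0.5 0.4 0.3 0.2 ||| 0-0 ||| 1 1 1"
    print(extend_line(sample))
```

Because each value depends only on the phrase pair itself, it can be computed once offline and read by the decoder like any other dense phrase-table score, which is what makes such features stateless.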
