Abstract

This work investigates neural machine translation (NMT) systems for translating English user reviews into Croatian and Serbian, two similar, morphologically complex languages. Two types of reviews are used for testing the systems: IMDb movie reviews and Amazon product reviews. Two types of training data are explored: large out-of-domain bilingual parallel corpora, and a small synthetic in-domain parallel corpus obtained by machine translation of monolingual English Amazon reviews into the target languages. Both automatic scores and human evaluation show that using the synthetic in-domain corpus together with a selected subset of out-of-domain data is the best option. Separate results on IMDb and Amazon reviews indicate that MT systems perform differently on different review types, so user reviews generally should not be considered a homogeneous genre. Nevertheless, more detailed research on a larger number of reviews covering different domains and topics is needed to fully understand these differences.

Highlights

  • Machine translation (MT) has evolved very rapidly since the emergence of neural approaches in 2015, and it is being used for different genres and domains

  • A considerable amount of work in the Computational Linguistics/Natural Language Processing community has been done on processing user-generated content, mostly on sentiment analysis, and on different aspects of MT

  • We ranked out-of-domain sentences according to their similarity to user reviews, and extracted the most similar ones to combine them with the synthetic parallel corpus and train an “advanced student” model
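The ranking-and-selection step in the last highlight can be illustrated with a minimal sketch. The similarity measure here (the fraction of a sentence's unigrams and bigrams that also occur in the in-domain reviews) is an assumption chosen for simplicity and is not necessarily the measure used in this work; the function names, the whitespace tokenization, and the `top_k` cut-off are likewise hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_in_domain_profile(reviews, max_n=2):
    """Count all 1..max_n-grams seen in the in-domain reviews."""
    profile = Counter()
    for text in reviews:
        tokens = text.lower().split()  # assumption: naive whitespace tokenization
        for n in range(1, max_n + 1):
            profile.update(ngrams(tokens, n))
    return profile


def similarity(sentence, profile, max_n=2):
    """Fraction of the sentence's n-grams that also occur in the profile."""
    tokens = sentence.lower().split()
    grams = [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in profile) / len(grams)


def select_most_similar(pairs, profile, top_k):
    """Rank (source, target) pairs by source-side similarity; keep top_k."""
    ranked = sorted(pairs, key=lambda p: similarity(p[0], profile), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Toy data; the target sides are placeholders, not real translations.
    reviews = ["this movie was great", "the product works great"]
    out_of_domain = [
        ("parliament adopted the resolution", "<target 1>"),
        ("this product was great", "<target 2>"),
    ]
    profile = build_in_domain_profile(reviews)
    print(select_most_similar(out_of_domain, profile, top_k=1))
```

The selected pairs would then be concatenated with the synthetic in-domain corpus before training. Scoring only the source side keeps the sketch independent of the target language.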

Summary

Introduction

Machine translation (MT) has evolved very rapidly since the emergence of neural approaches in 2015, and it is being used for different genres and domains. Some prior work focuses on the translation of TED talks, and some European projects (TraMOOC, transLectures) investigated the translation of online lectures. In both cases, the text can be considered “formal speech”, with the challenges of dealing with characteristics of spoken language and speech recognition output. We investigate Croatian and Serbian as target languages, as a case involving mid-size, less-resourced, morphologically rich European languages. For these languages, a reasonable amount of out-of-domain parallel data is publicly available to train an NMT system, although much less than for “major” European languages (such as German, French, Spanish). We used the publicly available texts consisting of a selected set of English IMDb reviews and their Croatian and Serbian human translations. Neither set of test reviews has been investigated yet, and they will be made publicly available.

Related work
Building NMT systems
User reviews
Out-of-domain data
Selected out-of-domain data
Experimental set-up
Comparing MT systems
Comparing Amazon and IMDb reviews
Summary and outlook
