Learning variable length units for SMT between related languages via Byte Pair Encoding

Anoop Kunchukuttan, Pushpak Bhattacharyya

doi:10.18653/v1/w17-4102

Abstract

We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare them with orthographic syllables, which are currently the best-performing basic units for this translation task. BPE identifies the most frequent character sequences as basic units, while orthographic syllables are linguistically motivated pseudo-syllables. We show that BPE units modestly outperform orthographic syllables as units of translation, with up to an 11% increase in BLEU score. While orthographic syllables can be used only for languages whose writing systems use vowel representations, BPE is writing-system independent, and we show that BPE outperforms other units for non-vowel writing systems too. Our results are supported by extensive experimentation spanning multiple language families and writing systems.

Highlights

The term "related languages" refers to languages that exhibit lexical and structural similarities on account of sharing a common ancestry or being in contact for a long period of time (Bhattacharyya et al., 2016).
We propose the use of Byte Pair Encoding (BPE) (Gage, 1994; Sennrich et al., 2016), an encoding method inspired by the text compression literature, to learn basic translation units for translation between related languages.
We show that BPE units modestly outperform orthographic syllable units (Kunchukuttan and Bhattacharyya, 2016b), the best-performing basic unit for translation between related languages, resulting in up to an 11% improvement in BLEU score.
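
To make the idea of "most frequent character sequences as basic units" concrete, the following is a minimal sketch of BPE merge learning in Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`merge_pair`, `learn_bpe`, `segment`), the lexicographic tie-breaking rule, and the omission of end-of-word markers are all choices made here for brevity.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of word tokens.

    Assumption: words start as sequences of single characters and no
    end-of-word marker is used (hypothetical simplification).
    """
    # Each distinct word is stored as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word into BPE units by replaying the learnt merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

With a toy corpus such as `["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3` and four merges, this sketch segments the unseen word "lowest" into the units "low" and "est". The number of merges controls how long the resulting units can grow, which is what makes the units variable in length.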

Summary

Introduction

The term "related languages" refers to languages that exhibit lexical and structural similarities on account of sharing a common ancestry or being in contact for a long period of time (Bhattacharyya et al., 2016). Prolonged contact leads to convergence of linguistic properties even if the languages are not related by ancestry, and could lead to the formation of linguistic areas (Thomason, 2000). There is substantial government, commercial and cultural communication among people speaking related languages (Europe, India and Southeast Asia being prominent examples, and linguistic regions in Africa possibly in the future). As these regions integrate more closely and move to a digital society, translation between related languages is becoming an important requirement. Given the lack of parallel corpora, it is important to leverage the relatedness of these languages to build good-quality statistical machine translation (SMT) systems.



Publication Date: Jan 1, 2017
Citations: 52
License type: CC BY

Similar Papers

Orthographic Syllable as basic unit for SMT between Related Languages
Anoop Kunchukuttan ... Pushpak Bhattacharyya
01 Jan 2015

The Development of Graphic Representation in Abugida Writing: The Akshara’s Grammar
Liudmila Fedorova
Lingua Posnaniensis | VOL. 55
01 Dec 2013

Controlling byte pair encoding for neural machine translation
Alfred John Tacorda ... Rachel Edita Roxas
01 Dec 2017

A comparative study of neural machine translation models for Turkish language
Özgür Özdemir ... Emre Salih Akın
Journal of Intelligent & Fuzzy Systems | VOL. 42
02 Feb 2022
