Abstract
We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare them with orthographic syllables, which are currently the best-performing basic units for this translation task. BPE identifies the most frequent character sequences as basic units, while orthographic syllables are linguistically motivated pseudo-syllables. We show that BPE units modestly outperform orthographic syllables as units of translation, with up to an 11% increase in BLEU score. While orthographic syllables can be used only for languages whose writing systems use vowel representations, BPE is writing-system independent, and we show that BPE outperforms other units for non-vowel writing systems too. Our results are supported by extensive experimentation spanning multiple language families and writing systems.
Highlights
The term "related languages" refers to languages that exhibit lexical and structural similarities on account of sharing a common ancestry or being in contact for a long period of time (Bhattacharyya et al., 2016)
We propose the use of Byte Pair Encoding (BPE) (Gage, 1994; Sennrich et al., 2016), an encoding method inspired by the text compression literature, to learn basic translation units for translation between related languages
We show that BPE units modestly outperform orthographic syllable units (Kunchukuttan and Bhattacharyya, 2016b), the best-performing basic unit for translation between related languages, resulting in up to an 11% improvement in BLEU score
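To make the idea of learning BPE units concrete, the following is a minimal toy sketch of the merge-learning procedure in the spirit of Sennrich et al. (2016). It is not the authors' actual pipeline: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words (toy version)."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    # (the marker string is an assumption of this sketch).
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken lexicographically
        # (an arbitrary choice made here only for determinism).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word into BPE units by replaying the learnt merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

For example, `merges = learn_bpe(corpus_words, 1000)` followed by `segment(word, merges)` would split a word into the learnt units. The number of merges controls the vocabulary size, and nothing in the procedure depends on vowels or any other property of the writing system.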
Summary
The term "related languages" refers to languages that exhibit lexical and structural similarities on account of sharing a common ancestry or being in contact for a long period of time (Bhattacharyya et al., 2016). Prolonged contact leads to convergence of linguistic properties even if the languages are not related by ancestry, and could lead to the formation of linguistic areas (Thomason, 2000). There is substantial government, commercial and cultural communication among people speaking related languages (Europe, India and Southeast Asia being prominent examples, with linguistic regions in Africa possibly joining them in the future). As these regions integrate more closely and move to a digital society, translation between related languages is becoming an important requirement. Given the lack of parallel corpora, it is important to leverage the relatedness of these languages to build good-quality statistical machine translation (SMT) systems.