Abstract

Subword segmentation plays an important role in Tibetan neural machine translation (NMT). Tibetan words have a two-level structure: a word consists of a sequence of syllables, and each syllable consists of a sequence of characters. Based on this structure, we propose two methods for Tibetan subword segmentation, a syllable-based method and a character-based method. The former builds subwords from Tibetan syllables, and the latter builds them from Tibetan characters. We evaluate both methods on low-resource Tibetan-to-Chinese NMT. The experimental results show that both improve translation performance, with the character-based segmentation achieving the better results. Overall, the proposed character-based subword segmentation is simpler and more effective, and it achieves these results without requiring much attention to the linguistic features of Tibetan.
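To make the two-level structure concrete, the following minimal sketch shows what the base units might look like at each level before any subword merging is learned. It is an illustration under stated assumptions, not the procedure used in this work: it assumes syllables are delimited by the tsheg mark (U+0F0B), treats each Unicode code point as one "character", and uses hypothetical function names (`syllable_units`, `character_units`).

```python
# Illustrative sketch only; the function names and the choice of
# "one Unicode code point = one character" are assumptions.

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark that separates syllables


def syllable_units(text: str) -> list[str]:
    """Split Tibetan text into syllables at the tsheg, dropping empty pieces."""
    return [syllable for syllable in text.split(TSHEG) if syllable]


def character_units(text: str) -> list[str]:
    """Split Tibetan text into code points, keeping the tsheg as its own unit."""
    return list(text)


# A two-syllable word ("bod yig", Tibetan script), written with escapes
# so the example does not depend on font or editor support.
word = "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42"

syllables = syllable_units(word)
characters = character_units(word)

print(len(syllables), syllables)
print(len(characters), characters)
```

A subword learner such as byte-pair encoding could then be trained over either unit sequence. The character-level variant needs no knowledge of syllable boundaries, which is consistent with the claim that it requires little attention to Tibetan linguistic features.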
