Abstract

The recognition and translation of organization names (ONs) is challenging due to their complex structures and high variability. ONs consist not only of common generic words but also of proper names, rare words, abbreviations, and business and industry jargon. ONs are a sub-class of named entity (NE) phrases, which convey key information in text; their correct translation is therefore critical for machine translation and cross-lingual information retrieval. Existing Chinese–Uyghur neural machine translation systems perform poorly on ON translation tasks. As no Chinese–Uyghur ON translation corpora are publicly available, this study develops one comprising 191,641 ON translation pairs. A word segmentation approach involving characterization, tagged characterization, byte pair encoding (BPE), and syllabification is proposed for ON translation. A recurrent neural network (RNN) attention framework and a transformer are adapted for ON translation at different sequence granularities. The experimental results indicate that the transformer not only outperforms the RNN attention model but also benefits from the proposed word segmentation approach. In addition, a Chinese–Uyghur ON translation system is developed to automatically generate new translation pairs. This work significantly improves Chinese–Uyghur ON translation, can be applied to improve Chinese–Uyghur machine translation and cross-lingual information retrieval, and can easily be extended to other agglutinative languages.
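The abstract lists several segmentation granularities (characterization, BPE, syllabification). As a rough illustration only, the sketch below shows character-level segmentation and a toy greedy BPE-style segmentation with a fixed merge table; the example strings and merge rules are hypothetical and are not drawn from the paper's corpus or vocabulary.

```python
# Illustrative sketch of two segmentation granularities mentioned above.
# Not the paper's implementation; inputs and merge rules are hypothetical.

def characterize(text: str) -> list[str]:
    """Character-level segmentation: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]

def bpe_segment(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Greedy BPE-style segmentation applying merges in priority order."""
    tokens = list(word)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]   # merge the adjacent pair
            else:
                i += 1
    return tokens

print(characterize("新疆大学"))    # ['新', '疆', '大', '学']
print(bpe_segment("universiteti", [("t", "i"), ("e", "ti")]))
# ['u', 'n', 'i', 'v', 'e', 'r', 's', 'i', 't', 'eti']
```

In practice a learned BPE vocabulary would supply the merge table; the point here is only that each granularity yields a different token sequence for the same ON.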

Highlights

  • In recent years, neural-network-based machine translation has made continual progress for high-resource languages

  • A novel named entity translation approach is introduced for Chinese–Uyghur organization name (ON) translation and translation-pair construction

  • As a named entity (NE) translation corpus, a Chinese–Uyghur organization name (ON) dataset was constructed as part of this study


Introduction

Neural-network-based machine translation has made continual progress for high-resource languages. Named entities (NEs) convey key information in text, and their incorrect translation can be problematic. Because training texts include both semantic translation and transliteration data, neural networks trained on large quantities of parallel sentence pairs may learn specific translation rules from the named entities included in those texts, thereby developing a capacity for translating NEs even when no separate NE translation model is implemented [2]. Low-resource language pairs typically have only a scarce bilingual corpus, with fewer NEs and a lower NE usage frequency. The translation performance of low-resource machine translation systems is therefore typically lower than that of comparable high-resource systems [3].

