Abstract

Generating the English transliteration of a name written in a foreign script is an important and challenging step in multilingual knowledge acquisition and information extraction. Existing approaches to transliteration generation require a large number (>5000) of training examples. This difficulty contrasts with transliteration discovery, a somewhat easier task that involves picking a plausible transliteration from a given list. In this work, we present a bootstrapping algorithm that uses constrained discovery to improve generation, and can be used with as few as 500 training examples, which we show can be sourced from annotators in a matter of hours. This opens the task to languages for which large numbers of training examples are unavailable. We evaluate transliteration generation performance itself, as well as the improvement it brings to cross-lingual candidate generation for entity linking, a typical downstream task. We present a comprehensive evaluation of our approach on nine languages, each written in a unique script.

Highlights

  • Transliteration is the process of transducing names from one writing system to another (e.g., ओबामा in Devanagari to Obama in Latin script) while preserving their pronunciation (Knight and Graehl, 1998; Karimi et al., 2011).

  • Existing transliteration generation models require supervision in the form of source-target name pairs (≈5-10k), which are often collected from names in Wikipedia inter-language links (Irvine et al., 2010).

  • We show the positive impact that our approach has on a downstream task by evaluating its contribution to candidate generation for Tigrinya and Macedonian entity linking (§8.2).


Summary

Introduction

Transliteration is the process of transducing names from one writing system to another (e.g., ओबामा in Devanagari to Obama in Latin script) while preserving their pronunciation (Knight and Graehl, 1998; Karimi et al., 2011). Two tasks feature prominently in the transliteration literature: generation (Knight and Graehl, 1998), which involves producing an appropriate transliteration for a given word in an open-ended way, and discovery, which involves picking a plausible transliteration from a given list.

This work develops transliteration generation approaches for low-resource languages. Existing transliteration generation models require supervision in the form of source-target name pairs (≈5-10k), which are often collected from names in Wikipedia inter-language links (Irvine et al., 2010). A model that requires 50k name pairs as supervision can only support 6 languages, while one that just needs 500 could support 56. For a model to be widely applicable, it must function in low-resource settings.
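To make the relationship between discovery and generation concrete, the following is a minimal sketch of a generic bootstrapping loop, written under strong simplifying assumptions. It is not the model or algorithm from the paper: the character-table "model", the positional scoring rule, the `threshold` parameter, and all function names (`train`, `score`, `discover`, `bootstrap`) are hypothetical stand-ins chosen only to keep the example self-contained. The idea it illustrates is that a model trained on a small seed set picks confident transliterations from a fixed candidate list (discovery), and those picks are added to the training data for the next round.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def train(pairs: List[Pair]) -> Dict[str, str]:
    """Toy stand-in for a generation model: a char-to-char table
    learned from equal-length pairs by positional alignment."""
    table: Dict[str, str] = {}
    for src, tgt in pairs:
        if len(src) == len(tgt):
            for s, t in zip(src, tgt):
                table.setdefault(s, t)  # keep the first mapping seen
    return table


def score(table: Dict[str, str], src: str, cand: str) -> float:
    """Fraction of source characters whose known mapping agrees with
    the candidate at the same position (0.0 on length mismatch)."""
    if not src or len(src) != len(cand):
        return 0.0
    hits = sum(1 for s, c in zip(src, cand) if table.get(s) == c)
    return hits / len(src)


def discover(table: Dict[str, str], src: str,
             candidates: List[str]) -> Tuple[str, float]:
    """Discovery: pick the best-scoring candidate from a fixed list."""
    best = max(candidates, key=lambda c: score(table, src, c))
    return best, score(table, src, best)


def bootstrap(seed: List[Pair], unlabeled: List[str], candidates: List[str],
              rounds: int = 3,
              threshold: float = 0.5) -> Tuple[Dict[str, str], List[Pair]]:
    """Alternate between training and discovery, growing the training set
    with confident discoveries. `threshold` is an assumed cutoff."""
    pairs = list(seed)
    remaining = list(unlabeled)
    for _ in range(rounds):
        table = train(pairs)
        still_unlabeled: List[str] = []
        for src in remaining:
            best, s = discover(table, src, candidates)
            if s >= threshold:
                pairs.append((src, best))
            else:
                still_unlabeled.append(src)
        if len(still_unlabeled) == len(remaining):
            break  # nothing new was discovered in this round
        remaining = still_unlabeled
    return train(pairs), pairs
```

With a one-pair seed such as `[("αβ", "ab")]`, unlabeled words `["βαγ", "γδ"]`, and candidates `["bac", "cd", "xyz"]`, the first round can only resolve `βαγ`; the mapping it contributes for `γ` is what lets the second round resolve `γδ`. A realistic system would replace the table with a learned sequence model and the positional score with model likelihood.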

