Abstract

Bilingual dictionaries for technical terms such as biomedical terms are an important resource for machine translation systems as well as for humans who would like to understand a concept described in a foreign language. Often a biomedical term is first proposed in English and later manually translated into other languages. Although large monolingual lexicons of biomedical terms exist, only a fraction of their terms have been translated into other languages. Manually compiling large-scale bilingual dictionaries for technical domains is a challenging task because it is difficult to find a sufficiently large number of bilingual experts. We propose a cross-lingual similarity measure for detecting the most similar translation candidates in another (target) language for a biomedical term specified in one (source) language. Specifically, a biomedical term in a language is represented using two types of features: (a) intrinsic features that consist of character n-grams extracted from the term under consideration, and (b) extrinsic features that consist of unigrams and bigrams extracted from the contextual windows surrounding the term under consideration. We propose a cross-lingual similarity measure using each of those feature types. First, to reduce the dimensionality of the feature space in each language, we propose prototype vector projection (PVP), a non-negative lower-dimensional vector projection method. Second, we propose a method to learn a mapping between the feature spaces in the source and target languages using partial least squares regression (PLSR). The proposed method requires only a small number of training instances to learn a cross-lingual similarity measure. The proposed PVP method outperforms popular dimensionality reduction methods such as the singular value decomposition (SVD) and non-negative matrix factorization (NMF) in a nearest neighbor prediction task. Moreover, our experimental results covering several language pairs such as English–French, English–Spanish, English–Greek, and English–Japanese show that the proposed method outperforms several other feature projection methods in biomedical term translation prediction tasks.
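
The two feature types described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `char_ngrams` and `context_ngrams`, the boundary markers `^` and `$`, the n-gram orders, the window size, and the restriction to single-token terms are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(term, n_values=(2, 3)):
    """Intrinsic features: counts of character n-grams taken from the term itself."""
    # Boundary markers are an illustrative choice, not something the source specifies.
    padded = f"^{term.lower()}$"
    feats = Counter()
    for n in n_values:
        for i in range(len(padded) - n + 1):
            feats[padded[i:i + n]] += 1
    return feats


def context_ngrams(tokens, term, window=2):
    """Extrinsic features: unigrams and bigrams from windows around each occurrence.

    Assumes the term is a single token; the window size is a hypothetical default.
    """
    feats = Counter()
    for i, tok in enumerate(tokens):
        if tok != term:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        for side in (left, right):
            feats.update(side)  # unigrams
            feats.update(" ".join(pair) for pair in zip(side, side[1:]))  # bigrams
    return feats


sentence = ["the", "acute", "hepatitis", "virus", "spreads"]
intrinsic = char_ngrams("hepatitis")
extrinsic = context_ngrams(sentence, "hepatitis")
```

Under these assumptions, `intrinsic` and `extrinsic` are sparse count vectors for one term in one language; the projection and mapping steps named in the abstract would operate on vectors of this kind.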

Highlights

  • Technical terms are coined in many domains on a daily basis

  • Because singular value decomposition (SVD) and non-negative matrix factorization (NMF) compute low-rank approximations of the matrix defined by the feature vectors, the correlation does not improve once the dimensionality reaches the rank of the data matrix

  • In the larger 10,000-dimensional setting shown in Figs 3 and 4, Kendall’s τ drops for the SVD and NMF methods once the dimensionality is increased beyond 300
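
The rank argument in the highlights can be checked on synthetic data. The sketch below is only an illustration of the general property of truncated SVD, not a reproduction of the reported experiments: the matrix sizes, the rank of 5, and the helper name `truncation_error` are assumptions, and NMF is not covered.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 50 x 40 data matrix with rank exactly 5 (hypothetical data).
X = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 40))

U, s, Vt = np.linalg.svd(X, full_matrices=False)


def truncation_error(k):
    """Frobenius-norm error of the rank-k truncated SVD approximation of X."""
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return np.linalg.norm(X - X_k)


# The error shrinks until k reaches the rank of X, then stays at numerical zero.
errors = {k: truncation_error(k) for k in (1, 3, 5, 10, 20)}
for k, err in errors.items():
    print(f"k={k:2d}  error={err:.3e}")
```

Once `k` reaches the rank of `X`, extra dimensions carry no additional information about the matrix, which is consistent with the observation that the correlation stops improving at that point.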


Summary

Introduction

Technical terms are coined in many domains on a daily basis. In specialized domains such as medicine, technical terms are often first proposed in English and later translated into other languages. Finding proper translations for technical terms is an important factor in expediting the spread of technical knowledge across languages. Bilingual dictionaries for technical terms play an important role in both manual [1] and machine translation [2] approaches. Only a small fraction of the technical terms proposed in English are translated into other languages, which is problematic for machine translation systems that require bilingual term lexicons. The unbalanced representation of languages other than English in the Unified Medical Language System (UMLS) demonstrates the severity of the problem of technical term translation.

Methods
Results
Conclusion
