A Flexible Rule-Based Approach to Learn Medical English-Chinese OOV Term Translations from the Web

Jian Qu, Akira Shimazu, Pakinee Aimmanee, Nguyen Le Ming, Cholwich Nattee, Thanaruk Theeramunkong

doi:10.1142/s1793840612400132

Abstract

Out-of-vocabulary (OOV) terms, which do not exist in most dictionaries, usually cause failures in cross-language information retrieval (CLIR) systems. Most existing web-mining approaches perform well when translating named-entity OOV terms. However, they perform poorly on medical OOV terms, because the Chinese translations of these terms contain non-Chinese characters, such as symbols, Roman letters, and Arabic numerals, which existing approaches normally ignore. This paper presents a flexible rule-based approach to acquiring medical OOV term translations. Our method combines a novel rule-based pattern extraction with brute-force generation to identify the non-Chinese parts of a translation. To cope with the time-consuming tasks of producing ranked candidate lists and manually extracting OOV term translations from them, this paper also presents a machine learning method that selects correct translations automatically. In this method, twenty-one different features are extracted for each Chinese translation candidate, and the correct Chinese translations are selected by machine learning combined with our newly proposed statistics filter. In tests on 1,654 English ICD-9 medical OOV terms, the proposed method (SF+F+W+B+P+S with SVM as the base machine learning algorithm) outperforms existing methods, with a recall of 83.05% and a precision of 79.72%.
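The role of the non-Chinese characters can be illustrated with a minimal sketch. It is not the rule set or generation procedure of the paper, which the abstract does not specify. It only assumes a hypothetical rule (a candidate may mix Chinese characters with Roman letters, digits, and a few symbols) and a hypothetical brute-force enumeration of substrings; the function names, the allowed symbol set, and the sample snippet are all illustrative.

```python
import re

# Basic CJK Unified Ideographs range (an assumption; rarer characters fall outside it).
CJK = r"\u4e00-\u9fff"

# Hypothetical rule: a run may mix Chinese characters with Roman letters,
# digits, and a small, illustrative set of symbols.
ALLOWED_RUN = re.compile(rf"[{CJK}A-Za-z0-9\-+/.]+")
HAS_CJK = re.compile(rf"[{CJK}]")


def chinese_only_candidates(snippet):
    """Baseline: keep only maximal runs of Chinese characters."""
    return re.findall(rf"[{CJK}]+", snippet)


def mixed_candidates(snippet, min_len=2, max_len=12):
    """Brute-force every substring of each allowed run, keeping those
    that contain at least one Chinese character."""
    found = set()
    for run in ALLOWED_RUN.findall(snippet):
        n = len(run)
        for start in range(n):
            longest_end = min(n, start + max_len)
            for end in range(start + min_len, longest_end + 1):
                span = run[start:end]
                if HAS_CJK.search(span):
                    found.add(span)
    return found


if __name__ == "__main__":
    snippet = "维生素B12缺乏性贫血 (vitamin B12 deficiency anemia)"
    print(chinese_only_candidates(snippet))
    print("维生素B12缺乏性贫血" in mixed_candidates(snippet))
```

On the sample snippet, the Chinese-only baseline splits the term at "B12", while the mixed enumeration keeps the full string among its candidates. The enumeration over-generates heavily, which is why a selection step such as the feature-based classifier described in the abstract is needed afterwards.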

Full Text