Abstract

Out-of-vocabulary (OOV) terms, which are absent from most dictionaries, are a common cause of failure in cross-language information retrieval (CLIR) systems. Most existing approaches perform well when they use web mining to translate OOV terms that are named entities. However, these methods perform poorly on medical OOV terms, because these terms contain non-Chinese characters, such as symbols, Roman letters, and Arabic numerals, which existing approaches normally ignore. This paper presents a flexible rule-based approach to acquiring translations of medical OOV terms. Our method combines a novel rule-based pattern extraction with brute-force generation to identify the non-Chinese parts of a translation. To avoid the time-consuming tasks of ranking candidate lists and manually extracting OOV term translations, this paper also presents a machine learning method that selects correct translations automatically. In this method, twenty-one different features are extracted for each Chinese translation candidate, and the correct Chinese translations are selected by machine learning combined with our newly proposed statistics filter. In tests on 1,654 English ICD9 medical OOV terms, the proposed method (SF+F+W+B+P+S with SVM as the base machine learning algorithm) outperforms existing methods, with recall and precision of 83.05% and 79.72%, respectively.
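
To make the notion of a "non-Chinese part" concrete, the following is a minimal sketch of how such parts might be located in a translation candidate. It is not the paper's rule set: the function name `non_chinese_parts`, the example strings, and the choice to approximate "Chinese character" by the CJK Unified Ideographs block (U+4E00 to U+9FFF) are all assumptions made for illustration.

```python
import re

# Assumption: a "Chinese character" is approximated by the CJK Unified
# Ideographs block (U+4E00..U+9FFF). Everything else that is not whitespace
# (symbols, Roman letters, Arabic numerals) counts as non-Chinese.
NON_CHINESE_RUN = re.compile(r"[^\u4e00-\u9fff\s]+")


def non_chinese_parts(candidate: str) -> list[str]:
    """Return the maximal runs of non-Chinese, non-whitespace characters."""
    return NON_CHINESE_RUN.findall(candidate)


# Hypothetical translation candidates, chosen only to illustrate the idea.
print(non_chinese_parts("维生素B12缺乏"))  # a Roman letter followed by digits
print(non_chinese_parts("2型糖尿病"))      # a leading Arabic numeral
print(non_chinese_parts("糖尿病"))         # purely Chinese, so no runs
```

A candidate extractor that discards such runs would lose the "B12" or "2" in the examples above, which is the kind of loss the abstract attributes to existing approaches.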
