Abstract
Out-of-vocabulary (OOV) terms, which are absent from most dictionaries, are a common cause of failure in cross-language information retrieval (CLIR) systems. Most existing approaches perform well when they use web mining to translate named-entity OOV terms. However, they perform poorly on medical OOV terms, because the translations of such terms contain non-Chinese characters (symbols, Roman letters and Arabic numerals) that existing approaches normally ignore. This paper presents a flexible rule-based approach to acquiring translations of medical OOV terms. Our method combines a novel rule-based pattern extraction with brute-force generation to identify the non-Chinese part of a translation. To avoid the time-consuming work of ranking candidate lists and manually extracting OOV term translations, this paper also presents a machine learning method that selects correct translations automatically. In this method, twenty-one features are extracted for each Chinese translation candidate, and the correct Chinese translations are selected by machine learning combined with our newly proposed statistics filter. In tests on 1,654 English ICD9 medical OOV terms, our proposed method (SF+F+W+B+P+S with SVM as the base machine learning algorithm) outperforms existing methods, with a recall of 83.05% and a precision of 79.72%.
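To illustrate why non-Chinese characters matter, the following minimal Python sketch splits a candidate translation into Chinese and non-Chinese runs. It is not the rule set described in this paper: the function name `split_segments`, the use of the CJK Unified Ideographs range (U+4E00 to U+9FFF) as a stand-in for "Chinese character", and the example strings are all assumptions made for illustration.

```python
import re

# Assumption: a "Chinese character" is approximated by the basic
# CJK Unified Ideographs block (U+4E00 to U+9FFF).
_RUN = re.compile(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff\s]+")


def split_segments(text):
    """Split text into (kind, segment) pairs.

    kind is "zh" for a run of Chinese characters and "other" for a run of
    symbols, Roman letters or Arabic numerals. Whitespace is dropped.
    """
    segments = []
    for match in _RUN.finditer(text):
        segment = match.group()
        kind = "zh" if "\u4e00" <= segment[0] <= "\u9fff" else "other"
        segments.append((kind, segment))
    return segments


if __name__ == "__main__":
    # Hypothetical candidate strings, not taken from the paper's data.
    for candidate in ["维生素B12缺乏", "2型糖尿病"]:
        print(candidate, split_segments(candidate))
```

A method that keeps only the "zh" runs would lose "B12" or "2" from candidates like these, which is the kind of loss the abstract attributes to existing approaches.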
More From: International Journal of Computer Processing of Languages