Abstract
Background: Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain.
Results: We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish) we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text.
Conclusions: We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts.
Highlights
Most of the institutional and research information in the biomedical domain is available in the form of English text
Previous work notes that large parallel corpora in the biomedical domain are not readily available, which limits the quality of automatic translation
We apply this method to two language pairs, English/Spanish (EN/ES) and English/French (EN/FR), and evaluate the quality of the extracted resources in two ways: first by direct analysis of the extracted data, and second by using the data to train statistical machine translation models that translate biomedical text
Summary
We show the results obtained during the development of the multilingual corpus. The results indicate that training an SMT system on biomedical domain-specific corpora rather than out-of-domain corpora improves translation performance for domain-specific texts. The fluency values are higher for the Spanish set than for the French set. This partially correlates with the results of the automatic assessment, in which the BLEU metric was also higher. Even though a larger corpus is available for French, the BLEU scores obtained for translating abstract sentences in the language pairs involving French are lower. The French set has a larger number of sentences, which cover more vocabulary items.
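Since the summary compares systems by BLEU score, a minimal sketch of how a corpus-level BLEU score can be computed may help in reading those comparisons. This is a simplified illustration, not the scoring script used in the work: it assumes whitespace tokenization, a single reference per sentence, and no smoothing, and the function names `ngrams` and `corpus_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (assumption: one reference per
    hypothesis, whitespace tokenization, no smoothing)."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


if __name__ == "__main__":
    refs = ["the patient was treated with ibuprofen daily"]
    hyps = ["the patient was treated with aspirin daily"]
    print(corpus_bleu(hyps, refs))
```

Scores from this sketch are on a 0 to 1 scale and are only comparable when computed on the same test set with the same tokenization; an established evaluation tool is preferable for reporting results.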