Cross-lingual Natural Language Processing Research Articles

Textual datasets (corpora) are crucial for the application of natural language processing (NLP) models. However, corpus creation in the medical field is challenging, primarily because of privacy issues with raw clinical data such as health records. Thus, the existing clinical corpora are generally small and scarce. Medical NLP (MedNLP) methodologies perform well with limited data availability. We present the outcomes of the Real-MedNLP workshop, which was conducted using limited and parallel medical corpora. Real-MedNLP exhibits three distinct characteristics: (1) Limited Annotated Documents: The training data comprise only a small set (approximately 100) of annotated case reports (CRs) and radiology reports (RRs). (2) Bilingually Parallel: The constructed corpora are parallel in Japanese and English. (3) Practical Tasks: The workshop addresses both fundamental tasks, such as named entity recognition, and applied practical tasks. We propose three tasks: named entity recognition (NER) using approximately 100 available documents (Task 1), NER based only on annotation guidelines written for humans (Task 2), and clinical applications (Task 3) consisting of adverse drug effect (ADE) detection for CRs and identical case identification (CI) for RRs. Nine teams participated in this study. The best systems achieved F1-scores of 0.65 and 0.89 for CRs and RRs in Task 1, whereas the top scores in Task 2 were 50-70% lower. In Task 3, ADE reports were detected with an F1-score of up to 0.64, and CI reached a binary accuracy of up to 0.96. Most systems adopted medical-domain-specific pre-trained language models together with data augmentation methods. Despite the challenge of limited corpus size in Tasks 1 and 2, recent approaches are promising because the partial-match F1-scores reached approximately 0.8-0.9. The Task 3 applications revealed that differences in the availability of external language resources affected performance in each language.
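The abstract reports both exact and partial-match F1-scores for NER without stating how a partial match is defined. The sketch below shows one common convention, assumed here for illustration only: a predicted span counts as a partial match if it has the same label as a gold span and overlaps it by at least one character. The `Span` layout, the function names, and the toy spans are all hypothetical and are not taken from the workshop's evaluation scripts.

```python
from typing import List, Tuple

# Hypothetical span layout: (start offset, exclusive end offset, entity label).
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share a label and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def span_f1(gold: List[Span], pred: List[Span], partial: bool = False) -> float:
    """Span-level F1. Exact mode requires identical offsets and label;
    partial mode (an assumed definition) accepts any same-label overlap."""
    if not gold or not pred:
        return 0.0
    if partial:
        matched_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
        matched_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(pred)
        matched_pred = sum(p in gold_set for p in pred)
        matched_gold = sum(g in pred_set for g in gold)
    precision = matched_pred / len(pred)
    recall = matched_gold / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy spans, invented for illustration.
gold = [(0, 5, "disease"), (10, 18, "drug"), (25, 30, "disease")]
pred = [(0, 5, "disease"), (10, 15, "drug"), (40, 45, "drug")]

exact_f1 = span_f1(gold, pred)                   # only the first span matches exactly
partial_f1 = span_f1(gold, pred, partial=True)   # the truncated "drug" span also counts
print(f"exact={exact_f1:.3f} partial={partial_f1:.3f}")
```

Under this assumed definition, a prediction with a slightly wrong boundary lowers the exact score but not the partial one, which is one way a system could score far higher on partial match than on exact match.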

Multilingual neural machine translation (MNMT) has attracted increasing attention recently because it can use a single neural machine translation (NMT) model to translate between multiple languages. As several languages are involved in MNMT, recent studies have shown that training the model on a subset of these languages rather than all of them leads to comparable results. However, previous work on this topic mainly focuses on language clustering and features defined by linguists. Semantic relationships and language distance are not fully considered. How to select the language pairs most related to the current low-resource pair in order to optimize the performance of MNMT is still an open question. In this paper, we propose to treat language relatedness computation as a ranking problem, where features such as language distance, linguistic typological information and semantic relatedness are incorporated into a random decision forest to improve the language relatedness evaluation (LRE) for MNMT. Since the model only focuses on monolingual LRE in general cross-lingual natural language processing tasks, we also propose two features related to machine translation (data size and bilingual relatedness) to predict the final language pairs. Experimental results on IWSLT and WMT datasets show that our proposed LRE method achieves significant improvements compared with other models. We also conducted several groups of experiments on IWSLT and WMT datasets to further evaluate the effectiveness of the proposed method on MNMT. The results show that the MNMT model trained on language pairs predicted by the LRE method outperforms other language selection methods.
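To make the "relatedness as ranking" framing concrete, the sketch below scores candidate auxiliary language pairs from the five feature types the abstract names and sorts them. It is a deliberately simplified stand-in: it uses a fixed weighted sum, not the random decision forest described in the abstract, and every weight, feature value, and pair name is invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """A candidate auxiliary language pair with its feature values."""
    pair: str
    features: Dict[str, float]


# Hypothetical hand-set weights. The abstract describes a random decision
# forest over these feature types; a fixed linear scorer is used here only
# to illustrate the ranking step.
WEIGHTS: Dict[str, float] = {
    "language_distance": -1.0,      # larger distance lowers the score
    "typological_similarity": 0.5,
    "semantic_relatedness": 1.0,
    "data_size": 0.3,
    "bilingual_relatedness": 0.8,
}


def score(candidate: Candidate) -> float:
    """Weighted sum of the candidate's features; missing features count as 0."""
    return sum(w * candidate.features.get(name, 0.0) for name, w in WEIGHTS.items())


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates ordered from most to least related."""
    return sorted(candidates, key=score, reverse=True)


# Placeholder candidates with made-up, normalized feature values.
candidates = [
    Candidate("pair_b", {"language_distance": 0.6, "typological_similarity": 0.4,
                         "semantic_relatedness": 0.5, "data_size": 0.9,
                         "bilingual_relatedness": 0.3}),
    Candidate("pair_c", {"language_distance": 0.9, "typological_similarity": 0.1,
                         "semantic_relatedness": 0.2, "data_size": 1.0,
                         "bilingual_relatedness": 0.1}),
    Candidate("pair_a", {"language_distance": 0.2, "typological_similarity": 0.8,
                         "semantic_relatedness": 0.7, "data_size": 0.5,
                         "bilingual_relatedness": 0.6}),
]

ranked_pairs = [c.pair for c in rank(candidates)]
print(ranked_pairs)
```

In a setup closer to the abstract, the `score` function would be replaced by a learned model's prediction, and the top-ranked pairs would then be used as training data alongside the low-resource pair.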

Articles published on Cross-lingual Natural Language Processing

A Benchmark Evaluation of Multilingual Large Language Models for Arabic Cross-Lingual Named-Entity Recognition

Cross-lingual Natural Language Processing on Limited Annotated Case/Radiology Reports in English and Japanese: Insights from the Real-MedNLP Workshop.

Language relatedness evaluation for multilingual neural machine translation

Improving the Robustness of Loanword Identification in Social Media Texts

Word-Pair Relevance Modeling with Multi-View Neural Attention Mechanism for Sentence Alignment

Exploring Implicit Semantic Constraints for Bilingual Word Embeddings

Bilingual lexicon induction from non-parallel corpora

A neural generative autoencoder for bilingual word embeddings

Introduction to the Special Issue on Cross-Language Algorithms and Applications

A statistical approach to crosslingual natural language tasks

