Spanish-language text classification for environmental evidence synthesis using multilingual pre-trained models.

Abstract

Artificial intelligence (AI) is increasingly being explored as a tool to optimize and accelerate various stages of evidence synthesis. A persistent challenge is that environmental evidence syntheses remain predominantly monolingual (English), leading to biased results and misinforming cross-scale policy decisions. AI offers a promising opportunity to incorporate non-English-language evidence into the screening process of evidence syntheses and to help move beyond their current monolingual focus. Using a corpus of Spanish-language peer-reviewed papers on biodiversity conservation interventions, we developed and evaluated text classifiers using supervised machine learning models. Our best-performing model achieved 100% recall, meaning no relevant papers (n = 9) were missed, while filtering out over 70% (n = 867) of negative documents based only on the title and abstract of each paper. The text was encoded using a pre-trained multilingual model, and class weights were used to handle a highly imbalanced dataset (0.79% positive). This research therefore offers an approach to reducing the manual, time-intensive effort required for document screening in evidence syntheses, with minimal risk of missing relevant studies. It highlights the potential of multilingual large language models and class weights to train a lightweight non-English-language classifier that can effectively filter irrelevant texts, using only a small labelled non-English corpus. Future work could build on our approach to develop a multilingual classifier that enables the inclusion of any non-English scientific literature in evidence syntheses.
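The class-weighting step described above can be sketched in a few lines. This is an illustrative reimplementation of the common "balanced" heuristic (as in scikit-learn's `class_weight="balanced"`), not the authors' exact code; the negative count of 1,131 is chosen only so that 9 positives reproduce the 0.79% positive rate reported in the abstract.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Balanced-class heuristic: w_c = n_samples / (n_classes * count_c),
    so the rare (relevant) class receives a proportionally larger weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# 9 relevant papers among 1,140 screened records gives the 0.79% positive
# rate reported above (the negative count is illustrative, not from the paper).
labels = [1] * 9 + [0] * 1131
weights = balanced_class_weights(labels)
# weights[1] ≈ 63.3, weights[0] ≈ 0.50: each relevant paper counts
# roughly 126x more than an irrelevant one during training.
```

Passing such weights to the training loss pushes the classifier toward high recall on the minority class, which is exactly the property screening needs: missing a relevant paper is far costlier than keeping an irrelevant one.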

Similar Papers
  • Conference Article
  • Citations: 8
  • 10.18653/v1/2021.calcs-1.20
Are Multilingual Models Effective in Code-Switching?
  • Jan 1, 2021
  • Genta Indra Winata + 5 more

Multilingual language models have shown decent performance in multilingual and cross-lingual natural language understanding tasks. However, the power of these multilingual models in code-switching tasks has not been fully explored. In this paper, we study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting by considering the inference speed, performance, and number of parameters to measure their practicality. We conduct experiments in three language pairs on named entity recognition and part-of-speech tagging and compare them with existing methods, such as using bilingual embeddings and multilingual meta-embeddings. Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching, while using meta-embeddings achieves similar results with significantly fewer parameters.

  • Conference Article
  • Citations: 18
  • 10.18653/v1/k19-1030
Improving Pre-Trained Multilingual Model with Vocabulary Expansion
  • Jan 1, 2019
  • Hai Wang + 4 more

Recently, pre-trained language models have achieved remarkable success in a broad range of natural language processing tasks. However, in a multilingual setting, it is extremely resource-consuming to pre-train a deep language model over large-scale corpora for each language. Instead of exhaustively pre-training monolingual language models independently, an alternative solution is to pre-train a powerful multilingual deep language model over large-scale corpora in hundreds of languages. However, the vocabulary size for each language in such a model is relatively small, especially for low-resource languages. This limitation inevitably hinders the performance of these multilingual models on tasks such as sequence labeling, wherein in-depth token-level or sentence-level understanding is essential. In this paper, inspired by previous methods designed for monolingual settings, we investigate two approaches (i.e., joint mapping and mixture mapping) based on the pre-trained multilingual model BERT for addressing the out-of-vocabulary (OOV) problem on a variety of tasks, including part-of-speech tagging, named entity recognition, machine translation quality estimation, and machine reading comprehension. Experimental results show that using mixture mapping is more promising. To the best of our knowledge, this is the first work that attempts to address and discuss the OOV issue in multilingual settings.

  • Conference Article
  • Citations: 16
  • 10.1145/3539597.3570468
Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation
  • Feb 27, 2023
  • Zhiqi Huang + 2 more

Benefiting from transformer-based pre-trained language models, neural ranking models have made significant progress. More recently, the advent of multilingual pre-trained language models provides great support for designing neural cross-lingual retrieval models. However, due to unbalanced pre-training data across languages, multilingual language models have already shown a performance gap between high- and low-resource languages in many downstream tasks. Cross-lingual retrieval models built on such pre-trained models can inherit this language bias, leading to suboptimal results for low-resource languages. Moreover, unlike the English-to-English retrieval task, where large-scale training collections for document ranking such as MS MARCO are available, the lack of cross-lingual retrieval data for low-resource languages makes it more challenging to train cross-lingual retrieval models. In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval. To transfer a model from high- to low-resource languages, OPTICAL formulates the cross-lingual token alignment task as an optimal transport problem to learn from a well-trained monolingual retrieval model. By separating the cross-lingual knowledge from the knowledge of query-document matching, OPTICAL only needs bitext data for distillation training, which is more feasible for low-resource languages. Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages, including neural machine translation.

  • Research Article
  • 10.3233/jifs-231485
Improving sentence representation for Vietnamese natural language understanding using optimal transport
  • Dec 2, 2023
  • Journal of Intelligent & Fuzzy Systems
  • Phu Xuan-Vinh Nguyen + 3 more

Multilingual pre-trained language models have achieved impressive results on most natural language processing tasks. However, their performance is inhibited by capacity limitations and the under-representation of some languages in the pre-training data, especially languages with limited resources. This has led to the creation of tailored pre-trained language models, in which the models are pre-trained on large amounts of monolingual data or a domain-specific corpus. Nevertheless, compared to relying on multiple monolingual models, utilizing multilingual models offers the advantage of multilinguality, such as generalization across cross-lingual resources. To combine the advantages of both multilingual and monolingual models, we propose KDDA, a framework that distills knowledge from monolingual models into a single multilingual model with the aim of improving sentence representation for Vietnamese. KDDA employs a teacher-student framework and cross-lingual transfer that adopts knowledge from two monolingual models (teachers) and transfers it into a unified multilingual model (student). Since the representations from the teachers and the student lie in disparate semantic spaces, we measure the discrepancy between their distributions using Sinkhorn Divergence, an optimal transport distance. We conduct experiments on two Vietnamese natural language understanding tasks, including machine reading comprehension and natural language inference. Experimental results show that our model outperforms other state-of-the-art models and yields competitive performances.
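The Sinkhorn divergence mentioned in the abstract above is built on entropy-regularized optimal transport, which can be computed with the Sinkhorn-Knopp iteration. A minimal pure-Python sketch (illustrative only, not the KDDA implementation) of the transport plan between two discrete distributions:

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropy-regularized optimal transport via Sinkhorn-Knopp.
    cost: n x m cost matrix; a, b: source/target marginals (each summing to 1).
    Returns the transport plan P with row sums ~a and column sums ~b."""
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    n, m = len(a), len(b)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate scaling so the plan matches both marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

In a distillation setting like the one described, the cost matrix would be a distance between teacher and student representations, and the resulting transport cost serves as the training loss aligning the two semantic spaces.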

  • Research Article
  • Citations: 7
  • 10.1609/icwsm.v16i1.19356
Overcoming Language Disparity in Online Content Classification with Multimodal Learning
  • May 31, 2022
  • Proceedings of the International AAAI Conference on Web and Social Media
  • Gaurav Verma + 4 more

Advances in Natural Language Processing (NLP) have revolutionized the way researchers and practitioners address crucial societal problems. Large language models are now the standard to develop state-of-the-art solutions for text detection and classification tasks. However, the development of advanced computational techniques and resources is disproportionately focused on the English language, sidelining a majority of the languages spoken globally. While existing research has developed better multilingual and monolingual language models to bridge this language disparity between English and non-English languages, we explore the promise of incorporating the information contained in images via multimodal machine learning. Our comparative analyses on three detection tasks focusing on crisis information, fake news, and emotion recognition, as well as five high-resource non-English languages, demonstrate that: (a) detection frameworks based on pre-trained large language models like BERT and multilingual-BERT systematically perform better on the English language compared against non-English languages, and (b) including images via multimodal learning bridges this performance gap. We situate our findings with respect to existing work on the pitfalls of large language models, and discuss their theoretical and practical implications.

  • Video Transcripts
  • 10.48448/smsn-9d37
Overcoming Language Disparity in Online Content Classification with Multimodal Learning
  • May 8, 2022
  • Rohit Mujumdar + 4 more


  • Research Article
  • Citations: 15
  • 10.1609/aaai.v35i14.17505
How Linguistically Fair Are Multilingual Pre-Trained Language Models?
  • May 18, 2021
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Monojit Choudhury + 1 more

Massively multilingual pre-trained language models, such as mBERT and XLM-RoBERTa, have received significant attention in the recent NLP literature for their excellent capability for cross-lingual zero-shot transfer of NLP tasks. This is especially promising because a large number of languages have no or very little labeled data for supervised learning. Moreover, substantially improved performance on low-resource languages without any significant degradation of accuracy for high-resource languages leads us to believe that these models will help attain a fairer distribution of language technologies despite the prevalent unfair and extremely skewed distribution of resources across the world's languages. Nevertheless, these models, and the experimental approaches adopted by researchers to arrive at them, have been criticised by some for lacking a nuanced and thorough comparison of benefits across languages and tasks. A related and important question that has received little attention is how to choose from a set of models when no single model significantly outperforms the others on all tasks and languages. As we discuss in this paper, this is often the case, and the choices are usually made without a clear articulation of reasons or underlying fairness assumptions. In this work, we scrutinize the choices made in previous work, and propose a few different strategies for fair and efficient model selection based on the principles of fairness in economics and social choice theory. In particular, we emphasize Rawlsian fairness, which provides an appropriate framework for making fair (with respect to languages, or tasks, or both) choices while selecting multilingual pre-trained language models for a practical or scientific set-up.
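The Rawlsian criterion discussed above is commonly operationalized as maximin selection: prefer the model whose worst-off language fares best. A small sketch, where the per-language scores are invented purely for illustration:

```python
def rawlsian_select(scores_by_model):
    """Maximin (Rawlsian) model selection: return the model whose
    worst-performing language has the highest score."""
    return max(scores_by_model, key=lambda m: min(scores_by_model[m].values()))

# Hypothetical per-language accuracies for two multilingual models.
scores = {
    "mBERT":       {"en": 0.91, "es": 0.85, "sw": 0.58},
    "XLM-RoBERTa": {"en": 0.89, "es": 0.86, "sw": 0.67},
}
best = rawlsian_select(scores)
# Chooses "XLM-RoBERTa": although it trails on English, its worst
# language (Swahili) scores higher than mBERT's worst.
```

Note how this differs from averaging, which would reward a model that excels on high-resource languages while neglecting low-resource ones.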

  • Research Article
  • Citations: 6
  • 10.1080/24751839.2023.2173843
Exploring zero-shot and joint training cross-lingual strategies for aspect-based sentiment analysis based on contextualized multilingual language models
  • Feb 16, 2023
  • Journal of Information and Telecommunication
  • Dang Van Thin + 3 more

Aspect-based sentiment analysis (ABSA) has attracted many researchers' attention in recent years. However, the lack of benchmark datasets for specific languages is a common challenge because of the prohibitive cost of manual annotation. The zero-shot cross-lingual strategy can be applied to address this gap in research. Moreover, previous works mainly focus on improving the performance of supervised ABSA with pre-trained language models. Therefore, there are few to no systematic comparisons of the benefits of multilingual models in zero-shot and joint training cross-lingual settings for the ABSA task. In this paper, we focus on the zero-shot and joint training cross-lingual transfer task for ABSA. We fine-tune the latest pre-trained multilingual language models on the source language and then apply them directly to the target language. For the joint learning scenario, the models are trained on the combination of multiple source languages. Our experimental results show that (1) fine-tuning multilingual models achieves promising performance in the zero-shot cross-lingual scenario; (2) fine-tuning models on the combined training data of multiple source languages outperforms monolingual data in the joint training scenario. Furthermore, the experimental results indicate that choosing other languages instead of English as the source language can give promising results in the low-resource languages scenario.

  • Research Article
  • 10.11591/ijai.v14.i2.pp1597-1604
A comparative study of natural language inference in Swahili using monolingual and multilingual models
  • Apr 1, 2025
  • IAES International Journal of Artificial Intelligence (IJ-AI)
  • Hajra Faki Ali + 1 more

Recent advancements in large language models (LLMs) have led to opportunities for improving applications across various domains. However, existing LLMs fine-tuned for Swahili or other African languages often rely on pre-trained multilingual models, resulting in a relatively small portion of training data dedicated to Swahili. In this study, we compare the performance of monolingual and multilingual models in Swahili natural language inference tasks using the cross-lingual natural language inference (XNLI) dataset. Our research demonstrates the superior effectiveness of dedicated Swahili monolingual models, achieving an accuracy rate of 69%. These monolingual models exhibit significantly enhanced precision, recall, and F1 scores, particularly in predicting contradiction and neutrality. Overall, the findings in this article emphasize the critical importance of using monolingual models in low-resource language processing contexts, providing valuable insights for developing more efficient and tailored natural language processing systems that benefit languages facing similar resource constraints.

  • Video Transcripts
  • 10.48448/qse7-hb71
Assessing the Syntactic Capabilities of Transformer-based Multilingual Language Models
  • Aug 1, 2021
  • Laura Pérez-Mayos + 3 more

Multilingual Transformer-based language models, usually pretrained on more than 100 languages, have been shown to achieve outstanding results in a wide range of cross-lingual transfer tasks. However, it remains unknown whether the optimization for different languages conditions the capacity of the models to generalize over syntactic structures, and how languages with syntactic phenomena of different complexity are affected. In this work, we explore the syntactic generalization capabilities of the monolingual and multilingual versions of BERT and RoBERTa. More specifically, we evaluate the syntactic generalization potential of the models on English and Spanish tests, comparing the syntactic abilities of monolingual and multilingual models on the same language (English), and of multilingual models on two different languages (English and Spanish). For English, we use the available SyntaxGym test suite; for Spanish, we introduce SyntaxGymES, a novel ensemble of targeted syntactic tests in Spanish, designed to evaluate the syntactic generalization capabilities of language models through the SyntaxGym online platform.

  • Conference Article
  • 10.5281/zenodo.6552938
Assessing the Syntactic Capabilities of Transformer-based Multilingual Language Models
  • May 10, 2021
  • Zenodo (CERN European Organization for Nuclear Research)
  • Laura Pérez-Mayos + 3 more


  • Conference Article
  • Citations: 1
  • 10.18653/v1/2021.findings-acl.333
Assessing the Syntactic Capabilities of Transformer-based Multilingual Language Models
  • Jan 1, 2021
  • Laura Pérez-Mayos + 3 more


  • Research Article
  • Citations: 3
  • 10.1017/nlp.2024.28
Hate speech detection in low-resourced Indian languages: An analysis of transformer-based monolingual and multilingual models with cross-lingual experiments
  • Aug 27, 2024
  • Natural Language Processing
  • Koyel Ghosh + 1 more

Warning: This paper addresses hate speech detection and may contain examples of abusive/offensive phrases. Cyberbullying, online harassment, etc., via offensive comments are pervasive across social media platforms like Twitter, Facebook, and YouTube. Hateful comments must be detected and eradicated to prevent harassment and violence on social media. In the Natural Language Processing (NLP) domain, the most prevalent task is comment classification, which is challenging, and transformer-based language models are at the forefront of this advancement. This paper analyzes the performance of transformer-based language models like BERT, ALBERT, RoBERTa, and DistilBERT on Indian hate speech datasets for binary classification. We utilize the existing datasets HASOC (Hindi and Marathi) and HS-Bangla, evaluating several multilingual language models such as MuRIL-BERT and XLM-RoBERTa and monolingual language models such as RoBERTa-Hindi, Maha-BERT (Marathi), Bangla-BERT (Bangla), and Assamese-BERT (Assamese), and also performing cross-lingual experiments. For further analysis, we conduct multilingual, monolingual, and cross-lingual experiments on our Hate Speech Assamese (HS-Assamese) (Indo-Aryan language family) and Hate Speech Bodo (HS-Bodo) (Sino-Tibetan language family) datasets (HS dataset version 2), achieving promising results. The motivation for the cross-lingual experiments is to encourage researchers to explore the power of the transformer. Note that no pre-trained language models are currently available for Bodo or any other Sino-Tibetan language.

  • Research Article
  • Citations: 9
  • 10.3389/fdgth.2024.1211564
Neural machine translation of clinical text: an empirical investigation into multilingual pre-trained language models and transfer-learning.
  • Feb 26, 2024
  • Frontiers in Digital Health
  • Lifeng Han + 5 more

Clinical text and documents contain very rich information and knowledge in healthcare, and processing them with state-of-the-art language technology is very important for building intelligent systems that support healthcare and social good. This processing includes creating language understanding models and translating resources into other natural languages to share domain-specific cross-lingual knowledge. In this work, we investigate clinical text machine translation by examining multilingual neural network models using deep learning, such as Transformer-based structures. Furthermore, to address the language resource imbalance issue, we also carry out experiments using a transfer-learning methodology based on massive multilingual pre-trained language models (MMPLMs). The experimental results on three sub-tasks, (1) clinical case (CC), (2) clinical terminology (CT), and (3) ontological concept (OC), show that our models achieved top-level performances in the ClinSpEn-2022 shared task on English-Spanish clinical domain data. Furthermore, our expert-based human evaluations demonstrate that the small-sized pre-trained language model (PLM) outperformed the two extra-large language models by a large margin in clinical-domain fine-tuning, a finding not previously reported in the field. Finally, the transfer-learning method works well in our experimental setting using the WMT21fb model to accommodate a new language, Spanish, that was not seen at the pre-training stage of WMT21fb itself; this deserves further exploration for clinical knowledge transformation, e.g., extending to more languages. These research findings can shed light on domain-specific machine translation development, especially in clinical and healthcare fields. Further research projects can build on our work to improve healthcare text analytics and knowledge transformation.
Our data is openly available for research purposes at: https://github.com/HECTA-UoM/ClinicalNMT.

  • Conference Article
  • Citations: 1
  • 10.18653/v1/2021.emnlp-main.668
Effective Fine-Tuning Methods for Cross-lingual Adaptation
  • Jan 1, 2021
  • Tao Yu + 1 more

Large scale multilingual pre-trained language models have shown promising results in zero- and few-shot cross-lingual tasks. However, recent studies have shown their lack of generalizability when the languages are structurally dissimilar. In this work, we propose a novel fine-tuning method based on co-training that aims to learn more generalized semantic equivalences as a complementary to multilingual language modeling using the unlabeled data in the target language. We also propose an adaption method based on contrastive learning to better capture the semantic relationship in the parallel data, when a few translation pairs are available. To show our method’s effectiveness, we conduct extensive experiments on cross-lingual inference and review classification tasks across various languages. We report significant gains compared to directly fine-tuning multilingual pre-trained models and other semi-supervised alternatives.
