Integrating vision and language: a novel approach for translation of low-resource Indic languages

Abstract

Cross-lingual learning provides an excellent opportunity for knowledge transfer across multiple languages. However, the substantial resource disparity between high- and low-resource languages creates considerable challenges. This study focuses on two Indic language families, Indo-Aryan and Dravidian, as well as a "definitely endangered" low-resource language, all of which lack the extensive training data available for high-resource languages such as English. We present a novel approach termed Resource-Aware Multimodal Translation (RAMT), which combines large language models with vision-based character recognition to improve translation efficacy across a range of resource levels. RAMT uses the Continuous Wavelet Transform to convert low-resource text into a spatial representation, enabling a plug-and-play training process. This streamlines training across multiple languages, reducing reliance on large datasets and enhancing model portability. In addition, our method captures both sequential dependencies and spatial properties of the text, improving stroke extraction and the modeling of inter-character interactions. Empirical assessments on seven languages show considerable gains in both performance and processing speed, demonstrating RAMT's usefulness in bridging the resource gap in cross-lingual applications. Our findings indicate that this integrated technique promotes more equitable language processing solutions, paving the way for improved access and comprehension in low-resource linguistic environments.
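The abstract does not detail how RAMT's wavelet step works, so the following is only a minimal sketch of the general idea it describes: treating text as a 1-D signal and expanding it into a 2-D spatial map with a Continuous Wavelet Transform. The code-point encoding, the Ricker wavelet, and the width grid are all illustrative assumptions here, not the paper's actual design.

```python
import numpy as np

def ricker(points, a):
    """Ricker (Mexican-hat) wavelet sampled at `points` positions, width `a`."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1.0 - (t / a) ** 2) * np.exp(-(t ** 2) / (2.0 * a ** 2))

def cwt_scalogram(signal, widths):
    """Continuous wavelet transform: one row of coefficients per width."""
    out = np.empty((len(widths), len(signal)))
    for i, a in enumerate(widths):
        n = min(10 * int(a), len(signal))
        out[i] = np.convolve(signal, ricker(n, a), mode="same")
    return out

# Treat a sentence as a 1-D signal of code points -- a crude stand-in for
# the paper's stroke-level encoding, which the abstract does not specify.
text = "नमस्ते दुनिया"
signal = np.array([ord(c) for c in text], dtype=float)
signal -= signal.mean()  # remove the DC offset before filtering

scalogram = cwt_scalogram(signal, widths=np.arange(1, 5))
print(scalogram.shape)  # one row per scale, one column per character
```

The resulting (num_scales, sequence_length) map is the kind of spatial representation a vision-style model can consume directly.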

Similar Papers
  • Research Article
  • 10.1145/3783981
How Much Data in Low-resource Indian Languages is "Sufficient" for Transfer Learning: A Comparative Study for POS Annotation
  • Dec 12, 2025
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Mohit Raj + 1 more

Recent advances in machine learning and deep learning have demonstrated the applicability and utility of cross-lingual, transfer-learning methods in low- and zero-resource scenarios. We explore the applicability of transfer learning methods from pre-trained models in zero-shot and few-shot scenarios for part-of-speech tagging. We report the results of an ablation study to understand the impact of training data size in low-resource languages on the system's performance. Since building or augmenting datasets for low-resource languages is tricky, costly, and often infeasible, the study provides valuable insights into the expected relative data requirements for both the high-resource language (the source language for transfer) and the low-resource language, and the kind of performance boost one could expect when planning to use transfer learning for low-resource languages. The study is conducted with Hindi as the high-resource language and three related languages - Magahi, Bhojpuri and Braj - as extremely low-resource languages. Overall, the study addresses four broad research questions: (a) How much data in the low-resource as well as high-resource language is "sufficient" for attaining optimum performance in a downstream task like part-of-speech annotation, and is there any specific advantage for the low-resource language if we use multilingual data during fine-tuning? (b) Do different multilingual pre-trained models, specifically multilingual-BERT, multilingual-DistilBERT, XLM-RoBERTa, and MuRIL, offer any significant advantage in terms of dataset requirements for attaining optimum performance in Indian languages? (c) In the case of multiple closely related low-resource languages, does distributing the dataset across multiple languages result in performance comparable to that of a system trained on a single language? (d) What is the impact of the typological similarity of the languages on the dataset requirement for successful transfer learning?

  • Research Article
  • Citations: 8
  • 10.1145/3689735
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs
  • Oct 8, 2024
  • Proceedings of the ACM on Programming Languages
  • Federico Cassano + 9 more

Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on high-resource programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available (e.g., OCaml, Racket, and several others). This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, called MultiPL-T, generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. MultiPL-T translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize unit tests for commented code from a high-resource source language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate the code from the high-resource source language to a target low-resource language. This gives us a corpus of candidate training data in the target language, but many of these translations are wrong. 3) We use a lightweight compiler to compile the test cases generated in (1) from the source language to the target language, which allows us to filter out obviously wrong translations. The result is a training corpus in the target low-resource language where all items have been validated with test cases. We apply this approach to generate tens of thousands of new, validated training items for five low-resource languages: Julia, Lua, OCaml, R, and Racket, using Python as the source high-resource language.
Furthermore, we use an open Code LLM (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. Using datasets generated with MultiPL-T, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket that outperform other fine-tunes of these base models on the natural language to code task. We also present Racket fine-tunes for two very recent models, DeepSeek Coder and StarCoder2, to show that MultiPL-T continues to outperform other fine-tuning approaches for low-resource languages. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
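The three MultiPL-T steps above amount to a generate-and-filter loop: synthesize tests, generate candidate translations, and keep only candidates that pass. In the toy sketch below, the Code LLM calls are replaced by hard-coded stubs (`synthesize_tests` and `candidate_translations` are hypothetical stand-ins, and both "languages" are Python), so only the validation logic of step (3) is faithfully shown:

```python
def synthesize_tests(fn_name):
    # Stub for step (1): in MultiPL-T a Code LLM writes unit tests;
    # here a single hard-coded assertion plays that role.
    return f"assert {fn_name}(3) == 6"

def candidate_translations(src):
    # Stub for step (2): a Code LLM would emit several candidate
    # translations, some of them wrong. We fake one correct and one buggy.
    return [src, src.replace("* 2", "* 3")]

def passes(code, test):
    # Step (3): run the translated tests against a candidate; keep only
    # candidates whose tests pass.
    env = {}
    try:
        exec(code, env)
        exec(test, env)
        return True
    except Exception:
        return False

source = "def double(x):\n    return x * 2"
test = synthesize_tests("double")
validated = [c for c in candidate_translations(source) if passes(c, test)]
print(len(validated))  # 1: only the faithful translation survives
```

The real pipeline performs the same filtering across languages with compiled test suites rather than `exec`.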

  • Research Article
  • 10.55041/ijsrem7648
Multilingual NLP: Techniques for Creating Models that Understand and Generate Multiple Languages with Minimal Resources
  • Dec 30, 2024
  • INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Gaurav Kashyap

Models that can process human language in a variety of applications have been developed as a result of the quick development of natural language processing (NLP). Scaling NLP technologies to support multiple languages with minimal resources is still a major challenge, even though many models work well in high-resource languages. By developing models that can comprehend and produce text in multiple languages, especially those with little linguistic information, multilingual natural language processing (NLP) seeks to overcome this difficulty. This study examines the methods used in multilingual natural language processing (NLP), such as data augmentation, transfer learning, and multilingual pre-trained models. It also talks about the innovations and trade-offs involved in developing models that can effectively handle multiple languages with little effort. Many low-resource languages have been underserved by the quick advances in natural language processing, which have mostly benefited high-resource languages. The methods for creating multilingual NLP models that can efficiently handle several languages with little resource usage are examined in this paper. We discuss unsupervised morphology-based approaches to expand vocabularies, the importance of community involvement in low-resource language technology, and the limitations of current multilingual models. With the creation of strong language models capable of handling a variety of tasks, the field of natural language processing has advanced significantly in recent years. But not all languages have benefited equally from the advancements, with high-resource languages like English receiving disproportionate attention. [9] As a result, there are huge differences in the performance and accessibility of natural language processing (NLP) systems for the languages spoken around the world, many of which are regarded as low-resource. 
Researchers have looked into a number of methods for developing multilingual natural language processing (NLP) models that can comprehend and produce text in multiple languages with little effort in order to rectify this imbalance. Using unsupervised morphology-based techniques to increase the vocabulary of low-resource languages is one promising strategy. Keywords: Multilingual NLP, Low-resource Languages, Morphology, Vocabulary Expansion, Creole Languages

  • Research Article
  • 10.1016/j.dib.2025.112005
MDER-MA: A multimodal dataset for emotion recognition in low-resource Moroccan Arabic language
  • Aug 25, 2025
  • Data in Brief
  • Soufiyan Ouali + 1 more


  • Research Article
  • Citations: 11
  • 10.1109/access.2022.3141200
Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages
  • Jan 1, 2022
  • IEEE Access
  • Kurniawati Azizah + 1 more

Deep neural network (DNN)-based systems generally require large amounts of training data, so they have data scarcity problems in low-resource languages. Recent studies have succeeded in building zero-shot multi-speaker DNN-based TTS on high-resource languages, but they still have unsatisfactory performance on unseen speakers. This study addresses two main problems: overcoming the problem of data scarcity in DNN-based TTS on low-resource languages and improving the performance of zero-shot speaker adaptation for unseen speakers. We propose a novel multi-stage transfer learning strategy using partial network-based deep transfer learning to overcome the low-resource problem by utilizing a pre-trained monolingual single-speaker TTS and d-vector speaker encoder on a high-resource language as the source domain. Meanwhile, to improve the performance of zero-shot speaker adaptation, we propose a new TTS model that incorporates explicit style control from the target speaker for TTS conditioning and an utterance-level speaker reconstruction loss during TTS training. We use publicly available speech datasets for experiments. We show that our proposed training strategy is able to effectively train the TTS models using a limited amount of training data in low-resource target languages. The models trained using the proposed transfer learning successfully produce intelligible natural speech sounds, while in contrast standard training fails to make the models synthesize understandable speech. We also demonstrate that our proposed style encoder network and speaker reconstruction loss significantly improve speaker similarity in the zero-shot speaker adaptation task compared to the baseline model. Overall, our proposed TTS model and training strategy succeed in increasing the speaker cosine similarity of the synthesized speech on the unseen speakers test set by 0.468 and 0.279 in native and foreign languages respectively.

  • Research Article
  • Citations: 14
  • 10.1145/3314945
Multi-Round Transfer Learning for Low-Resource NMT Using Multiple High-Resource Languages
  • May 21, 2019
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Mieradilijiang Maimaiti + 3 more

Neural machine translation (NMT) has made remarkable progress in recent years, but the performance of NMT suffers from a data sparsity problem since large-scale parallel corpora are only readily available for high-resource languages (HRLs). Recently, transfer learning (TL) has been widely used in low-resource language (LRL) machine translation, and TL is becoming one of the vital directions for addressing the data sparsity problem in low-resource NMT. As a solution, a transfer learning method in NMT is generally obtained by initializing the low-resource model (child) with the high-resource model (parent). However, the original TL applied to low-resource models is able neither to make full use of multiple highly related HRLs nor to receive different parameters from the same parents. In order to exploit multiple HRLs effectively, we present a language-independent and straightforward multi-round transfer learning (MRTL) approach to low-resource NMT. Besides, with the intention of reducing the differences between high-resource and low-resource languages at the character level, we introduce a unified transliteration method for various language families that are both semantically and syntactically highly analogous to each other. Experiments on low-resource datasets show that our approaches are effective, significantly outperform the state-of-the-art methods, and yield improvements of up to 5.63 BLEU points.

  • Conference Article
  • Citations: 15
  • 10.1109/icassp.2018.8462083
Analysis of Multilingual BLSTM Acoustic Model on Low and High Resource Languages
  • Apr 1, 2018
  • Martin Karafiát + 5 more

The paper provides an analysis of automatic speech recognition (ASR) systems based on multilingual BLSTM, where we used multi-task training with a separate classification layer for each language. The focus is on low-resource languages, where only a limited amount of transcribed speech is available. In such a scenario, we found it essential to train the ASR systems in a multilingual fashion, and we report superior results obtained with a pre-trained multilingual BLSTM on this task. The high-resource languages are also taken into account and we show the importance of language richness for multilingual training. Next, we present the performance of this technique as a function of the amount of target language data. The importance of including context information in BLSTM multilingual systems is also stressed, and we report increased resilience of large NNs to overtraining in the case of multi-task training.

  • Research Article
  • Citations: 65
  • 10.1007/s10590-017-9203-5
Neural machine translation for low-resource languages without parallel corpora
  • Nov 7, 2017
  • Machine Translation
  • Alina Karakanta + 2 more

The problem of a total absence of parallel data is present for a large number of language pairs and can severely degrade the quality of machine translation. We describe a language-independent method to enable machine translation between a low-resource language (LRL) and a third language, e.g. English. We deal with cases of LRLs for which there is no readily available parallel data between the low-resource language and any other language, but there is ample training data between a closely related high-resource language (HRL) and the third language. We take advantage of the similarities between the HRL and the LRL in order to transform the HRL data into data similar to the LRL using transliteration. The transliteration models are trained on transliteration pairs extracted from Wikipedia article titles. Then, we automatically back-translate monolingual LRL data with the models trained on the transliterated HRL data and use the resulting parallel corpus to train our final models. Our method achieves significant improvements in translation quality, close to the results that can be achieved by a general-purpose neural machine translation system trained on a significant amount of parallel data. Moreover, the method does not rely on the existence of any parallel data for training, but attempts to bootstrap already existing resources in a related language.
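The paper learns its transliteration models from Wikipedia title pairs; as a much simpler illustration of the HRL-to-LRL transformation idea, the sketch below applies a hand-written character mapping. The mapping and sentences are invented for the example and do not correspond to any real language pair in the paper:

```python
# Toy character-level transliteration: transform text in a high-resource
# language's orthography toward a closely related low-resource language.
# The correspondences below are hypothetical; the paper learns such
# mappings from Wikipedia article-title pairs instead.
HRL_TO_LRL = str.maketrans({
    "á": "a",
    "é": "e",
    "š": "sh",
    "č": "ch",
})

def transliterate(text: str) -> str:
    return text.translate(HRL_TO_LRL)

hrl_sentence = "čaj séš"
print(transliterate(hrl_sentence))  # prints "chaj sesh"
```

Applying such a transform to the whole HRL side of a parallel corpus yields pseudo-LRL training data, which the paper then refines with back-translation.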

  • Research Article
  • Citations: 9
  • 10.5281/zenodo.3525486
Adapting Multilingual Neural Machine Translation to Unseen Languages
  • Oct 30, 2019
  • Surafel M Lakew + 4 more

Multilingual Neural Machine Translation (MNMT) for low-resource languages (LRL) can be enhanced by the presence of related high-resource languages (HRL), but the relatedness of HRL usually relies on predefined linguistic assumptions about language similarity. Recently, adapting MNMT to a LRL has been shown to greatly improve performance. In this work, we explore the problem of adapting an MNMT model to an unseen LRL using data selection and model adaptation. In order to improve NMT for LRL, we employ perplexity to select HRL data that are most similar to the LRL on the basis of language distance. We extensively explore data selection in popular multilingual NMT settings, namely in (zero-shot) translation, and in adaptation from a multilingual pre-trained model, for both directions (LRL↔en). We further show that dynamic adaptation of the model's vocabulary results in a more favourable segmentation for the LRL in comparison with direct adaptation. Experiments show reductions in training time and significant performance gains over LRL baselines, even with zero LRL data (+13.0 BLEU), up to +17.0 BLEU for pre-trained multilingual model dynamic adaptation with related data selection. Our method outperforms current approaches, such as massively multilingual models and data augmentation, on four LRL.
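As a rough illustration of perplexity-based data selection (far simpler than the paper's NMT-scale setup), the sketch below trains a tiny add-one-smoothed character-bigram language model on LRL text and keeps the HRL sentences the model finds least surprising. The bigram model and all sentences are toy assumptions made for the example:

```python
import math
from collections import Counter

def char_bigram_lm(corpus):
    """Fit an add-one-smoothed character-bigram LM; return a perplexity fn."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        s = f"^{sent}$"          # boundary markers
        unigrams.update(s[:-1])
        bigrams.update(zip(s, s[1:]))
    vocab = len(set("".join(corpus)) | {"^", "$"})

    def perplexity(sent):
        s = f"^{sent}$"
        logp = sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
            for a, b in zip(s, s[1:])
        )
        return math.exp(-logp / (len(s) - 1))

    return perplexity

# The LRL sample defines the model; HRL sentences scoring the lowest
# perplexity (i.e., most LRL-like) are selected for transfer.
lrl = ["abab ab", "aba bab"]
hrl = ["ab ab ab", "xyz qrs", "ba ba"]
ppl = char_bigram_lm(lrl)
selected = sorted(hrl, key=ppl)[:2]
print(selected)  # the character-wise dissimilar sentence is dropped
```

The paper applies the same principle with neural LMs and full corpora, ranking HRL data by similarity to the LRL.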

  • Research Article
  • Citations: 15
  • 10.1016/j.procs.2023.01.242
A Voyage on Neural Machine Translation for Indic Languages
  • Jan 1, 2023
  • Procedia Computer Science
  • Shailashree K Sheshadri + 2 more


  • Conference Article
  • Citations: 5
  • 10.1109/iccit57492.2022.10055705
Aspect-Based Sentiment Analysis of Bangla Comments on Entertainment Domain
  • Dec 17, 2022
  • Nasrin Sultana + 3 more

Low-resource natural language processing is getting more attention nowadays. Aspect-Based Sentiment Analysis (ABSA) in a high-resource language such as English is relatively straightforward because of sufficient datasets and experimentation tools. However, Aspect-Based Sentiment Analysis in low-resource languages such as Bangla is quite hard, so many researchers are investing their time and knowledge in low-resource natural language processing. In this paper, we propose a Bangla Aspect-Based Sentiment Analysis model using Bangla natural language processing. We collected 4012 Bangla text comments related to cricket, drama, movies, and music from YouTube. We applied several prominent supervised machine learning techniques, such as Support Vector Classifier (SVC), Random Forest (RF), and Linear Regression (LR). We achieved more than 75% accuracy in classifying positive, negative, and neutral sentiments and 80% accuracy in extracting aspects from Bangla texts. Finally, we used publicly available datasets to test our proposed model's generalizability and found that our approach surpasses earlier related research.

  • Conference Article
  • 10.1109/icassp43922.2022.9746120
Wasserstein Cross-Lingual Alignment For Named Entity Recognition
  • May 23, 2022
  • Rui Wang + 1 more

Supervised training of Named Entity Recognition (NER) models generally requires large amounts of annotations, which are hardly available for less widely used (low-resource) languages, e.g., Armenian and Dutch. It is therefore desirable to leverage knowledge extracted from a high-resource (source) language, e.g., English, so that NER models for the low-resource (target) languages can be trained more efficiently and with lower annotation cost. In this paper, we study cross-lingual alignment for NER, an approach for transferring knowledge from high- to low-resource languages via the alignment of token embeddings between different languages. Specifically, we propose to align by minimizing the Wasserstein distance between the contextualized token embeddings from source and target languages. Experimental results show that our method yields improved performance over existing works for cross-lingual alignment in NER tasks.
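To give a concrete feel for comparing two embedding clouds with a Wasserstein-style distance, the sketch below uses the sliced-Wasserstein approximation (average of exact 1-D Wasserstein-1 distances over random projections) on synthetic "embeddings". This is a tractable stand-in chosen for illustration; the paper's exact formulation and optimization may differ:

```python
import numpy as np

def wasserstein_1d(u, v):
    """Exact 1-D Wasserstein-1 distance between equal-size samples:
    sort both samples and average the absolute differences."""
    return np.mean(np.abs(np.sort(u) - np.sort(v)))

def sliced_wasserstein(X, Y, n_proj=64, seed=0):
    """Sliced approximation for d-dimensional embeddings: project both
    clouds onto random directions and average the 1-D distances."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        total += wasserstein_1d(X @ w, Y @ w)
    return total / n_proj

rng = np.random.default_rng(1)
src = rng.normal(0.0, 1.0, size=(200, 8))   # "source-language" embeddings
tgt = rng.normal(0.5, 1.0, size=(200, 8))   # mean-shifted "target" cloud
print(sliced_wasserstein(src, tgt) > sliced_wasserstein(src, src))  # prints True
```

Minimizing such a distance during training pulls the target token embeddings toward the source distribution, which is the alignment idea the abstract describes.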

  • Book Chapter
  • Citations: 2
  • 10.1007/978-3-030-22354-0_17
Transferring Informal Text in Arabic as Low Resource Languages: State-of-the-Art and Future Research Directions
  • Jun 21, 2019
  • Ebtesam H Almansor + 2 more

Rapid growth in internet technology has led to increased usage of social media platforms, which make communication between users easier. In this communication, users employ their everyday language, which is considered non-standard. Non-standard text contains a lot of noise, such as abbreviations and slang, which are common in English, and dialect words, which are widely used in Arabic. Such texts are challenging for any natural language processing tool and therefore need to be treated and transferred into a form close to their standard equivalent. Accordingly, normalization and translation approaches have been used to transfer informal text. However, these approaches require large labeled or parallel datasets. While high-resource languages such as English have enough parallel data, low-resource languages such as Arabic lack sufficient parallel datasets. In this paper, we therefore focus on Arabic and Arabic dialects as low-resource languages in the context of transferring non-standard text using normalization and translation approaches.

  • Research Article
  • Citations: 9
  • 10.1148/radiol.241736
Large Language Model Ability to Translate CT and MRI Free-Text Radiology Reports Into Multiple Languages.
  • Dec 1, 2024
  • Radiology
  • Aymen Meddeb + 23 more

Background High-quality translations of radiology reports are essential for optimal patient care. Because of limited availability of human translators with medical expertise, large language models (LLMs) are a promising solution, but their ability to translate radiology reports remains largely unexplored. Purpose To evaluate the accuracy and quality of various LLMs in translating radiology reports across high-resource languages (English, Italian, French, German, and Chinese) and low-resource languages (Swedish, Turkish, Russian, Greek, and Thai). Materials and Methods A dataset of 100 synthetic free-text radiology reports from CT and MRI scans was translated by 18 radiologists between January 14 and May 2, 2024, into nine target languages. Ten LLMs, including GPT-4 (OpenAI), Llama 3 (Meta), and Mixtral models (Mistral AI), were used for automated translation. Translation accuracy and quality were assessed with use of BiLingual Evaluation Understudy (BLEU) score, translation error rate (TER), and CHaRacter-level F-score (chrF++) metrics. Statistical significance was evaluated with use of paired t tests with Holm-Bonferroni corrections. Radiologists also conducted a qualitative evaluation of translations with use of a standardized questionnaire. Results GPT-4 demonstrated the best overall translation quality, particularly from English to German (BLEU score: 35.0 ± 16.3 [SD]; TER: 61.7 ± 21.2; chrF++: 70.6 ± 9.4), to Greek (BLEU: 32.6 ± 10.1; TER: 52.4 ± 10.6; chrF++: 62.8 ± 6.4), to Thai (BLEU: 53.2 ± 7.3; TER: 74.3 ± 5.2; chrF++: 48.4 ± 6.6), and to Turkish (BLEU: 35.5 ± 6.6; TER: 52.7 ± 7.4; chrF++: 70.7 ± 3.7). GPT-3.5 showed highest accuracy in translations from English to French, and Qwen1.5 excelled in English-to-Chinese translations, whereas Mixtral 8x22B performed best in Italian-to-English translations. 
The qualitative evaluation revealed that LLMs excelled in clarity, readability, and consistency with the original meaning but showed moderate medical terminology accuracy. Conclusion LLMs showed high accuracy and quality for translating radiology reports, although results varied by model and language pair. © RSNA, 2024 Supplemental material is available for this article.
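BLEU, the main metric in the study above, is computed from clipped n-gram precision and a brevity penalty. The following is a minimal unsmoothed sentence-level version for intuition only; real evaluations use corpus-level BLEU with standardized tokenization (e.g., via the sacreBLEU toolkit), and the example sentences are invented:

```python
import math
from collections import Counter

def sentence_bleu(reference, hypothesis, max_n=4):
    """Plain sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty (no smoothing)."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
        hyp_ngrams = Counter(zip(*[hyp[i:] for i in range(n)]))
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(len(hyp) - n + 1, 0)
        if overlap == 0 or total == 0:
            return 0.0          # unsmoothed: any empty precision zeroes BLEU
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

ref = "the scan shows no acute intracranial hemorrhage"
print(round(sentence_bleu(ref, ref), 1))        # identical strings score 100.0
print(sentence_bleu(ref, "no findings") == 0.0) # no 2-gram overlap: score 0
```

TER and chrF++ follow the same spirit (edit distance over words and character-level F-score, respectively) but require more machinery than fits in a sketch.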

  • Research Article
  • Citations: 34
  • 10.26599/tst.2020.9010029
Enriching the transfer learning with pre-trained lexicon embedding for low-resource neural machine translation
  • Feb 1, 2022
  • Tsinghua Science and Technology
  • Mieradilijiang Maimaiti + 3 more

Most State-Of-The-Art (SOTA) Neural Machine Translation (NMT) systems today achieve outstanding results based only on large parallel corpora. Large-scale parallel corpora for high-resource languages are easily obtainable. However, the translation quality of NMT for morphologically rich languages is still unsatisfactory, mainly because of the data sparsity problem encountered in Low-Resource Languages (LRLs). In the low-resource NMT paradigm, Transfer Learning (TL) has developed into one of the most efficient methods. It is difficult for a model trained on high-resource languages to capture the information of both parent and child languages, since the initially trained model contains only the lexicon features and word embeddings of the parent model rather than those of the child language. In this work, we aim to address this issue by proposing the language-independent Hybrid Transfer Learning (HTL) method for LRLs, sharing lexicon embeddings between parent and child languages without leveraging back-translation or manually injecting noise. First, we train the High-Resource Language (HRL) as the parent model with its vocabulary. Then, we combine the parent and child language pairs using an oversampling method to train the hybrid model, initialized from the previously trained parent model. Finally, we fine-tune the morphologically rich child model using the hybrid model. Besides, we report some interesting findings on the original TL approach. Experimental results show that our model consistently outperforms five SOTA methods on two languages, Azerbaijani (Az) and Uzbek (Uz). Meanwhile, our approach is practical and significantly better, achieving improvements of up to 4.94 and 4.84 BLEU points for the low-resource child language directions Az→Zh and Uz→Zh, respectively.
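The "combine the parent and child language pairs using an oversampling method" step can be sketched as simple sampling with replacement on the smaller corpus until the two sides balance. The function name, corpus sizes, and sentence pairs below are illustrative assumptions, not the paper's actual procedure:

```python
import random

def oversample_mix(parent_pairs, child_pairs, seed=0):
    """Build a balanced hybrid corpus: repeat child pairs (sampled with
    replacement) until they match the parent corpus size, then shuffle."""
    rng = random.Random(seed)
    deficit = len(parent_pairs) - len(child_pairs)
    upsampled = child_pairs + [rng.choice(child_pairs) for _ in range(deficit)]
    mixed = parent_pairs + upsampled
    rng.shuffle(mixed)
    return mixed

parent = [("hrl src %d" % i, "tgt %d" % i) for i in range(1000)]
child = [("lrl src %d" % i, "tgt %d" % i) for i in range(50)]
mixed = oversample_mix(parent, child)
print(len(mixed))  # 2000: parent and upsampled child contribute equally
```

Balancing the mix this way keeps the low-resource child language from being drowned out by the much larger parent corpus during hybrid training.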
