  • New
  • Research Article
  • 10.1145/3816029
Mitigating Implicit Bias in Chinese Toxic Speech Detection via Unbiased Contrastive Learning
  • May 11, 2026
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Xuan Feng + 4 more

Mitigating human-like biases and social stereotypes in pre-trained language models (PLMs) has become a crucial task in Chinese toxic speech detection. While PLMs have achieved state-of-the-art results in mitigating explicit bias (e.g., bias on sensitive attribute words), implicit bias (e.g., bias against specific demographic groups) remains under-explored. Moreover, existing debiasing methods focus on the trade-off between debiasing efficiency and model performance while ignoring robustness against noisy data. A debiasing method that effectively reduces bias while remaining robust to noisy data is therefore needed. In this paper, we propose Unbiased Contrastive Learning (UCL), which mitigates explicit and implicit bias while maintaining robustness to noisy data. Specifically, we first analyze the bias representation problem constrained by the contrastive learning objective and then implement an unbiased contrastive objective for learning unbiased text representations, mitigating explicit and implicit biases in Chinese toxic speech detection tasks. UCL inherits the idea of supervised contrastive learning, which encourages representations of the same sensitive attribute to be closer than those of different sensitive attributes, and ensures unbiasedness by penalizing the sensitive attribute information contained in the representations. Furthermore, we design conditional normalization to reduce biased classification caused by the imbalanced distribution of demographic groups in the data. Experimental results on Chinese and English datasets show that the proposed method outperforms state-of-the-art methods and achieves competitive performance.
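
The abstract above builds on supervised contrastive learning. As a rough illustration of that starting point only, the sketch below implements the standard supervised contrastive loss in plain Python. It is not the paper's UCL objective: the penalty on sensitive-attribute information and the conditional normalization are not modelled, and the function name, the `temperature` default, and the toy inputs are assumptions made for this example.

```python
import math

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Standard supervised contrastive loss over one batch (illustrative only).

    embeddings: list of non-zero vectors; labels: one group label per vector.
    Samples sharing a label are treated as positives for each other.
    """
    # L2-normalise so that dot products are cosine similarities.
    unit = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm for x in vec])

    losses = []
    for i, anchor in enumerate(unit):
        # Temperature-scaled similarity to every other sample in the batch.
        sims = {
            j: sum(a * b for a, b in zip(anchor, other)) / temperature
            for j, other in enumerate(unit)
            if j != i
        }
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Log-sum-exp over all non-anchor samples, stabilised by the max.
        top = max(sims.values())
        log_denom = top + math.log(sum(math.exp(s - top) for s in sims.values()))
        # Average negative log-probability of the positives for this anchor.
        losses.append(-sum(sims[j] - log_denom for j in positives) / len(positives))

    return sum(losses) / len(losses) if losses else 0.0
```

The loss is small when samples that share a label sit close together in embedding space and grows when they are pulled apart, which is the behaviour the abstract refers to.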

  • New
  • Research Article
  • 10.1145/3813805
An Efficient Hybrid Deep Learning Approach for Translating Sanskrit Shlokas into Malayalam with Linguistic Preprocessing
  • May 5, 2026
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Sreedeepa H S + 1 more

Machine translation has increasingly shifted toward Neural Machine Translation (NMT) because of its ability to handle input and output sequences of varying lengths. The incorporation of attention mechanisms in NMT systems enables the model to focus on the most relevant parts of the source sentence, rather than relying solely on a fixed representation of the entire input. While NMT improves translation quality by addressing long-range dependencies and contextual understanding, it also requires a large parallel corpus for training, which is a challenge for languages with fewer resources. The main focus of this research is to address the unique challenges of translating Ayurvedic texts using NMT. Ayurvedic texts contain a collection of specialized and scientific words related to medicines and treatments, which makes the translation process more complex and calls for a highly efficient approach to produce accurate translations. In addition, the content of Ayurvedic textbooks is in the form of shlokas, which are formed using very complex compound words. To simplify the translation process, this work uses a sandhi splitter module and an Anvaya Generator (word reordering) module. Developing an NMT system for the low-resource language pair Sanskrit-Malayalam requires a parallel corpus, especially for Ayurvedic textbooks. Because the proposed NMT model requires a minimum amount of parallel data, a number of general-domain Sanskrit textbooks with verses, called shlokas, were also considered for developing the parallel corpora. The authors developed parallel corpora for the Anvaya Generator, the sandhi splitter and translation. Four NMT models were developed, trained and tested specifically with shlokas as input. Two of the models are a basic transformer model with attention and an encoder-decoder model using Long Short-Term Memory (LSTM) with attention. The other two were developed by adding the Sandhi Splitter and Anvaya Generator modules to the preprocessing stages of the transformer-based and LSTM-based models. The limitations of low resources and the grammatical richness of the Sanskrit-Malayalam language pair are overcome by deep learning and the additional modules used in the preprocessing stages. The models were tested with and without the sandhi splitter and Anvaya Generator modules. The transformer-based model integrated with the sandhi splitter and Anvaya Generator achieved a higher average BLEU score of 73.11 and a unigram BLEU score of 76.93 for Sanskrit verse to Malayalam translation.

  • New
  • Research Article
  • 10.1145/3813800
Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages
  • May 5, 2026
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Leena G Pillai + 3 more

This paper presents a novel multistage fine-tuning strategy designed to enhance automatic speech recognition (ASR) performance in low-resource languages using OpenAI’s Whisper model. In this approach, we aim to build ASR models for languages with limited digital resources by sequentially adapting the model across linguistically similar languages. We evaluated this approach on Malasar, a Dravidian language spoken by approximately ten thousand people in the Western Ghats of South India. The development of speech recognition for Malasar faces significant barriers due to its lack of a written form and the absence of the digital speech datasets needed for training. Working in collaboration with Wycliffe India and Malasar community members, we created a spoken Malasar corpus paired with transcriptions in the script of Tamil, a closely related major language. To build an ASR model for Malasar, we first built an intermediate Tamil ASR model, leveraging the greater availability of annotated Tamil speech. This intermediate model is subsequently fine-tuned on Malasar data, allowing for more effective ASR adaptation despite limited resources. The multistage fine-tuning strategy demonstrated significant improvements over direct fine-tuning on Malasar data alone, achieving a word error rate (WER) of 51.9%, a 4.5% absolute reduction compared to the direct fine-tuning method. A further WER reduction to 47.3% was achieved through punctuation removal in post-processing, which addresses formatting inconsistencies that impact evaluation. Our results underscore the effectiveness of sequential multistage fine-tuning combined with targeted post-processing as a scalable strategy for ASR system development in low-resource languages, especially where linguistic similarities can be leveraged to bridge gaps in training data.
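
To make the effect of punctuation removal on WER concrete, here is a minimal sketch of a word-level WER computation with an optional punctuation-stripping step. It is an illustration under stated assumptions rather than the authors' evaluation script: the function names are hypothetical, tokens are split on whitespace, and only ASCII punctuation (Python's `string.punctuation`) is removed.

```python
import string

def tokenize(text, strip_punctuation=False):
    """Split on whitespace, optionally dropping ASCII punctuation first."""
    if strip_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

def word_error_rate(reference, hypothesis, strip_punctuation=False):
    """WER = word-level edit distance divided by the reference length."""
    ref = tokenize(reference, strip_punctuation)
    hyp = tokenize(hypothesis, strip_punctuation)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row-by-row dynamic programme for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

With whitespace tokenization, a hypothesis such as `"hello, world."` scored against the reference `"hello world"` counts both words as substitutions unless punctuation is stripped first, which is the kind of formatting inconsistency the abstract mentions.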

  • New
  • Research Article
  • 10.1145/3811819
Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance
  • Apr 24, 2026
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Weihua Zheng + 7 more

Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages and the monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, improve cross-lingual performance but often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task in the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM’s embedding space, improving both language generation and comprehension. We further introduce a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves up to 11.9 BLEU score gains in MT, an increase of 6.72 in CLQA BERTScore-Precision and more than a 5% increase in CLNLU accuracy over strong multilingual baselines. Our findings highlight the potential of embedding cross-lingual objectives into pre-training, improving multilingual LLMs.

  • New
  • Research Article
  • 10.1145/3793249
Improving Chinese Text Recognition with Multi-Granularity Features and Vision-Language Reasoning
  • Apr 14, 2026
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Tao Li + 1 more

Accurate Chinese text recognition (CTR) is vital for applications such as document digitization, but remains challenging due to high inter-class visual similarity, complex hierarchical structures, and diverse visual degradations in real-world scenes. To address these challenges, we propose a robust CTR framework that synergizes structure-aware visual discrimination with cross-modal reasoning. The hybrid encoder integrates multi-scale attention modulation and global self-attention, guided by hierarchical structural supervision, to capture fine-grained structure-aware visual representations essential for distinguishing visually similar characters. Complementing this, an iterative vision-language decoder, trained with a stochastic masking strategy, learns to reconstruct text from partially observed visual and contextual cues, enabling complementary vision-language reasoning that effectively resolves visually ambiguous or degraded characters. Extensive experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance, validating its effectiveness in addressing the challenges of Chinese text recognition.

  • New
  • Research Article
  • 10.1145/3806042
Heterogeneous Text Style Control Using Prompts
  • Apr 14, 2026
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Yafu Li + 4 more

Advancements in natural language processing (NLP) have markedly improved paraphrase generation, an essential task for numerous applications. However, current methods face limitations due to model and constraint specificity, which hinder their flexibility and practical deployment. In this work, we introduce a unified prompt-driven approach to paraphrase generation that leverages diverse prompts, enabling fine-grained user control over aspects such as syntax and sentiment. Moreover, we incorporate translation to enable sophisticated cross-lingual text controls. Our system employs a data-centric paradigm which organizes prompts with natural language instructions. The proposed method is compatible with various sequence-to-sequence architectures and utilizes a novel training strategy to address the versatility of prompt combinations. Empirical results show that our approach not only demonstrates its capacity to adhere to multiple user-defined constraints but also maintains high performance in generation tasks without prompts. Moreover, extensive analysis shows that the model exhibits robustness to prompt variance such as language and quantity.

  • New
  • Research Article
  • 10.1145/3797912
Optimizing Vietnamese Speech Recognition Models Through Dataset-Level Audio and Speech Characteristics
  • Apr 13, 2026
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Kiet Gia + 2 more

The effectiveness of Speech-to-Text (STT) models depends heavily on dataset-level audio and speech characteristics, yet the quantitative influence of these factors remains insufficiently explored, particularly for low-resource languages such as Vietnamese. This study examines how specific audio and speech characteristics, including Speech Rate, Naturalness, Signal-To-Noise Ratio, Audio Coloration and Environmental Reverberation, affect STT performance for Vietnamese. Among them, naturalness is introduced as a new evaluative characteristic with a dedicated metric for dataset selection. Experiments in a real-world setting with a social robot show that tailoring datasets based on these characteristics can respectively improve the accuracy of the trained models by approximately 2.66%, 4.72%, 8.36%, 5.89%, 5.00% compared to training on untailored ones. Additionally, models trained on curated datasets can outperform conventional pre-trained models by up to approximately 8.7% accuracy-wise, highlighting the effectiveness of our approach. The methodology is most useful in practical deployments - such as social robots, voice assistants, and contact-center systems - where field audio is noisier, reverberant, and produced by diverse, non-uniform speakers; its benefit diminishes once sufficiently large, representative training datasets exist.
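
Of the characteristics listed above, signal-to-noise ratio has a standard definition that is easy to show in code. The sketch below computes SNR in decibels from separate speech and noise samples and keeps only clips above a threshold. The abstract does not describe the authors' selection procedure, so the function names, the `(name, signal, noise)` clip format, and the 15 dB default are all assumptions made for illustration.

```python
import math

def snr_db(signal, noise):
    """SNR in decibels: 10 * log10(mean signal power / mean noise power)."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)

def select_clips(clips, min_snr_db=15.0):
    """Return the names of clips whose SNR meets a (hypothetical) threshold.

    clips: iterable of (name, signal_samples, noise_samples) tuples, where the
    noise samples are assumed to come from a speech-free stretch of the clip.
    """
    return [name for name, signal, noise in clips
            if snr_db(signal, noise) >= min_snr_db]
```

A signal with ten times the amplitude of the noise has one hundred times the power, which corresponds to 20 dB.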

  • New
  • Research Article
  • 10.1145/3798045
Deep Hierarchical Attention Based Approach for Multilingual Automatized Recommendation
  • Apr 13, 2026
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Nilufar Zaman + 1 more

In today’s technologically advanced society, online services are rapidly expanding, with a growing emphasis on customer satisfaction. To enhance the value of cloud services for users, it is essential to provide relevant and authentic recommendations. To address this requirement, our model DeepHaB-MMF integrates an automated recommendation system with advanced contextual and sequential embeddings, designed to handle multilingual inputs. The first phase of our model is DeepHaB, which processes user reviews by generating embeddings that combine contextual information with extracted features. These embeddings are then passed through a deep BiLSTM network to capture the bidirectional dependencies in the data. An attention mechanism further enhances the process by highlighting the most informative features, which in turn helps classify reviews as truthful or fake. These results are then provided, together with other extracted features, to the next phase of the model, DeepMMF, which modifies the traditional matrix factorization technique to provide relevant recommendations. Thus our model, DeepHaB-MMF, first filters out fake reviews and then, based only on truthful reviews, provides authentic as well as relevant recommendations. The model is evaluated on three low-resource languages (Hindi, Marathi and Bengali), and the results clearly show that it outperforms other state-of-the-art approaches.

  • New
  • Research Article
  • 10.1145/3807779
A Dataset and Model for Personalized Chinese Stance Detection
  • Apr 13, 2026
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Qingying Sun + 6 more

Current stance detection models predominantly rely on text analysis, often overlooking the wealth of information embedded in user profiles. This oversight is largely attributed to the scarcity of stance detection datasets that encompass detailed user profiles. Bridging this gap, we have created the first dataset for Chinese stance detection that includes user profile information (PC-STANCE). The dataset contains 31,033 Chinese microblogs annotated with stances toward 8 targets, as well as user gender, age, and location information. We have conducted a detailed analysis of the dataset and performed extensive empirical experiments using classic neural network models. Our experiments achieve state-of-the-art results, surpassing several large language models like Llama and ChatGPT. This confirms the effectiveness of integrating user profile information in stance detection and underscores the challenging nature of the dataset. To facilitate further research in stance detection, we have made the dataset publicly available.

  • Research Article
  • 10.1145/3806198
Optimizing Low-Resource Machine Transliteration: A Case of Script Transition from Manipuri in Bengali Script to Meetei-Mayek
  • Mar 31, 2026
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Gourashyam Moirangthem + 1 more

At the expense of quantity and quality of training data, corpus-based models are becoming superior to rule-based models in solving complex Natural Language Processing (NLP) problems. In this research work, three categories of Machine Learning (ML) models for Machine Transliteration (MTx) tasks are examined in a strictly low-resource scenario. This work studies and compares the Rule-Based Machine Transliteration (RBMTx) model, Statistical Machine Transliteration (SMTx) model and five generation-defining Neural Machine Transliteration (NMTx) models for the transliteration task in the low-resource Manipuri language. The work also discusses the contemporary script issues for the Manipuri language. The work explored and demonstrated how existing RBMTx models can facilitate corpus-based data-intensive machine learning models for low-resource languages using a novel technique for building parallel datasets. This study produced a gold-standard corpus of 35,000 Bengali script-Meetei Mayek parallel Manipuri words. With a Character Error Rate (CER) of only 0.66, a chrF score of 98.3, a BLEU score of 98.7 and a METEOR score of 99.00, the best performing Encoder-Decoder with Self Attention machine transliteration model sets a new performance record for the Bengali script to Meetei Mayek transliteration task. In addition to its immense potential for facilitating the ongoing script transition from Bengali script to Meetei Mayek, this research work will also help in addressing the low-resource bottleneck of Meetei Mayek for downstream Manipuri language NLP tasks.