Semi-Supervised Code Translation Overcoming the Scarcity of Parallel Code Data

Abstract

Neural code translation is the task of converting source code from one programming language to another. One of the main challenges is the scarcity of parallel code data, which hinders the ability of translation models to learn accurate cross-language alignments. In this paper, we introduce MIRACLE, a semi-supervised approach that improves code translation by synthesizing high-quality parallel code data and applying curriculum learning to code data with ascending alignment levels. MIRACLE leverages static analysis and compilation to generate synthetic parallel code datasets with enhanced quality and alignment, addressing the challenge of data scarcity. We evaluate the proposed method along with strong baselines, including instruction-tuned Large Language Models (LLMs) for code. Our analysis reveals that LLMs pre-trained on open-source code data, regardless of their size, suffer from the "shallow translation" problem: translated code copies keywords, statements, and even code blocks from the source language, leading to compilation and runtime errors. Extensive experiments demonstrate that our method significantly mitigates this issue, enhancing code translation performance across multiple models in C++, Java, Python, and C. Remarkably, MIRACLE outperforms code LLMs that are ten times larger. MIRACLE also achieves up to a 43% improvement in C code translation with fewer than 150 annotated examples.
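The "shallow translation" failure mode can be made concrete with a small sketch (our own illustration, not from the paper; `SHALLOW_CPP`, `shallow_tokens`, and the keyword list are hypothetical): a Python-to-C++ translation that leaks the Python keyword `None` into the C++ output, which a simple token scan can flag.

```python
import re

# Hypothetical "shallow" Python-to-C++ translation: the model copied the
# Python keyword `None` verbatim, so this C++ would not compile.
SHALLOW_CPP = "int add(int a, int b) {\n    if (a == None) return b;\n    return a + b;\n}"

# A few tokens that are valid Python but have no meaning in C++.
PYTHON_ONLY_TOKENS = {"def", "elif", "None", "pass", "lambda"}

def shallow_tokens(cpp_code):
    """Return Python-only tokens that leaked into supposedly-C++ output."""
    words = set(re.findall(r"[A-Za-z_]\w*", cpp_code))
    return sorted(words & PYTHON_ONLY_TOKENS)

print(shallow_tokens(SHALLOW_CPP))  # ['None'] -- the leaked Python keyword
```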

Similar Papers
  • Research Article
  • Cited by 25
  • 10.1016/0167-8191(95)01017-9
Implementation and performance issues of a massively parallel atmospheric model
  • Oct 1, 1995
  • Parallel Computing
  • Steven W Hammond + 3 more

  • Video Transcripts
  • 10.48448/pyam-qs42
Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data
  • Aug 1, 2021
  • Underline Science Inc.
  • Wei-Jen Ko

The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on 7 languages from three different language families and show that our technique significantly improves translation into the low-resource language compared to other translation baselines.
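Back-translation, one of the objectives NMT-Adapt combines, can be sketched in a few lines (a minimal illustration with a placeholder reverse model, not the paper's implementation):

```python
def back_translate(monolingual_targets, reverse_model):
    """Build synthetic (source, target) training pairs from target-side
    monolingual data using a reverse-direction translation model."""
    pairs = []
    for tgt in monolingual_targets:
        synthetic_src = reverse_model(tgt)  # translate target -> source
        pairs.append((synthetic_src, tgt))  # train the forward model on these
    return pairs

# Toy stand-in for a trained target->source translator.
toy_reverse = lambda sentence: "<synthetic> " + sentence
print(back_translate(["buenos dias"], toy_reverse))
```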

  • Research Article
  • Cited by 5
  • 10.1007/s00521-025-11145-1
Transformers to the rescue: alleviating data scarcity in Arabic grammatical error correction with pre-trained models
  • Apr 24, 2025
  • Neural Computing and Applications
  • Karim Ismail + 3 more

Grammatical error correction (GEC) in Arabic presents unique challenges arising from complex morphology and contextual intricacies. Current methodologies predominantly rely on neural machine translation (NMT) models, hindered by the scarcity of adequately annotated training data. This research introduces a novel approach utilizing pre-trained transformers, specifically sequence-to-sequence (seq2seq) models, such as AraT5 and AraBART, alongside their multilingual variants (mT5 and mBART), to address Arabic GEC. These transformers, initially designed for diverse natural language processing tasks, demonstrate promising results in GEC, particularly when parallel data are limited. Employing tokenization and preprocessing techniques on publicly accessible GEC datasets, we train the transformers using a supervised approach. The experimental results showcase superior performance, surpassing previous models with an F1 score of 92.1% on the QALB 2014 dataset, 89.4% on the QALB 2015 native test data, and 83.6% on non-native data. This highlights the effectiveness of the proposed methodology in rectifying various grammatical errors in Arabic text. In conclusion, this study contributes to advancing the field of Arabic GEC by leveraging transfer learning with pre-trained transformers. The findings underscore the potential of this approach to overcome challenges posed by limited data availability, with AraBART emerging as a practical choice. This research opens avenues for further exploration in low-resource languages. It suggests potential applications in high-resource languages, encouraging future comparative studies.

  • Research Article
  • 10.1016/j.neunet.2025.108114
A domain-specific cross-lingual semantic alignment learning model for low-resource languages.
  • Feb 1, 2026
  • Neural Networks: the official journal of the International Neural Network Society
  • Yurong Wang + 5 more

  • Conference Article
  • Cited by 56
  • 10.1109/hpca.2005.30
Scatter-Add in Data Parallel Architectures
  • Feb 12, 2005
  • Jung Ho Ahn + 2 more

Many important applications exhibit large amounts of data parallelism, and modern computer systems are designed to take advantage of it. While much of the computation in the multimedia and scientific application domains is data parallel, certain operations require costly serialization that increases the run time. Examples include superposition type updates in scientific computing and histogram computations in media processing. We introduce scatter-add, which is the data-parallel form of the well-known scalar fetch-and-op, specifically tuned for SIMD/vector/stream style memory systems. The scatter-add mechanism scatters a set of data values to a set of memory addresses and adds each data value to each referenced memory location instead of overwriting it. This novel architecture extension allows us to efficiently support data-parallel atomic update computations found in parallel programming languages such as HPF, and applies both to single-processor and multiprocessor SIMD data-parallel systems. We detail the microarchitecture of a scatter-add implementation on a stream architecture, which requires less than 2% increase in die area yet shows performance speedups ranging from 1.45 to over 11 on a set of applications that require a scatter-add computation.
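The scatter-add semantics described above can be captured with a serial reference implementation (our own sketch of the semantics, not the paper's stream-architecture microarchitecture, which performs these updates atomically in parallel):

```python
def scatter_add(memory, indices, values):
    """Scatter each value to its target address and *add* it to the
    existing contents instead of overwriting it (the data-parallel
    form of scalar fetch-and-add)."""
    for i, v in zip(indices, values):
        memory[i] += v
    return memory

# Histogram computation, one of the serialization-heavy uses cited above:
# count how many samples fall into each of 4 bins.
hist = scatter_add([0, 0, 0, 0], [0, 2, 2, 3, 2], [1, 1, 1, 1, 1])
print(hist)  # [1, 0, 3, 1]
```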

  • Research Article
  • Cited by 2
  • 10.3390/math12132107
Unified Training for Cross-Lingual Abstractive Summarization by Aligning Parallel Machine Translation Pairs
  • Jul 4, 2024
  • Mathematics
  • Shaohuan Cheng + 4 more

Cross-lingual summarization (CLS) is essential for enhancing global communication by facilitating efficient information exchange across different languages. However, owing to the scarcity of CLS data, recent studies have employed multi-task frameworks to combine parallel monolingual summaries. These methods often use independent decoders or models with non-shared parameters because of the mismatch in output languages, which limits the transfer of knowledge between CLS and its parallel data. To address this issue, we propose a unified training method for CLS that combines parallel machine translation (MT) pairs with CLS pairs, jointly training them within a single model. This design ensures consistent input and output languages and promotes knowledge sharing between the two tasks. To further enhance the model’s capability to focus on key information, we introduce two additional loss terms to align the hidden representations and probability distributions between the parallel MT and CLS pairs. Experimental results demonstrate that our method outperforms competitive methods in both full-dataset and low-resource scenarios on two benchmark datasets, Zh2EnSum and En2ZhSum.

  • Conference Article
  • Cited by 137
  • 10.1145/3540250.3549113
No more fine-tuning? an experimental evaluation of prompt tuning in code intelligence
  • Nov 7, 2022
  • Chaozheng Wang + 5 more

Pre-trained models have been shown effective in many code intelligence tasks. These models are pre-trained on large-scale unlabeled corpora and then fine-tuned in downstream tasks. However, as the inputs to pre-training and downstream tasks are in different forms, it is hard to fully explore the knowledge of pre-trained models. Besides, the performance of fine-tuning strongly relies on the amount of downstream data, while in practice, scenarios with scarce data are common. Recent studies in the natural language processing (NLP) field show that prompt tuning, a new paradigm for tuning, alleviates the above issues and achieves promising results in various NLP tasks. In prompt tuning, the prompts inserted during tuning provide task-specific knowledge, which is especially beneficial for tasks with relatively scarce data. In this paper, we empirically evaluate the usage and effect of prompt tuning in code intelligence tasks. We conduct prompt tuning on popular pre-trained models CodeBERT and CodeT5 and experiment with three code intelligence tasks including defect prediction, code summarization, and code translation. Our experimental results show that prompt tuning consistently outperforms fine-tuning in all three tasks. In addition, prompt tuning shows great potential in low-resource scenarios, e.g., improving the BLEU scores of fine-tuning by more than 26% on average for code summarization. Our results suggest that instead of fine-tuning, we could adapt prompt tuning for code intelligence tasks to achieve better performance, especially when lacking task-specific data.
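The difference between the two tuning paradigms can be sketched at the input level (an illustrative hard-prompt template of the general kind such studies evaluate; the exact templates in the paper may differ):

```python
def finetune_input(code):
    # Fine-tuning feeds the raw snippet; the task is implicit in the
    # task-specific head stacked on the pre-trained model.
    return code

def prompt_input(code):
    # Prompt tuning wraps the snippet in a cloze-style template so the
    # downstream task matches the masked-language-model pre-training form.
    return f"{code} Summarize this code: [MASK]."

snippet = "def add(a, b): return a + b"
print(prompt_input(snippet))
```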

  • Single Report
  • Cited by 18
  • 10.21236/ada274125
Automatic Mapping of Task and Data Parallel Programs for Efficient Execution on Multicomputers
  • Nov 1, 1993
  • Jaspal Subhlok

For a wide variety of applications, both task and data parallelism must be exploited to achieve the best possible performance on a multicomputer. Recent research has underlined the importance of exploiting task and data parallelism in a single compiler framework, and such a compiler can map a single source program in many different ways onto a parallel machine. There are several complex tradeoffs between task and data parallelism, depending on the characteristics of the program to be executed and the performance parameters of the target parallel machine. This makes it very difficult for a programmer to select a good mapping for a task and data parallel program. In this paper we isolate and examine specific characteristics of executing programs that determine the performance for different mappings on a parallel machine, and present an automatic system to obtain good mappings. The process consists of two steps: First, an instrumented input program is executed a fixed number of times with different mappings, to build an execution model of the program. Next, the model is analyzed to obtain a good final mapping of the program onto the processors of the parallel machine. The current implementation is static and feedback driven, although the approach can be extended to a dynamic system. We demonstrate the system with an example program that is a model for many applications in the domains of signal processing and image processing.

  • Research Article
  • 10.1080/09617353.2001.11690717
Methods for Modelling the Confidence in Probabilistic Failure Estimates
  • Jun 1, 2001
  • Safety and Reliability
  • C F Cain + 2 more

Formal safety assessment is devoted to the process of estimating the safety of a product and identifying appropriate measures to reduce system risks to an acceptable level. It is important to be able to account for the possibility of vague or scarce probabilistic data in parallel with quantitative safety assessment procedures. This paper describes several practical approaches by which the confidence in a failure probability estimate can be modelled and integrated into any path based probabilistic safety assessment, resulting in top event confidence measures. The modified Boolean representation method (MBRM) is used to demonstrate the procedures.

  • Research Article
  • Cited by 36
  • 10.1093/jamia/ocac149
A survey of automated methods for biomedical text simplification.
  • Sep 9, 2022
  • Journal of the American Medical Informatics Association
  • Brian Ondov + 2 more

  • Video Transcripts
  • 10.48448/hgcr-f179
Multilingual Pre-training with Language and Task Adaptation for Multilingual Text Style Transfer
  • May 7, 2022
  • Underline Science Inc.
  • Huiyuan Lai + 2 more

We exploit the pre-trained seq2seq model mBART for multilingual text style transfer. Using machine translated data as well as gold aligned English sentences yields state-of-the-art results in the three target languages we consider. Moreover, in view of the general scarcity of parallel data, we propose a modular approach for multilingual formality transfer, which consists of two training strategies that target adaptation to both language and task. Our approach achieves competitive performance without monolingual task-specific parallel data and can be applied to other style transfer tasks as well as to other languages.

  • Conference Article
  • Cited by 19
  • 10.18653/v1/2021.naacl-main.459
Training Data Augmentation for Code-Mixed Translation
  • Jan 1, 2021
  • Abhirut Gupta + 2 more

Machine translation of user-generated code-mixed inputs to English is of crucial importance in applications like web search and targeted advertising. We address the scarcity of parallel training data for training such models by designing a strategy of converting existing non-code-mixed parallel data sources to code-mixed parallel data. We present an m-BERT based procedure whose core learnable component is a ternary sequence labeling model, that can be trained with a limited code-mixed corpus alone. We show a 5.8 point increase in BLEU on heavily code-mixed sentences by training a translation model using our data augmentation strategy on a Hindi-English code-mixed translation task.

  • Video Transcripts
  • 10.48448/cxex-xm40
Training Data Augmentation for Code-Mixed Translation
  • May 25, 2021
  • Underline Science Inc.
  • Sunita Sarawagi + 2 more

Machine translation of user-generated code-mixed inputs to English is of crucial importance in applications like web search and targeted advertising. We address the scarcity of parallel training data for training such models by designing a strategy of converting existing non-code-mixed parallel data sources to code-mixed parallel data. We present an m-BERT based procedure whose core learnable component is a ternary sequence labeling model, that can be trained with a limited code-mixed corpus alone. We show a 5.8 point increase in BLEU on heavily code-mixed sentences by training a translation model using our data augmentation strategy on a Hindi-English code-mixed translation task.

  • Conference Article
  • Cited by 3
  • 10.24963/ijcai.2021/547
Improving Stylized Neural Machine Translation with Iterative Dual Knowledge Transfer
  • Aug 1, 2021
  • Xuanxuan Wu + 6 more

Stylized neural machine translation (NMT) aims to translate sentences of one style into sentences of another style, which is essential for applying machine translation in real-world scenarios. However, a major challenge in this task is the scarcity of high-quality stylized parallel data. To address this problem, we propose an iterative dual knowledge transfer framework that utilizes informal training data of machine translation and formality style transfer data to create large-scale stylized paired data for training the stylized machine translation model. Specifically, we perform bidirectional knowledge transfer between the translation model and the text style transfer model iteratively through knowledge distillation. Then, we further propose a data-refinement module to process the noisy synthetic parallel data generated during knowledge transfer. Experimental results demonstrate the effectiveness of our method, achieving an improvement over the existing best model by 5 BLEU points on the MTFC dataset. Meanwhile, extensive analyses illustrate that our method can also improve the accuracy of formality style transfer.

  • Conference Article
  • Cited by 200
  • 10.1145/1941553.1941562
Copperhead
  • Feb 12, 2011
  • Bryan Catanzaro + 2 more

Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer describes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis. In this paper, we discuss the language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code. We define the restricted subset of Python which Copperhead supports and introduce the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations. We also outline the runtime support by which Copperhead programs interoperate with standard Python modules. We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on GPUs. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient code, yielding 45-100% of the performance of hand-crafted, well optimized CUDA code.
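The flat data-parallel style Copperhead embeds can be sketched in plain Python with the `map` primitive (our illustrative `axpy` example; Copperhead itself compiles such code to efficient CUDA rather than executing it with Python's serial `map`):

```python
def axpy(a, x, y):
    """Element-wise a*x + y, expressed through the data-parallel map
    primitive; a compiler like Copperhead can lower this pattern to
    GPU code instead of running a serial Python loop."""
    return list(map(lambda xi, yi: a * xi + yi, x, y))

print(axpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```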
