Semi-Supervised Code Translation: Overcoming the Scarcity of Parallel Code Data
Neural code translation is the task of converting source code from one programming language to another. One of the main challenges is the scarcity of parallel code data, which hinders the ability of translation models to learn accurate cross-language alignments. In this paper, we introduce MIRACLE, a semi-supervised approach that improves code translation by synthesizing high-quality parallel code data and applying curriculum learning over code data with ascending alignment levels. MIRACLE leverages static analysis and compilation to generate synthetic parallel code datasets with enhanced quality and alignment, addressing the challenge of data scarcity. We evaluate the proposed method along with strong baselines, including instruction-tuned Large Language Models (LLMs) for code. Our analysis reveals that LLMs pre-trained on open-source code data, regardless of their size, suffer from the "shallow translation" problem: the translated code copies keywords, statements, and even entire code blocks from the source language, leading to compilation and runtime errors. Extensive experiments demonstrate that our method significantly mitigates this issue, enhancing code translation performance across multiple models in C++, Java, Python, and C. Remarkably, MIRACLE outperforms code LLMs that are ten times larger, and it achieves up to a 43% improvement in C code translation with fewer than 150 annotated examples.
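To make the role of compilation in corpus construction concrete, here is a minimal sketch (not MIRACLE's actual implementation) of how synthetic translation pairs could be filtered by a compile check, so that "shallow translations" that copy source-language constructs and fail to compile are discarded; the helper names are hypothetical.

```python
# Minimal sketch, assuming C is the target language: reject synthetic
# C++ -> C pairs whose target side does not even compile, one plausible way
# to raise the quality and alignment of a synthetic parallel corpus.
import os
import subprocess
import tempfile

def compiles_as_c(code: str) -> bool:
    """Return True if `code` passes a gcc syntax check as C source."""
    with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["gcc", "-fsyntax-only", path],
                                capture_output=True, timeout=30)
        return result.returncode == 0
    finally:
        os.remove(path)

def filter_synthetic_pairs(pairs):
    """Keep only (source_cpp, candidate_c) pairs whose target side compiles,
    dropping 'shallow translations' that retain C++-only constructs."""
    return [(src, tgt) for src, tgt in pairs if compiles_as_c(tgt)]
```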
- Research Article
25
- 10.1016/0167-8191(95)01017-9
- Oct 1, 1995
- Parallel Computing
Implementation and performance issues of a massively parallel atmospheric model
- Video Transcripts
- 10.48448/pyam-qs42
- Aug 1, 2021
- Underline Science Inc.
The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language using only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation, and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on 7 languages from three different language families and show that our technique significantly improves translation into the low-resource language compared to other translation baselines.
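As a rough illustration of the back-translation component mentioned above, the sketch below pairs each low-resource monolingual sentence with a machine-translated high-resource hypothesis; `translate_lr_to_hr` is a hypothetical stand-in for a trained model, not part of NMT-Adapt's actual code.

```python
# Hedged sketch of back-translation: synthesize (high-resource, low-resource)
# training pairs from low-resource monolingual text. `translate_lr_to_hr`
# is a placeholder for a trained low-resource -> high-resource model.
def back_translate(monolingual_lowres, translate_lr_to_hr):
    synthetic_pairs = []
    for sentence in monolingual_lowres:
        hr_hypothesis = translate_lr_to_hr(sentence)       # noisy source side
        synthetic_pairs.append((hr_hypothesis, sentence))  # clean target side
    return synthetic_pairs
```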
- Research Article
5
- 10.1007/s00521-025-11145-1
- Apr 24, 2025
- Neural Computing and Applications
Grammatical error correction (GEC) in Arabic presents unique challenges arising from complex morphology and contextual intricacies. Current methodologies predominantly rely on neural machine translation (NMT) models, which are hindered by the scarcity of adequately annotated training data. This research introduces a novel approach utilizing pre-trained transformers, specifically sequence-to-sequence (seq2seq) models such as AraT5 and AraBART, alongside their multilingual variants (mT5 and mBART), to address Arabic GEC. These transformers, initially designed for diverse natural language processing tasks, demonstrate promising results in GEC, particularly when parallel data are limited. Employing tokenization and preprocessing techniques on publicly accessible GEC datasets, we train the transformers using a supervised approach. The experimental results showcase superior performance, surpassing previous models with an F1 score of 92.1% on the QALB 2014 dataset, 89.4% on the QALB 2015 native test data, and 83.6% on non-native data. This highlights the effectiveness of the proposed methodology in rectifying various grammatical errors in Arabic text. In conclusion, this study contributes to advancing the field of Arabic GEC by leveraging transfer learning with pre-trained transformers. The findings underscore the potential of this approach to overcome challenges posed by limited data availability, with AraBART emerging as a practical choice. This research opens avenues for further exploration in low-resource languages and suggests potential applications in high-resource languages, encouraging future comparative studies.
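For readers unfamiliar with the seq2seq framing, GEC can be cast as "translating" erroneous text into corrected text. The sketch below shows this with the Hugging Face transformers API; the checkpoint name is a placeholder and the decoding settings are illustrative, not the paper's configuration.

```python
# Hedged sketch: Arabic GEC as sequence-to-sequence generation.
# "moussaKam/AraBART" is a placeholder checkpoint name, not necessarily
# the exact model fine-tuned in the study.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "moussaKam/AraBART"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def correct(sentence: str) -> str:
    """Generate a corrected version of an (Arabic) input sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```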
- Research Article
- 10.1016/j.neunet.2025.108114
- Feb 1, 2026
- Neural Networks: the official journal of the International Neural Network Society
A domain-specific cross-lingual semantic alignment learning model for low-resource languages.
- Conference Article
56
- 10.1109/hpca.2005.30
- Feb 12, 2005
Many important applications exhibit large amounts of data parallelism, and modern computer systems are designed to take advantage of it. While much of the computation in the multimedia and scientific application domains is data parallel, certain operations require costly serialization that increases the run time. Examples include superposition-type updates in scientific computing and histogram computations in media processing. We introduce scatter-add, which is the data-parallel form of the well-known scalar fetch-and-op, specifically tuned for SIMD/vector/stream style memory systems. The scatter-add mechanism scatters a set of data values to a set of memory addresses and adds each data value to each referenced memory location instead of overwriting it. This novel architecture extension allows us to efficiently support data-parallel atomic update computations found in parallel programming languages such as HPF, and applies both to single-processor and multiprocessor SIMD data-parallel systems. We detail the microarchitecture of a scatter-add implementation on a stream architecture, which requires less than a 2% increase in die area yet shows performance speedups ranging from 1.45 to over 11 on a set of applications that require a scatter-add computation.
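The scatter-add semantics described above also exists in software form; for instance, NumPy's `np.add.at` accumulates values at repeated indices instead of overwriting them, which is exactly the histogram/superposition pattern the hardware mechanism accelerates.

```python
# Software illustration of scatter-add semantics (not the hardware design):
# values scattered to the same address are summed rather than overwritten.
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])
indices = np.array([0, 2, 2, 1])       # index 2 is referenced twice

out = np.zeros(3)
np.add.at(out, indices, values)        # out == [1.0, 4.0, 5.0]

# Histogram computation, the media-processing example from the abstract:
bins = np.zeros(4, dtype=np.int64)
np.add.at(bins, np.array([0, 1, 1, 3, 3, 3]), 1)   # bins == [1, 2, 0, 3]
```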
- Research Article
2
- 10.3390/math12132107
- Jul 4, 2024
- Mathematics
Cross-lingual summarization (CLS) is essential for enhancing global communication by facilitating efficient information exchange across different languages. However, owing to the scarcity of CLS data, recent studies have employed multi-task frameworks to combine parallel monolingual summaries. These methods often use independent decoders or models with non-shared parameters because of the mismatch in output languages, which limits the transfer of knowledge between CLS and its parallel data. To address this issue, we propose a unified training method for CLS that combines parallel machine translation (MT) pairs with CLS pairs, jointly training them within a single model. This design ensures consistent input and output languages and promotes knowledge sharing between the two tasks. To further enhance the model’s capability to focus on key information, we introduce two additional loss terms to align the hidden representations and probability distributions between the parallel MT and CLS pairs. Experimental results demonstrate that our method outperforms competitive methods in both full-dataset and low-resource scenarios on two benchmark datasets, Zh2EnSum and En2ZhSum.
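A minimal sketch of the two auxiliary alignment terms mentioned above is given below, assuming the parallel MT and CLS examples have already been encoded into pooled, sequence-level hidden states and output distributions by the shared model; the function name, pooling assumption, and the choice to stop gradients on the MT side are illustrative, not the paper's exact formulation.

```python
# Hedged sketch of the auxiliary alignment losses (illustrative, not the
# paper's exact setup): align pooled hidden representations with an MSE term
# and output probability distributions with a KL term.
import torch
import torch.nn.functional as F

def alignment_losses(mt_hidden, cls_hidden, mt_logits, cls_logits):
    # mt_* / cls_* are pooled, sequence-level tensors from the shared model.
    hidden_loss = F.mse_loss(cls_hidden, mt_hidden.detach())
    dist_loss = F.kl_div(
        F.log_softmax(cls_logits, dim=-1),
        F.softmax(mt_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return hidden_loss, dist_loss
```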
- Conference Article
137
- 10.1145/3540250.3549113
- Nov 7, 2022
Pre-trained models have been shown effective in many code intelligence tasks. These models are pre-trained on a large-scale unlabeled corpus and then fine-tuned on downstream tasks. However, as the inputs to pre-training and downstream tasks take different forms, it is hard to fully exploit the knowledge of pre-trained models. Besides, the performance of fine-tuning strongly relies on the amount of downstream data, while in practice, scenarios with scarce data are common. Recent studies in the natural language processing (NLP) field show that prompt tuning, a new tuning paradigm, alleviates the above issues and achieves promising results in various NLP tasks. In prompt tuning, the prompts inserted during tuning provide task-specific knowledge, which is especially beneficial for tasks with relatively scarce data. In this paper, we empirically evaluate the usage and effect of prompt tuning in code intelligence tasks. We conduct prompt tuning on the popular pre-trained models CodeBERT and CodeT5 and experiment with three code intelligence tasks: defect prediction, code summarization, and code translation. Our experimental results show that prompt tuning consistently outperforms fine-tuning in all three tasks. In addition, prompt tuning shows great potential in low-resource scenarios, e.g., improving the BLEU scores of fine-tuning by more than 26% on average for code summarization. Our results suggest that, instead of fine-tuning, we could adopt prompt tuning for code intelligence tasks to achieve better performance, especially when lacking task-specific data.
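As a concrete (and hedged) example of what a hard prompt looks like for one of the studied tasks, the snippet below wraps a code snippet in a task description before passing it to CodeT5; the template wording is illustrative rather than the exact prompt used in the paper.

```python
# Hedged sketch of hard-prompt inference for code summarization with CodeT5.
# The prompt template is an illustrative example, not the paper's template.
from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "Salesforce/codet5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

def prompted_input(code: str) -> str:
    # Insert a task-specific natural-language prompt around the input.
    return f"Summarize the following Python function: {code} Summary:"

inputs = tokenizer(prompted_input("def add(a, b): return a + b"),
                   return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```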
- Single Report
18
- 10.21236/ada274125
- Nov 1, 1993
For a wide variety of applications, both task and data parallelism must be exploited to achieve the best possible performance on a multicomputer. Recent research has underlined the importance of exploiting task and data parallelism in a single compiler framework, and such a compiler can map a single source program in many different ways onto a parallel machine. There are several complex tradeoffs between task and data parallelism, depending on the characteristics of the program to be executed and the performance parameters of the target parallel machine. This makes it very difficult for a programmer to select a good mapping for a task- and data-parallel program. In this paper we isolate and examine specific characteristics of executing programs that determine the performance of different mappings on a parallel machine, and present an automatic system for obtaining good mappings. The process consists of two steps: first, an instrumented input program is executed a fixed number of times with different mappings to build an execution model of the program; next, the model is analyzed to obtain a good final mapping of the program onto the processors of the parallel machine. The current implementation is static and feedback-driven, although the approach can be extended to a dynamic system. We demonstrate the system with an example program that is a model for many applications in the domains of signal processing and image processing.
- Research Article
- 10.1080/09617353.2001.11690717
- Jun 1, 2001
- Safety and Reliability
Formal safety assessment is devoted to the process of estimating the safety of a product and identifying appropriate measures to reduce system risks to an acceptable level. It is important to be able to account for the possibility of vague or scarce probabilistic data in parallel with quantitative safety assessment procedures. This paper describes several practical approaches by which the confidence in a failure probability estimate can be modelled and integrated into any path-based probabilistic safety assessment, resulting in top-event confidence measures. The modified Boolean representation method (MBRM) is used to demonstrate the procedures.
- Research Article
36
- 10.1093/jamia/ocac149
- Sep 9, 2022
- Journal of the American Medical Informatics Association
A survey of automated methods for biomedical text simplification.
- Video Transcripts
- 10.48448/hgcr-f179
- May 7, 2022
- Underline Science Inc.
We exploit the pre-trained seq2seq model mBART for multilingual text style transfer. Using machine-translated data as well as gold-aligned English sentences yields state-of-the-art results in the three target languages we consider. Moreover, in view of the general scarcity of parallel data, we propose a modular approach for multilingual formality transfer, which consists of two training strategies that target adaptation to both language and task. Our approach achieves competitive performance without monolingual task-specific parallel data and can be applied to other style transfer tasks as well as to other languages.
- Conference Article
19
- 10.18653/v1/2021.naacl-main.459
- Jan 1, 2021
Machine translation of user-generated code-mixed inputs to English is of crucial importance in applications like web search and targeted advertising. We address the scarcity of parallel data for training such models by designing a strategy of converting existing non-code-mixed parallel data sources to code-mixed parallel data. We present an m-BERT based procedure whose core learnable component is a ternary sequence labeling model that can be trained with a limited code-mixed corpus alone. We show a 5.8 point increase in BLEU on heavily code-mixed sentences by training a translation model using our data augmentation strategy on a Hindi-English code-mixed translation task.
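To illustrate the conversion strategy at a high level, the sketch below applies per-token ternary labels to turn a monolingual sentence into a synthetic code-mixed one; the label set and helper functions are hypothetical stand-ins for the m-BERT sequence labeler described above.

```python
# Hedged sketch of synthetic code-mixing from per-token ternary labels.
# KEEP/SWITCH/OTHER and `translate_token` are hypothetical stand-ins for
# the m-BERT labeling model and a lexical translation step.
KEEP, SWITCH, OTHER = 0, 1, 2

def make_code_mixed(tokens, labels, translate_token):
    """Replace tokens labeled SWITCH with their English translation,
    leaving the remaining tokens in the original language."""
    mixed = []
    for token, label in zip(tokens, labels):
        mixed.append(translate_token(token) if label == SWITCH else token)
    return " ".join(mixed)
```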
- Video Transcripts
- 10.48448/cxex-xm40
- May 25, 2021
- Underline Science Inc.
Machine translation of user-generated code-mixed inputs to English is of crucial importance in applications like web search and targeted advertising. We address the scarcity of parallel data for training such models by designing a strategy of converting existing non-code-mixed parallel data sources to code-mixed parallel data. We present an m-BERT based procedure whose core learnable component is a ternary sequence labeling model that can be trained with a limited code-mixed corpus alone. We show a 5.8 point increase in BLEU on heavily code-mixed sentences by training a translation model using our data augmentation strategy on a Hindi-English code-mixed translation task.
- Conference Article
3
- 10.24963/ijcai.2021/547
- Aug 1, 2021
Stylized neural machine translation (NMT) aims to translate sentences of one style into sentences of another style, which is essential for applying machine translation in real-world scenarios. However, a major challenge in this task is the scarcity of high-quality stylized parallel data. To address this problem, we propose an iterative dual knowledge transfer framework that utilizes informal machine translation training data and formality style transfer data to create large-scale stylized paired data for training a stylized machine translation model. Specifically, we perform bidirectional knowledge transfer between the translation model and the text style transfer model iteratively through knowledge distillation. Then, we further propose a data-refinement module to process the noisy synthetic parallel data generated during knowledge transfer. Experimental results demonstrate the effectiveness of our method, achieving an improvement of 5 BLEU points over the existing best model on the MTFC dataset. Meanwhile, extensive analyses illustrate that our method can also improve the accuracy of formality style transfer.
- Conference Article
200
- 10.1145/1941553.1941562
- Feb 12, 2011
Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer describes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis. In this paper, we discuss the language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code. We define the restricted subset of Python which Copperhead supports and introduce the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations. We also outline the runtime support by which Copperhead programs interoperate with standard Python modules. We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on GPUs. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient code, yielding 45-100% of the performance of hand-crafted, well optimized CUDA code.