Semi-synthetic Data Research Articles

Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on high-resource programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available (e.g., OCaml, Racket, and several others). This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, called MultiPL-T, generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. MultiPL-T translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize unit tests for commented code from a high-resource source language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate the code from the high-resource source language to a target low-resource language. This gives us a corpus of candidate training data in the target language, but many of these translations are wrong. 3) We use a lightweight compiler to compile the test cases generated in (1) from the source language to the target language, which allows us to filter out obviously wrong translations. The result is a training corpus in the target low-resource language where all items have been validated with test cases. We apply this approach to generate tens of thousands of new, validated training items for five low-resource languages: Julia, Lua, OCaml, R, and Racket, using Python as the source high-resource language.
Furthermore, we use an open Code LLM (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. Using datasets generated with MultiPL-T, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket that outperform other fine-tunes of these base models on the natural language to code task. We also present Racket fine-tunes for two very recent models, DeepSeek Coder and StarCoder2, to show that MultiPL-T continues to outperform other fine-tuning approaches for low-resource languages. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
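
The translate-then-validate steps described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the MultiPL-T implementation: the names SourceItem, build_validated_corpus, translate_code, compile_test, and run_tests are all hypothetical, step (1) is assumed to have already produced validated tests, and returning several candidate translations per item is an assumption. So that the sketch runs without a Code LLM or a second language toolchain, the demo uses Python as a stand-in for the target language.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class SourceItem:
    """A source-language function plus tests assumed to have survived step (1)."""
    code: str
    tests: List[str]


def build_validated_corpus(
    items: Iterable[SourceItem],
    translate_code: Callable[[str], List[str]],   # hypothetical stand-in for the step (2) Code LLM
    compile_test: Callable[[str], str],           # hypothetical stand-in for the step (3) test compiler
    run_tests: Callable[[str, List[str]], bool],  # hypothetical target-language test runner
) -> List[str]:
    """Keep only the candidate translations that pass the compiled tests."""
    corpus: List[str] = []
    for item in items:
        # Step (3): compile the source-language tests into the target language.
        target_tests = [compile_test(test) for test in item.tests]
        # Step (2): ask the translator for candidates, then filter them.
        for candidate in translate_code(item.code):
            if run_tests(candidate, target_tests):
                corpus.append(candidate)
    return corpus


# --- Toy demo: Python plays the role of the "target language" (assumption) ---

def demo_translate(code: str) -> List[str]:
    # Pretend the translator returned one wrong and one correct candidate.
    return [code.replace("+", "-"), code]


def demo_run_tests(candidate: str, tests: List[str]) -> bool:
    env: dict = {}
    try:
        exec(candidate, env)      # define the candidate function
        for test in tests:
            exec(test, env)       # each test is a bare assert statement
    except Exception:
        return False              # any failure or error rejects the candidate
    return True


demo_item = SourceItem(
    code="def add(a, b):\n    return a + b\n",
    tests=["assert add(1, 2) == 3"],
)
demo_corpus = build_validated_corpus(
    [demo_item], demo_translate, lambda test: test, demo_run_tests
)
```

In this demo only the unmodified candidate survives, because the variant that subtracts fails the compiled test. Passing the translator, test compiler, and test runner in as callables keeps the filtering logic independent of any particular model or target language.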

Context: The Data Stream Processing (DSP) approach focuses on real-time data processing by applying specific techniques for capturing and processing relevant data for on-the-fly results, i.e., without necessarily requiring prior storage. As with any other software, testing plays a vital role in the quality assurance of DSP applications. However, testing this kind of software is not a simple task. Factors that make testing challenging include message temporality, parallelism, data volume, complex infrastructure, variability, and speed of messages. Objective: This work aims to map and synthesize industry knowledge and experience regarding DSP application testing. Specifically, we want to know about challenges, test purposes, test approaches, test data sources, and adopted tools. Method: To achieve the objective, we performed a Grey Literature Review (e.g., blog posts, white papers, discussion lists, lecture themes at technical events, professional social networks, software repositories, and other web-published content) on testing DSP applications. We searched the grey literature using Google’s regular search engine in addition to specific searches on technical software development content websites. The selected studies were analyzed using qualitative and quantitative techniques. Results: Results are based on evidence from 154 selected sources. The challenges for testing DSP applications are the complexity of DSP applications, test infrastructure complexity, timing, and data acquisition issues. The main test objectives identified are functional suitability, performance efficiency, reliability, and maintainability. The main test approaches reported are Performance Testing, Regression Testing, Property-Based Testing, Chaos Testing, and Contract/Schema Testing. The strategies adopted by practitioners to obtain test data are Historical Data, Production Data Mirroring, Semi-Synthetic Data, and Synthetic Data.
We also report 50 tools used in various testing activities, including automating infrastructure, generating test data, test utilities, dealing with timing issues, load generation, simulation, and others. Furthermore, we identified gaps and opportunities for future scientific work. Conclusion: This work selected and summarized content produced by practitioners regarding DSP application testing. We identified that knowledge, techniques, and tools intrinsic to the practice were not present in the formal literature, so this study helps reduce the gap between industry and academia on this topic. The document has delivered benefits to industry practitioners and academic researchers.

Articles published on Semi-synthetic Data

A Machine Learning Framework for Assessing Experts’ Decision Quality

Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs

Enhancing Resilience in Biometric Research: Generation of 3D Synthetic Face Data Using Advanced 3D Character Creation Techniques from High-Fidelity Video Games and Animation.

PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration.

Deep learning models for crack segmentation in 3d images of concrete trained on semi-synthetic data

Semi-supervised ROC analysis for reliable and streamlined evaluation of phenotyping algorithms.

THE UTILIZATION OF SYNTHETIC AND SEMISYNTHETIC POINT CLOUDS AND IMAGES FOR TESTING NOVEL APPROACHES FOR CORRECTING LIDAR DATA

Road safety evaluation with multiple treatments: A comparison of methods based on simulations

Use of semi-synthetic data for catheter segmentation improvement.

On the Estimation of Spatial Density From Mobile Network Operator Data

A Grey Literature Review on Data Stream Processing applications testing

A Framework to Maximize Group Fairness for Workers on Online Labor Platforms

The K-mer antibiotic resistance gene variant analyzer (KARGVA).

An improved semi-synthetic approach for creating visual-inertial odometry datasets

Optimizing the preventive maintenance frequency with causal machine learning

Test-data generation and integration for long-distance e-vehicle routing

Semi-synthetic data generation to fine-tune a convolutional neural network for retrieving Raman signals from CARS spectra

The Exponomial Choice Model for Assortment Optimization: An Alternative to the MNL Model?

Estimating the Individual Treatment Effect on Survival Time Based on Prior Knowledge and Counterfactual Prediction.
