Related Topics

  • Loop Unrolling

Articles published on Compiler Optimizations

1095 Search results
  • Research Article
  • 10.1016/j.asoc.2026.115066
Two-level automatic software optimization using cooperative co-evolutionary algorithms
  • Jun 1, 2026
  • Applied Soft Computing
  • José Miguel Aragón-Jurado + 4 more

The rapid expansion of specialized hardware architectures has significantly increased the complexity of software optimization. Modern computing systems now incorporate diverse processors, co-processors, and heterogeneous execution environments. Each type of hardware requires specific optimization strategies to fully exploit its computational potential. Therefore, generic compiler optimizations often fail to account for intricate software-hardware interactions, leading to inefficiencies or performance degradation. Moreover, evolving compilation frameworks like LLVM continually introduce new optimizations and modify their optimization managers. This constant evolution makes it challenging to establish standardized optimization strategies. In this work, we first analyze the impact of the two existing pass managers in LLVM on automatic software optimization. Then, we define a novel two-level combinatorial optimization problem that leverages both pass managers for improved runtime performance. We solve this problem using a cooperative co-evolutionary cellular genetic algorithm and conduct extensive experiments to evaluate the impact of the different pass managers on software runtime. Specifically, we assess three optimization strategies, considering the legacy and new pass managers. Results demonstrate that the proposed methodology significantly enhances the runtime efficiency of the considered software, achieving up to runtime improvement over the non-optimized program and over the best existing optimization approaches.

  • Examines the impact of the legacy and new LLVM pass managers on automatic software optimization.
  • Introduces a two-level optimization framework that integrates both pass managers.
  • Employs a cooperative co-evolutionary strategy to address the joint optimization problem.
  • Demonstrates substantial runtime improvements, up to 99.41

  • Research Article
  • 10.22331/q-2026-04-13-2061
Efficient and high-performance routing of lattice-surgery paths on three-dimensional lattice
  • Apr 13, 2026
  • Quantum
  • Kou Hamada + 2 more

Encoding logical qubits with surface codes and performing multi-qubit logical operations with lattice surgery is one of the most promising approaches to demonstrate fault-tolerant quantum computing. Thus, a method to efficiently schedule a sequence of lattice-surgery operations is vital for high-performance fault-tolerant quantum computing. A possible strategy to improve the throughput of lattice-surgery operations is splitting a large instruction into several small instructions, such as Bell state preparation and measurements, and executing a part of them in advance. However, scheduling methods to fully utilize this idea have yet to be explored. In this paper, we propose a fast and high-performance scheduling algorithm for lattice-surgery instructions leveraging this strategy. We achieved this by converting the scheduling problem of lattice-surgery instructions to a graph problem of embedding 3D paths into a 3D lattice, which enables us to explore efficient scheduling by solving path search problems in the 3D lattice. Based on this reduction, we propose a method to solve the path-finding problems, the look-ahead Dijkstra projection. We numerically show that this method reduced the execution time of benchmark programs generated from quantum phase estimation algorithms by 3.8 times compared with a naive method based on greedy algorithms. Our study establishes the relation between the lattice-surgery scheduling and graph search problems, which leads to further theoretical analysis on compiler optimization of fault-tolerant quantum computing.
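The reduction described in this abstract, from scheduling to path search in a 3D lattice, can be illustrated with a small sketch. It uses plain Dijkstra with unit edge costs and a greedy one-request-at-a-time router, so it resembles the naive baseline rather than the look-ahead Dijkstra projection, whose details the abstract does not give. The function names, the 6-neighbour connectivity, and the rule that a routed path occupies its cells are all assumptions made for illustration.

```python
import heapq

# The six axis-aligned neighbours of a lattice cell (assumed connectivity).
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def shortest_path_3d(size, blocked, start, goal):
    """Plain Dijkstra on a 3D lattice with unit edge costs.

    size    -- (sx, sy, sz) lattice dimensions
    blocked -- set of cells that may not be used
    Returns the list of cells from start to goal, or None if unreachable.
    """
    if start in blocked or goal in blocked:
        return None
    sx, sy, sz = size
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            # Walk the predecessor links back to the start.
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if d > dist[node]:
            continue  # stale heap entry
        x, y, z = node
        for dx, dy, dz in NEIGHBOURS:
            nxt = (x + dx, y + dy, z + dz)
            if not (0 <= nxt[0] < sx and 0 <= nxt[1] < sy and 0 <= nxt[2] < sz):
                continue
            if nxt in blocked:
                continue
            nd = d + 1
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    return None

def route_all(size, requests):
    """Greedy routing: embed each (start, goal) request in order,
    marking the cells of every routed path as occupied."""
    occupied = set()
    routes = []
    for start, goal in requests:
        path = shortest_path_3d(size, occupied, start, goal)
        routes.append(path)
        if path is not None:
            occupied.update(path)
    return routes
```

For example, `shortest_path_3d((3, 1, 2), {(1, 0, 0)}, (0, 0, 0), (2, 0, 0))` has to detour through the second layer because the direct cell is blocked.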

  • Research Article
  • 10.1109/tnnls.2026.3677427
FLASH: Energy-Efficient FPGA Acceleration via Linear Approximation and Streamlined Two-Stage Pipeline Architectures for Quantized CNN-Transformer Hybrid Networks.
  • Apr 1, 2026
  • IEEE transactions on neural networks and learning systems
  • Nam Joon Kim + 7 more

Hybrid Vision Transformers (HybridViTs), which integrate convolutional neural networks (CNNs) with Transformer blocks, offer both local and global feature extraction capabilities, achieving high performance across a range of computer vision tasks. However, the substantial computational asymmetry between lightweight CNN blocks and compute-intensive Transformer blocks presents significant challenges for simultaneous optimization and acceleration within a single hardware architecture. To address these challenges, we propose FLASH, a power-efficient field-programmable gate array (FPGA)-based accelerator tailored for CNN-Transformer hybrid networks. FLASH reduces quantization overhead by consolidating redundant quantization-dequantization operations into a single requantization step and enables 8-bit integer-only computation for residual connections through proper scaling factor handling. To further optimize for hardware efficiency, FLASH introduces hardware-friendly linear approximations of nonlinear functions such as Swish and Softmax. By precomputing row-wise max values through offline calibration, we eliminate both max-value search logic and intermediate memory buffering overhead, while reusing shared integer-exponential units to minimize resource consumption. Architecturally, FLASH employs a two-stage pipeline: Stage 1 eliminates external DRAM access using a fully pipelined MobileNetV2 backbone, while Stage 2 accelerates Transformer and convolutional components through specialized compute units and dataflow optimizations. Experimental evaluation using MobileViT (MViT)-xxs on Xilinx VCU118 FPGA demonstrates that FLASH incurs only a 0.84% accuracy drop on ImageNet-1K compared to the FP32 baseline, while achieving up to $16.8\times $ lower power consumption and $26.3\times $ improvement in energy efficiency relative to CPU/GPU implementations. 
These results establish FLASH as an energy-efficient hardware accelerator for real-time inference of HybridViT models on edge devices.
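The idea of consolidating a dequantize-then-quantize pair into one requantization step can be sketched as follows. This is an illustration only, assuming symmetric per-tensor int8 quantization; the function names and scale values are hypothetical and are not taken from FLASH, and a hardware design would typically replace the floating-point ratio with a fixed-point multiplier and shift.

```python
def clamp_int8(q):
    """Saturate an integer to the signed 8-bit range."""
    return max(-128, min(127, q))

def quantize(x, scale):
    """Real value -> int8 under symmetric scaling (assumed scheme)."""
    return clamp_int8(round(x / scale))

def dequantize(q, scale):
    """int8 -> real value."""
    return q * scale

def two_step(q_in, s_in, s_out):
    """Baseline: dequantize to a real value, then quantize again."""
    return quantize(dequantize(q_in, s_in), s_out)

def requantize(q_in, s_in, s_out):
    """Fused form: a single multiply by the precomputed ratio s_in / s_out."""
    ratio = s_in / s_out  # can be computed once, offline
    return clamp_int8(round(q_in * ratio))

# With power-of-two scales the arithmetic is exact, so both forms agree.
if __name__ == "__main__":
    s_in, s_out = 0.25, 0.5
    mismatches = [q for q in range(-128, 128)
                  if two_step(q, s_in, s_out) != requantize(q, s_in, s_out)]
    print("mismatches:", len(mismatches))
```

With scales that are not powers of two, floating-point rounding can make the two forms differ by one step at exact ties, which is one reason to validate a fused path against the reference on calibration data.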

  • Research Article
  • 10.52710/cfs.984
Closed-Loop Binary Optimization: Integrating De-Identified Production Telemetry into the Build Lifecycle
  • Mar 16, 2026
  • Computer Fraud and Security
  • Varun Raj

Modern optimization techniques for performance mainly operate on the final binary emitted by the compiler. Profile-Guided Optimization (PGO) is a model of performance optimization: rather than applying heuristics to select optimizations at compile time, PGO selects optimizations based on run-time profiling of the program. Static compilation cannot predict the dynamic control flow. The cache behavior will also depend on the workload running in production machines. By measuring the execution in production, compilers can learn the frequency of hot paths and the requirements of branch prediction, caches, and instruction scheduling. Instrumentation overhead is reduced by a load-test infrastructure that runs copies of production traffic. Privacy-sensitive user data is sanitized by privacy-preserving de-identification pipelines. Query structure is preserved to allow possible optimizations in the process of data management. Continuous profiling maintains its effectiveness over time as both execution environments and workloads change. Autotuning, the process of finding optimal compiler settings for the specific workload, is increasingly realized through machine learning techniques. When deployed as standard infrastructure at the production grade, binary optimization offers new economic value through better resource utilization and lower latency services, and can offer a virtuous circle of improvement for high-performance digital infrastructure everywhere through using real-world telemetry to feed into the compiler toolchain.

  • Research Article
  • 10.1109/tvcg.2025.3627171
Reimagining Disassembly Interfaces With Visualization: Combining Instruction Tracing and Control Flow With DisViz.
  • Feb 1, 2026
  • IEEE transactions on visualization and computer graphics
  • Shadmaan Hye + 2 more

In applications where efficiency is critical, developers may examine their compiled binaries, seeking to understand how the compiler transformed their source code and what performance implications that transformation may have. This analysis is challenging due to the vast number of disassembled binary instructions and the many-to-many mappings between them and the source code. These problems are exacerbated as source code size increases, giving the compiler more freedom to map and disperse binary instructions across the disassembly space. Interfaces for disassembly typically display instructions as an unstructured listing or sacrifice the order of execution. We design a new visual interface for disassembly code that combines execution order with control flow structure, enabling analysts to both trace through code and identify familiar aspects of the computation. Central to our approach is a novel layout of instructions grouped into basic blocks that displays a looping structure in an intuitive way. We add to this disassembly representation a unique block-based mini-map that leverages our layout and shows context across thousands of disassembly instructions. Finally, we embed our disassembly visualization in a web-based tool, DisViz, which adds dynamic linking with source code across the entire application. DisViz was developed in collaboration with program analysis experts following design study methodology and was validated through evaluation sessions with ten participants from four institutions. Participants successfully completed the evaluation tasks, hypothesized about compiler optimizations, and noted the utility of our new disassembly view. Our evaluation suggests that our new integrated view helps application developers in understanding and navigating disassembly code.

  • Research Article
  • 10.31891/2307-5732-2026-361-81
Architectural Approaches and Methods of Parametric Optimization of Distributed Data Storage in the Drilling Ecosystem
  • Jan 29, 2026
  • Herald of Khmelnytskyi National University. Technical sciences
  • Андрій Павлів

The current complex problems in the field of well construction and drilling automation require the introduction of innovative technologies to improve the efficiency of information support systems. In the context of the transition to the Drilling 4.0 concept, the rapid increase in the number of sensors and the implementation of high-frequency telemetry systems leads to the generation of massive volumes of heterogeneous data (Big Data). Transmitting this data to centralized cloud storage in real-time is often impossible due to the limited bandwidth of satellite communication channels and strict latency requirements. The scientific research represented in this article offers an integrated approach that provides a critical analysis of existing commercial drilling automation solutions, revealing that most of them focus on the algorithmization of mechanical processes, neglecting the architectural optimization of data flows. Traditional centralized methods of data management often result in significant latency and communication resource costs. With the increasing volume and complexity of telemetry data to be interpreted, it is becoming increasingly important to use new, automated architectural approaches. The purpose of this study is to improve conventional methods of data flow management by introducing a transition from a centralized to a hierarchical three-level architecture of distributed data storage (Edge-Fog-Cloud). A logical model of the system's operation has been developed, covering primary signal processing at the Edge level, aggregation at the Fog level, and global analytics at the Cloud level. The well-defined purpose of the study allows us to focus on the possibilities of introducing the latest computational technologies into the automated drilling process. The structure of the work is logically organized, ensuring consistency and coherence of the presentation. 
The methodological part of the study is characterized by an ideal combination of theory and practice, which makes the proposed approach understandable and useful for the scientific community. For the first time, the problem of parametric optimization of data distribution is formalized, and the use of Reinforcement Learning methods is substantiated for the dynamic adaptation of system parameters. A significant advantage is the development of algorithmic principles that take into account the specific needs and limitations when working with high-frequency telemetry data. This is a significant contribution to solving the problem of computational and network complexity that often arises in the context of remote drilling operations. The scientific novelty of the research consists in the implementation of adaptive optimization methods that open up new perspectives in the automation of data distribution processes based on the continuous collection of telemetry signals. Consequently, this scientific article is intended not only to improve the theoretical foundations of distributed databases but also to contribute to the improvement of existing approaches to drilling operations in the field. To practically verify and illustrate the proposed approaches, a custom software web application was developed. This module simulates the reception of streaming telemetry measurements, the real-time calculation of derived indicators (such as mechanical specific energy), and the formation of structured data packets for subsequent transmission from the Edge to the Fog level. Provided results can be effectively integrated into the practice of modern drilling automation, demonstrating broad potential for further development in this area by ensuring the operation of real-time control loops, guaranteeing data integrity, and reducing operational costs for IT infrastructure.

  • Research Article
  • 10.1109/access.2026.3651527
Evaluation of Thread Scalability and Compiler Optimization in Parallel Job Execution Using Unity’s Job System
  • Jan 1, 2026
  • IEEE Access
  • Patrick Rodney De Souza Machado + 2 more

Performance optimization is a critical concern in modern game development. Common bottlenecks arise from the high computational demands of game processing, physics simulations, and object interactions, challenges often exacerbated by the limitations of sequential processing. To address these issues, Unity, one of the most widely adopted game engines, offers the Unity Jobs System and Burst Compiler, which enable data-oriented parallel execution of code for improved efficiency. This study investigates the application of these tools to parallelize and optimize sequential algorithms. We analyze key performance factors, including thread scalability, agent count, and batch size. The results demonstrate that tailored configurations can yield performance improvements of up to 15.06×, with the Burst Compiler being essential in all scenarios. It is worth noting that single-threaded job implementations can sometimes rival multithreaded performance, particularly in smaller workloads, highlighting the benefits of transitioning from the MonoBehaviour architecture to Unity’s Data-Oriented Technology Stack (DOTS). As a contribution, this paper presents an in-depth evaluation of parallelization and optimization strategies within Unity’s Jobs System, with a focus on performance and scalability.

  • Research Article
  • 10.1109/les.2026.3677498
MQIL: Model Checking based Quantification of Information Leakage
  • Jan 1, 2026
  • IEEE Embedded Systems Letters
  • Priyanka Panigrahi + 1 more

Compiler optimization may introduce information leakage (IL) in a program. This opens up the scope of side-channel attacks through which secret inputs, such as keys, may leak via intermediate variables in cryptographic applications. Taint analysis is a well-known approach that tracks the information flow in a program. However, it has the problem of either under-tainting, leading to false negatives, or over-tainting, leading to false positive scenarios. In this work, we overcome these problems by tracking the information flow in a C program using a bounded model checker. Conceptually, we create a miter program of the input program and add assertions to verify the security property related to information flow using a model checker. In our experiment, we run the proposed approach for various cryptographic applications and show that it can successfully track the IL. Our experiments reveal that LLVM actually introduces IL during optimizations.
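The miter construction mentioned in this abstract can be illustrated with a toy check. The sketch runs two copies of a program on the same public input with different secrets and asserts that the observable result is identical. Exhaustive enumeration over small bit-widths stands in for the bounded model checker here, and both example programs are hypothetical; none of this reproduces MQIL itself.

```python
from itertools import product

def leaks(program, secret_bits=4, public_bits=4):
    """Miter-style check: compare two runs that share the public input
    but differ in the secret. Returns a counterexample
    (public, secret_a, secret_b) if the observable output differs,
    or None if no leak is found within the bounds."""
    for public in range(2 ** public_bits):
        for a, b in product(range(2 ** secret_bits), repeat=2):
            if program(public, a) != program(public, b):
                return (public, a, b)
    return None

def safe_program(public, secret):
    """Hypothetical program whose observable value ignores the secret."""
    tmp = secret ^ secret  # always zero
    return public + tmp

def leaky_program(public, secret):
    """Hypothetical program that exposes the low bit of the secret."""
    return public + (secret & 1)

if __name__ == "__main__":
    print(leaks(safe_program))
    print(leaks(leaky_program))
```

A model checker explores the same pair of executions symbolically, so it scales to input widths where enumeration is infeasible.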

  • Research Article
  • 10.1109/access.2026.3668840
Analyzing RISC-V compiler toolchain by adopting topic modelling
  • Jan 1, 2026
  • IEEE Access
  • Kirrat Shaikh + 6 more

Recently, developers have increasingly relied on open repositories and mail archives to build software, particularly in specialized domains where structured documentation is scarce. However, navigating and extracting useful knowledge from such scattered sources is a challenging and time-consuming task. This paper presents the first systematic effort to organize and analyze GitHub commit messages and mailing list patches using topic modeling techniques. The proposed technique is applied to the RISC-V compiler toolchain, where development primarily depends on code repositories and community discussions. By jointly modeling these heterogeneous sources, our method identifies recurring compiler-related themes such as auto-vectorization, intrinsics, and data types, enabling efficient retrieval of development knowledge. Our evaluation shows that for GitHub commit messages, Latent Semantic Analysis (LSA) achieves the highest CV coherence, while BERTopic provides the greatest topic diversity. For mailing list patches, BERTopic outperforms other models in CV coherence, whereas Word2Vec leads in topic diversity. In addition, we demonstrate practical retrieval scenarios using queries such as autovec vmerge, intrinsic vfdiv, and jalr uint32_t, highlighting key concerns related to efficient code generation, floating-point precision, and address calculation optimizations in the RISC-V compiler. 
Overall, the experimental results indicate that topic modeling effectively captures development trends that are difficult to uncover through manual inspection or keyword-based search. By providing an organized and coherent view of scattered knowledge, our approach helps bridge knowledge gaps in complex technical domains and accelerates development where resources are limited.

  • Research Article
  • 10.33545/27076636.2026.v7.i1a.145
A comparative research of cache-friendly data structures for beginner-level algorithms
  • Jan 1, 2026
  • International Journal of Computing, Programming and Database Management
  • Lukas Schneider + 1 more

This research examines cache-friendly data structures in the context of beginner-level algorithms, focusing on how memory access patterns influence practical performance beyond asymptotic complexity. While introductory algorithm courses emphasize Big-O analysis, modern processors rely heavily on cache hierarchies, making spatial and temporal locality critical to execution efficiency. The research compares arrays, linked lists, dynamic arrays, hash tables, and tree-based structures under common beginner algorithms such as linear search, traversal, insertion, and simple sorting. Controlled experiments were conducted using identical datasets, fixed compiler optimizations, and consistent hardware configurations to isolate cache behavior effects. Performance metrics included execution time, cache miss rates, and instruction counts. Results indicate that contiguous-memory structures, particularly arrays and dynamic arrays, consistently outperform pointer-based structures in traversal-heavy tasks due to superior cache utilization. Linked lists and naïve tree implementations exhibited higher cache miss penalties, even when theoretical complexity was comparable. Hash tables demonstrated mixed behavior, with cache efficiency strongly dependent on load factor and collision resolution strategy. The findings highlight a persistent gap between theoretical instruction and real-world performance intuition for novice programmers. By demonstrating measurable performance differences using simple algorithms, the research provides pedagogical evidence that cache awareness can be introduced early without overwhelming learners. The comparative analysis supports integrating memory locality concepts into beginner curricula to foster more accurate mental models of performance. Ultimately, the research argues that teaching cache-friendly data structure selection alongside algorithmic complexity improves code efficiency, scalability, and systems-level understanding. 
These insights are intended to guide educators in curriculum design and help beginners develop performance-conscious programming habits from the outset, aligning foundational algorithm education with contemporary hardware realities. Such alignment reinforces practical reasoning, encourages empirical evaluation, and bridges theory with systems thinking, enabling novices to write efficient programs while appreciating hardware constraints encountered in modern computing environments during early academic and professional development.

  • Research Article
  • 10.30574/wjaets.2025.17.3.1563
OPTIMIZING NVIDIA® GEFORCE RTX™ 5090 & "AMD RX 9070" for machine learning and artificial intelligence workload
  • Dec 31, 2025
  • World Journal of Advanced Engineering Technology and Sciences
  • Mohit Jain + 4 more

As consumer-grade GPUs have rapidly evolved, efforts have emerged to use them for model training and inference, workloads typically handled by data center hardware. The paper explores the optimization of two next-generation graphics processing units, the NVIDIA GeForce RTX 5090 and the AMD Radeon RX 9070, for the new generation of ML and AI applications. We examine the internal compute pipelines, tensor/matrix acceleration capabilities, memory hierarchies, and software ecosystems (CUDA/cuDNN/TensorRT versus ROCm/MIOpen/HIP) that influence ML performance in a two-pronged architectural and empirical study. Convolutional networks, transformer models, diffusion architectures, and graph neural networks are evaluated under a common benchmarking model: training, inference latency, power consumption, precision scaling (FP32-INT8), and bottlenecks. The results of the experiment have demonstrated that the performance profiles of the RTX 5090 and the RX 9070 are different, i.e., the acceleration performance of mixed precision and kernel fusion is higher in the RTX 5090 as compared to the throughput performance of the RX 9070 in the BF16/INT8 workloads with the high memory-bandwidth utilization. Platform-specific optimization strategies, such as kernel tuning, compiler optimization, memory prefetching, gradient checkpointing, and scaling to multiple GPUs, are developed and evaluated. Further, two case studies of real-world performance tuning, covering transformer fine-tuning and diffusion model inference, are also presented. The findings highlight that hardware alone does not guarantee the best ML performance; effective optimization can deliver performance gains that are even more significant than raw compute alone. 
The paper will provide a step-by-step roadmap for practitioners, researchers, and engineers who may want to optimize the application of RTX 5090 and RX 9070 in artificial intelligence algorithms, as well as a future perspective on the standard models of unified programming on GPUs and emergent precision formats.

  • Research Article
  • 10.30574/ijsra.2025.17.3.3244
C compiler porting and optimization for the 32-Bit Loong Arch CPU
  • Dec 31, 2025
  • International Journal of Science and Research Archive
  • Md Shahariar Idris Robin + 2 more

The LoongArch instruction set architecture (ISA) has become a cornerstone in efforts to build a secure, autonomous, and high-performance domestic computing ecosystem. To make Loongson processors practical for real software deployment, a dependable and well-optimized compiler is essential—particularly for emerging 32-bit platforms such as LoongArch32R. This study develops a complete and reproducible workflow for adapting and optimizing the GNU C Compiler (GCC) for LoongArch32R, enabling reliable instruction generation and performance-focused code transformation. The work combines several technical components: validation of the GCC backend, execution through QEMU in both user-level and system-level environments, incorporation of the MOS teaching operating system with custom benchmark applications, detailed examination of LSX SIMD auto-vectorization, and the introduction of a prototype custom vector instruction (VCUBE.W) through assembler-level extension. A structured benchmarking suite—including matrix multiplication, prime sieve, STREAM-like memory workloads, and memory operations—was implemented to evaluate optimization levels and compiler behavior. Performance measurements were analyzed and visualized using Python-based graphing tools. The experimental results show clear runtime improvements from standard optimization flags and demonstrate partial vectorization benefits, verifying that the ported compiler is functional, stable, and capable of generating efficient LoongArch32R code. Overall, the framework produced in this work offers a practical foundation for future compiler development, educational use, and broader software ecosystem support for LoongArch-based systems.

  • Research Article
  • 10.71086/iajse/v12i4/iajse1285
Simulation-Driven Multi-Kernel FPGA Dataflow Optimization for Low-Latency and Resource-Efficient Digital Signal Processing Pipelines
  • Dec 30, 2025
  • International Academic Journal of Science and Engineering
  • Laura Virtanen + 1 more

The need for hardware acceleration in digital signal processing (DSP) applications for edge computing, wireless communication, and multimedia analytics is increasingly apparent, because these applications demand high throughput, low latency, and low resource usage. FPGAs are an attractive platform for such applications because of their reconfigurable architecture and their ability to execute highly parallel processing pipelines. Most current DSP accelerators, however, are designed to optimise a single algorithm, e.g. the fast Fourier transform or filtering operations; they are therefore inflexible and use resources inefficiently when more than one kernel must run in the same pipeline. In this paper, a simulation-driven multi-kernel FPGA dataflow optimization framework is proposed that models several DSP kernels within one hardware pipeline. The methodology integrates dataflow graph modelling, latency-sensitive scheduling, and resource-sharing techniques with the aim of maximising throughput and computational efficiency. A synthetic signal workload generator mimics various application conditions, making it possible to evaluate the design without real-time data. Hardware simulation and synthesis-based estimation are used to measure the latency, throughput, power consumption, and FPGA resource usage of the architecture. The presented multi-kernel pipeline architecture was found to achieve much better pipeline utilisation and processing latency than traditional single-kernel accelerators. The results indicate that simulation-based architectural exploration offers a feasible approach to developing scalable DSP accelerators for edge computing and communication systems.

  • Research Article
  • 10.1145/3786763
Scaling Inter-procedural Dataflow Analysis on the Cloud
  • Dec 26, 2025
  • ACM Transactions on Programming Languages and Systems
  • Zewen Sun + 12 more

Apart from forming the backbone of compiler optimization, static dataflow analysis has been widely applied in a vast variety of applications, such as bug detection, privacy analysis, and program comprehension. Despite its importance, performing interprocedural dataflow analysis on large-scale programs is well known to be challenging. In this paper, we propose a novel distributed analysis framework supporting general interprocedural dataflow analysis. Inspired by large-scale graph processing, we devise dedicated distributed worklist algorithms for both whole-program analysis and incremental analysis. We implement these algorithms and develop a distributed framework called BigDataflow running on a large-scale cluster. The experimental results validate the promising performance of BigDataflow: it can finish analyzing programs with millions of lines of code in minutes. Compared with the state of the art, BigDataflow achieves much higher analysis efficiency.
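For readers unfamiliar with worklist algorithms, a classic sequential version is sketched here for reaching definitions on a single control-flow graph. This is the textbook single-machine, intraprocedural form, not BigDataflow's distributed or incremental algorithm; the CFG encoding and the `gen`/`kill` dictionaries are assumptions made for the example.

```python
from collections import deque

def reaching_definitions(cfg, gen, kill):
    """Forward may-analysis solved with a worklist.

    cfg  -- dict: node -> list of successor nodes
    gen  -- dict: node -> set of definitions created in the node
    kill -- dict: node -> set of definitions overwritten by the node
    Returns dict: node -> set of definitions reaching the node's exit.
    """
    # Build predecessor lists once.
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)

    out = {n: set() for n in cfg}
    worklist = deque(cfg)
    while worklist:
        n = worklist.popleft()
        # IN[n] is the union of OUT over all predecessors.
        in_n = set().union(*(out[p] for p in preds[n]))
        new_out = gen[n] | (in_n - kill[n])
        if new_out != out[n]:
            out[n] = new_out
            # Only successors can be affected by the change.
            worklist.extend(s for s in cfg[n] if s not in worklist)
    return out

if __name__ == "__main__":
    # entry defines x (d1); branch "a" redefines x (d2); "b" does not.
    cfg = {"entry": ["a", "b"], "a": ["join"], "b": ["join"], "join": []}
    gen = {"entry": {"d1"}, "a": {"d2"}, "b": set(), "join": set()}
    kill = {"entry": {"d2"}, "a": {"d1"}, "b": set(), "join": set()}
    print(reaching_definitions(cfg, gen, kill)["join"])
```

Both definitions reach `join` because the two branches merge there. A distributed variant has to partition this worklist and the `out` map across machines, which is the part the paper addresses.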

  • Research Article
  • Cite count: 1
  • 10.1038/s41598-025-32572-z
Optimal compilation strategies for QFT circuits in neutral-atom quantum computing
  • Dec 25, 2025
  • Scientific Reports
  • Dingchao Gao + 3 more

Neutral-atom quantum computing (NAQC) offers distinct advantages such as dynamic qubit reconfigurability, long coherence times, and high gate fidelities, making it a promising platform for scalable quantum computing. Among existing implementations, the Dynamically Field-Programmable Qubit Array (DPQA) architecture has emerged as the most prominent NAQC platform, enabling large-scale, high-fidelity operations through dynamic atom rearrangement and global Rydberg excitation. Despite these strengths, efficiently implementing quantum circuits like the Quantum Fourier Transform (QFT) remains a significant challenge due to atom-movement overheads and connectivity constraints. This paper introduces optimal compilation strategies tailored to QFT circuits on the DPQA architecture, addressing these challenges for both linear and grid-like configurations. By minimizing atom movements, the proposed methods achieve theoretical lower bounds in movement counts while preserving high circuit fidelity. Comparative evaluations against state-of-the-art DPQA compilers demonstrate the superior performance of the proposed methods, which could serve as benchmarks for evaluating the performance of future DPQA compilers.

  • Research Article
  • 10.3390/electronics15010008
Research on Binary Decompilation Optimization Based on Fine-Tuned Large Language Models for Vulnerability Detection
  • Dec 19, 2025
  • Electronics
  • Yidan Wang + 3 more

The proliferation of binary vulnerabilities in the software supply chain has become a critical security challenge. Existing vulnerability detection approaches—including dynamic analysis, static analysis, and decompilation-assisted analysis—all suffer from limitations such as insufficient coverage, high false-positive and false-negative rates, or poor compatibility. Although decompilation technology can serve as a bridge connecting binary-code and source-code vulnerability detection tools, current schemes suffer from inadequate semantic restoration quality and lack of tool compatibility. To address these issues, this paper proposes LLMVulDecompiler, a binary decompilation model based on fine-tuned large language models designed to generate high-precision decompiled code that integrates directly with source-code static analysis tools. We construct a dedicated training and evaluation dataset that covers multiple compiler optimization levels (e.g., O0–O3) and a diverse set of program functionalities. We adopt a two-stage fine-tuning strategy that involves first building foundational decompilation capabilities, then enhancing vulnerability-specific features. Additionally, we design a low-cost inference pipeline and establish multi-dimensional evaluation criteria, including restoration similarity, compilation success rate, and functional correctness. Experimental results show that the model significantly outperforms baseline models in terms of average edit distance, compilation success rate, and black-box test pass rate on the HumanEval-C benchmark. In tests on 12 real-world CVE (Common Vulnerabilities and Exposures) instances, the approach achieved a detection accuracy of 91.7%, with substantially reduced false-positive and false-negative rates. This study demonstrates the effectiveness of specialized fine-tuning of large language models for binary decompilation and vulnerability detection, offering a new pathway for binary security analysis.
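One of the metrics named above, edit distance, can be illustrated with a short Python sketch of the classic Levenshtein distance between decompiled output and reference source. The abstract does not say whether the paper measures distance over characters or tokens, or how it normalizes the value, so the character-level, unnormalized form below and the function name `edit_distance` are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists).

    Counts the minimum number of single-element insertions, deletions,
    and substitutions needed to turn `a` into `b`.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, item_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                        # delete item_a
                cur[j - 1] + 1,                     # insert item_b
                prev[j - 1] + (item_a != item_b),   # substitute or match
            ))
        prev = cur
    return prev[-1]
```

Because the function accepts any sequence, the same code works on token lists, for example `edit_distance(decompiled.split(), reference.split())`, if a token-level comparison is preferred.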

  • Research Article
  • 10.1145/3779444
REATA: An Efficient Vision Transformer Accelerator Featuring a Resource-Optimized Attention Design on Versal ACAP
  • Dec 11, 2025
  • ACM Transactions on Reconfigurable Technology and Systems
  • Wenbo Zhang + 4 more

Deploying Vision Transformers (ViTs) on edge devices poses significant challenges due to their high computational demands and memory access overheads, which severely hinder real-time inference efficiency. This paper proposes a modular and adaptive ViT acceleration architecture targeting the AMD Versal ACAP platform. By leveraging heterogeneous resource collaboration and fine-grained dataflow optimizations, the proposed design addresses performance bottlenecks effectively. We introduce a resource-efficient attention computation module that localizes self-attention operations within AI Engine (AIE) core clusters, thereby reducing inter-module communication and minimizing MAC resource usage. In parallel, a resource-aware multi-stage pipeline scheduling strategy dynamically partitions and parallelizes the computation-intensive feed-forward network (FFN), improving computation reuse and module-level coordination. The architecture integrates parameter tiling and a PLIO-based broadcasting mechanism to construct a decoupled compute-communication dataflow engine, alleviating memory bottlenecks. Experimental results on the Xilinx VCK5000 ACAP platform demonstrate that the proposed design achieves 33.2 TOPS throughput at INT8 precision—outperforming the state-of-the-art EQ-ViT accelerator by 27%—while maintaining a competitive efficiency of 510.6 GOPS/W. Scalability evaluations on ViT-Base and DeiT-Tiny confirm the design’s adaptability in edge scenarios, offering a resource-efficient and reconfigurable hardware paradigm for high-density Transformer inference.

  • Research Article
  • 10.36676/jmk.v5.i2.93
Compiler-Assisted Optimization Using Neural Code Embeddings for Heterogeneous Architectures
  • Dec 8, 2025
  • Journal of Multidisciplinary Knowledge
  • Matteo R Donelli

The growing reliance on heterogeneous hardware (CPUs, GPUs, TPUs, NPUs) complicates compiler optimization because traditional rule-based heuristics cannot capture subtle performance interactions across architectures. This paper introduces NEO-Opt, a compiler-assisted optimization framework that integrates neural code embeddings and predictive performance models into the LLVM toolchain. Instead of relying exclusively on manually engineered heuristics, NEO-Opt learns optimization preferences from large corpora of real workloads and micro-benchmarks. The system represents code fragments using graph-based embeddings derived from control-flow and data-flow structures, which are fed into a multi-task predictor that estimates latency, memory pressure, and accelerator utilization. Based on these predictions, NEO-Opt selects optimization passes and scheduling policies dynamically. Evaluation across mixed computing platforms shows average performance gains of 12–22% for GPU-intensive workloads and up to 30% for compute-bound CPU kernels. Case studies illustrate how learned embeddings capture optimization opportunities missed by conventional compilers. We also analyze limitations, including embedding drift and poor generalization to rarely used instructions.

  • Research Article
  • 10.1002/cpe.70456
Dynamic Load Balancing for Distributed Large Model Training: A Hybrid Framework of Gray Markov Chain and MDP
  • Dec 2, 2025
  • Concurrency and Computation: Practice and Experience
  • Yonggang Li + 5 more

Large-scale model training in distributed data centers plays a crucial role in deep learning. Still, it faces significant challenges, including resource fragmentation, low bandwidth utilization, and complex task flow management. The problem is exacerbated by high-speed, high-capacity parameter synchronization, often exceeding several hundred Gbps, which leads to reduced throughput and computational inefficiencies. To address these challenges, this paper proposes an innovative approach that combines data parallelism, model parallelism, and dynamic load balancing. By integrating a Gray Markov Chain (GMC) and Markov Decision Process (MDP) model, the approach dynamically schedules resources and balances computational loads. The GMC model is used to predict future node loads, facilitating optimal weight matrix decomposition, while the MDP model adjusts data transmission paths to optimize network traffic management. The combination of these two models enhances both resource allocation and data flow optimization. Experimental results demonstrate that this integrated approach significantly improves throughput, resource utilization, and computational efficiency compared to traditional methods. The findings suggest that this hybrid approach performs exceptionally well in optimizing large-scale distributed training tasks in multi-data-center environments, significantly improving the scalability and performance of deep learning workloads. This research shows promising implications for enhancing the efficiency and effectiveness of distributed training systems in high-demand applications.

  • Research Article
  • 10.21869/2223-1560-2025-29-3-99-112
Modeling and implementation of Common LISP functional language compiler
  • Nov 29, 2025
  • Proceedings of the Southwest State University
  • A A Chaplygin

Purpose of research. To create a compiler model for the functional language Common Lisp, implement this model, and test it on a target virtual machine in order to increase the execution speed of programs. Methods. A formal compiler model of the functional language Common Lisp was built using denotational semantics. Compilation takes place in several stages. At the first stage, the source language is transformed into an intermediate lambda language in which all macros are expanded, embedded forms are transformed into similar expressions, and variable names are replaced with local, global, and deep references. At the second stage, the expression in the intermediate language is transformed from a tree structure into a linear list of primitive instructions of the target virtual machine. Results. The resulting primitive instructions are encoded by a special assembler into numeric code for execution on the target virtual machine. Compilation also produces a list of constants and the amount of memory required for the compiled program to run. The target virtual machine consists of memory sections for the encoded program, constants, global variables, the stack, and the list of activation frames, together with registers (accumulator, stack pointer, instruction pointer, current activation frame). Activation frames are array objects that store a pointer to the previous frame, the call depth level number, and local arguments. Garbage collection uses the mark-and-sweep method. Conclusion. A compiler model for the functional language Common Lisp was built and implemented. Compared to the interpreter, programs run on average 20 times faster. Further speed increases can be achieved by applying various compiler optimizations at different stages. Simple candidates include optimization of arithmetic expressions, elimination of unnecessary instructions, and simplification of expressions.
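The second compilation stage described above, turning an expression tree into a linear list of primitive instructions for an accumulator-based virtual machine, can be sketched in Python as follows. This is an illustration only: the instruction names (`CONST`, `PUSH`, `ADD`, `SUB`, `MUL`), the tuple encoding of expressions, and both function names are hypothetical and are not taken from the paper's instruction set.

```python
import operator

# Hypothetical binary primitives of the toy machine.
BINARY_OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def compile_expr(expr, code=None):
    """Flatten a nested expression tree into a linear instruction list.

    expr is either a number or a tuple (op, left, right), where op is a
    key of BINARY_OPS.
    """
    if code is None:
        code = []
    if isinstance(expr, (int, float)):
        code.append(("CONST", expr))   # load the constant into the accumulator
    else:
        op, left, right = expr
        compile_expr(left, code)       # left operand ends up in the accumulator
        code.append(("PUSH",))         # save it on the stack
        compile_expr(right, code)      # right operand ends up in the accumulator
        code.append((op,))             # accumulator = popped left <op> accumulator
    return code


def run(code):
    """Execute the instruction list on a toy accumulator-plus-stack machine."""
    acc, stack = None, []
    for instruction in code:
        name = instruction[0]
        if name == "CONST":
            acc = instruction[1]
        elif name == "PUSH":
            stack.append(acc)
        else:
            acc = BINARY_OPS[name](stack.pop(), acc)
    return acc
```

For example, `compile_expr(("ADD", 1, ("MUL", 2, 3)))` yields a flat list of seven instructions, and passing that list to `run` evaluates the expression without any recursion at execution time, which is one reason a compiled form can outpace a tree-walking interpreter.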
