Articles published on Runtime system
1206 Search results
- Research Article
- 10.1007/s42979-025-04577-y
- Dec 24, 2025
- SN Computer Science
- Rüdiger Nather + 1 more
Abstract Task-based parallel programming is a common approach to using modern multicore architectures efficiently. In this approach, a programmer describes the computation as a set of possibly nested tasks and their dependencies. The dependencies can be dynamic, meaning that they can only be discovered at runtime. Dynamic dependencies can be expressed with the future construct, which comes in several variants. The C++ standard, for instance, defines (shared) futures that may be stored in data structures, accessed by multiple tasks, and filled through an associated promise that can be transferred between tasks. These futures cannot be instantiated with incomplete types, however. Recent algorithmic research suggested that both the features of C++ futures and support for incomplete types are necessary to enable nested futures for the synchronization of nested tasks. This paper describes the first implementation of such futures, called flex-futures, in the Taskflow programming system. It describes the corresponding extensions of the Taskflow programming model, user interface, and runtime system. The extended system is evaluated with a benchmark that mimics the LU decomposition of hierarchical matrices. We found that flex-futures come with a higher overhead than static dependencies, but still achieve comparable performance while offering greater flexibility.
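The (shared-)future pattern the abstract attributes to standard C++ — a future stored in a data structure, read by multiple tasks, and filled later through a separate promise — can be sketched by analogy in Python's concurrent.futures (a toy illustration only, not the paper's flex-future API):

```python
from concurrent.futures import Future, ThreadPoolExecutor

# A future stored in a data structure, read by several tasks, and
# filled later by another task -- the dependency shape the abstract
# describes for C++ shared futures.
slot = {"result": Future()}

def consumer(name):
    # Blocks until some other task fills the future via set_result.
    return name, slot["result"].result()

with ThreadPoolExecutor(max_workers=3) as pool:
    readers = [pool.submit(consumer, i) for i in range(3)]
    slot["result"].set_result(42)  # the "promise" side: fill the shared future
    print(sorted(f.result() for f in readers))
# -> [(0, 42), (1, 42), (2, 42)]
```

All three consumers observe the same value once the producer fills the future, which is what lets such a construct express dependencies that are only discovered at runtime.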
- Research Article
- 10.1007/s42979-025-04405-3
- Nov 5, 2025
- SN Computer Science
- Jonas Posner + 5 more
Abstract Dynamic resource management enables supercomputing applications to change resource allocations at runtime. This capability promises significant improvements in application efficiency and overall supercomputer utilization. However, adoption is limited by insufficient support in resource managers and programming environments. Furthermore, developing resource-flexible applications introduces significantly higher programming complexity than their static counterparts. While MPI extensions have been proposed for resource flexibility, significant programmability challenges persist. The “Dynamic Processes with PSets (DPP)” design principles define programming-model-agnostic abstractions for dynamic resource control, and have been prototypically implemented by extending Open MPI and OpenPMIx (termed MPI-DPP). MPI-DPP enables fine-grained process management but relies on low-level message passing, complicating the implementation of dynamic and irregular workloads. Asynchronous Many-Task (AMT) programming offers a compelling alternative. AMT splits computations into fine-grained tasks dynamically scheduled by the runtime system, enabling load balancing and responsiveness to resource changes. Although resource-flexible AMTs remain rare, GLB is a notable exception, offering automatic load balancing and dynamic resource capabilities. However, GLB is built on “APGAS for Java”, which is uncommon in HPC. We present DPP-GLB, a C++ AMT runtime that integrates GLB’s high-level task abstraction and load balancing with the resource control capabilities of MPI-DPP. We evaluate DPP-GLB, GLB, and MPI-DPP on SuperMUC-NG, analyzing both programming complexity and runtime performance. Results show that GLB is easy to use, featuring built-in load balancing and resource flexibility. MPI-DPP offers superior performance for node changes, albeit at the cost of increased programming complexity. DPP-GLB achieves a balance of low programming complexity and efficient, scalable dynamic resource support.
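The AMT idea the abstract relies on — fine-grained tasks dynamically scheduled by the runtime and drained by however many workers are currently allocated — can be sketched as a toy shared task queue (illustrative only; not the GLB or DPP-GLB API):

```python
from queue import Queue, Empty
from threading import Thread

# Toy AMT sketch: fine-grained tasks in a shared queue, drained by
# however many workers the runtime currently has (not the GLB API).
def run_tasks(tasks, workers):
    q, results = Queue(), []
    for t in tasks:
        q.put(t)

    def worker():
        while True:
            try:
                t = q.get_nowait()
            except Empty:
                return            # queue drained: worker retires
            results.append(t())   # list.append is thread-safe in CPython

    threads = [Thread(target=worker) for _ in range(workers)]
    for th in threads: th.start()
    for th in threads: th.join()
    return sorted(results)

print(run_tasks([lambda i=i: i * i for i in range(5)], workers=3))
# -> [0, 1, 4, 9, 16]
```

Because tasks are pulled rather than statically assigned, changing `workers` between runs changes only throughput, not correctness — the property that makes AMT attractive for dynamic resource management.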
- Research Article
- 10.1109/tmc.2025.3586797
- Nov 1, 2025
- IEEE Transactions on Mobile Computing
- Chengfei Lv + 5 more
ARSys: An Efficient and Cross-Platform Development, Deployment, and Runtime System for Mobile Augmented Reality
- Research Article
- 10.3390/en18205550
- Oct 21, 2025
- Energies
- Yifan Song + 6 more
In the heating, ventilation, and air conditioning (HVAC) systems of mushroom growing control rooms, traditional rule-based control methods are commonly adopted. However, these methods suffer from response delays, which leave energy-saving potential underutilized and let energy costs constitute a disproportionately high share of overall production costs. It is therefore crucial to minimize the running time of the air conditioning system while maintaining the optimal growing environment for mushrooms. To address these issues, this paper proposes a sensor optimization method based on the combination of principal component analysis (PCA) and information entropy. Furthermore, model predictive control (MPC) was implemented using a gated recurrent unit (GRU) neural network with an attention mechanism (GRU-Attention) as the prediction model to optimize the air conditioning system. First, a method combining PCA and information entropy was proposed to select the three most representative sensors from the 16 sensors in the mushroom room, thus eliminating redundant information and correlations. Then, a temperature prediction model based on GRU-Attention was adopted, with its hyperparameters optimized using the Optuna framework. Finally, an improved crayfish optimization algorithm (ICOA) was proposed as an optimizer for MPC, with the objective of solving the control sequence with high accuracy and low energy consumption. The average energy consumption was reduced by approximately 11.2%, achieving a more stable temperature control effect.
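The PCA half of the sensor-selection idea can be sketched in a toy example: score each of 16 synthetic sensors by its contribution to the leading principal components and keep the top three (the paper additionally combines PCA with information entropy; the data and the choice of three dominant sensors below are made up):

```python
import numpy as np

# Toy sketch of PCA-based sensor selection: rank sensors by their
# contribution to the leading principal components and keep the top k.
# (Illustrative only; the paper also folds in information entropy.)
rng = np.random.default_rng(0)
readings = rng.normal(size=(200, 16))          # 200 samples x 16 sensors
readings[:, 3]  += 5 * rng.normal(size=200)    # make sensors 3, 7, 11 dominant
readings[:, 7]  += 4 * rng.normal(size=200)
readings[:, 11] += 3 * rng.normal(size=200)

X = readings - readings.mean(axis=0)
_, s, vt = np.linalg.svd(X, full_matrices=False)      # PCA via SVD
k = 3
scores = (s[:k, None] ** 2 * vt[:k] ** 2).sum(axis=0)  # variance each sensor
selected = np.argsort(scores)[-k:]                     # contributes to top-k PCs
print(sorted(selected.tolist()))
# -> [3, 7, 11]
```

The selected sensors are exactly the ones given extra variance, mirroring the paper's goal of discarding redundant, correlated channels before building the prediction model.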
- Research Article
- 10.1145/3765522
- Oct 13, 2025
- ACM Transactions on Embedded Computing Systems
- Alan Burns + 1 more
This article presents a general formal framework for describing the relationship between a criticality-aware scheduler, a set of application jobs that are assigned different criticality levels, and an environment that generates both work and faults that the run-time system must control. The proposed formalism extends the rely-guarantee approach, which facilitates formal reasoning about the functional behaviour of concurrent systems, to address real-time properties. The exposition of the general framework is supplemented by a seven-step approach that enables it to be instantiated to deliver the formal specification of any proposed mixed-criticality scheduling protocol. The expressive power of the approach is explored via a non-trivial instantiation.
- Research Article
- 10.1145/3763133
- Oct 9, 2025
- Proceedings of the ACM on Programming Languages
- João Pereira + 4 more
Many imperative programming languages offer global variables to implement common functionality such as global caches and counters. Global variables are typically initialized by module initializers (e.g., static initializers in Java), code blocks that are executed automatically by the runtime system. When or in what order these initializers run is typically not known statically and modularly. For instance, in Java, initialization is triggered dynamically upon the first use of a class, while in Go, the order depends on all packages of a program. As a result, reasoning modularly about global variables and their initialization is difficult, especially because module initializers may perform arbitrary side effects and may have cyclic dependencies. Consequently, existing modular verification techniques either do not support global state or impose drastic restrictions that are not satisfied by mainstream languages and programs. In this paper, we present the first practical verification technique to reason formally and modularly about global state and its initialization. Our technique is based on separation logic and uses module invariants to specify ownership and values of global variables. A partial order on modules and methods allows us to reason modularly about when a module invariant may be soundly assumed to hold, irrespective of when exactly the module initializer establishing it runs. Our technique supports both thread-local and shared global state. We formalize it as a program logic in Iris and prove its soundness in Rocq. We make only minimal assumptions about the initialization semantics, making our technique applicable to a wide range of programming languages. We implemented our technique in existing verifiers for Java and Go and demonstrate its effectiveness on typical use cases of global state as well as a substantial codebase implementing an Internet router.
- Research Article
- 10.1145/3763180
- Oct 9, 2025
- Proceedings of the ACM on Programming Languages
- Humphrey Burchell + 1 more
Optimizing performance on top of modern runtime systems with just-in-time (JIT) compilation is a challenge for a wide range of applications from browser-based applications on mobile devices to large-scale server applications. Developers often rely on sampling-based profilers to understand where their code spends its time. Unfortunately, sampling of JIT-compiled programs can give inaccurate and sometimes unreliable results. To assess the accuracy of such profilers, we would ideally want to compare their results to a known ground truth. With the complexity of today’s software and hardware stacks, such ground truth is unfortunately not available. Instead, we propose a novel technique to approximate a ground truth by accurately slowing down a Java program at the machine-code level, preserving its optimization and compilation decisions as well as its execution behavior on modern CPUs. Our experiments demonstrate that we can slow down benchmarks by a specific amount, which is a challenge because of the optimizations in modern CPUs, and we verified with hardware profiling that on a basic-block level, the slowdown is accurate for blocks that dominate the execution. With the benchmarks slowed down to specific speeds, we confirmed that Async-profiler, JFR, JProfiler, and YourKit maintain original performance behavior and assign the same percentage of run time to methods. Additionally, we identify cases of inaccuracy caused by missing debug information, which prevents the correct identification of the relevant source code. Finally, we tested the accuracy of sampling profilers by approximating the ground truth by the slowing down of specific basic blocks and found large differences in accuracy between the profilers. We believe our slowdown-based approach is the first practical methodology to assess the accuracy of sampling profilers for JIT-compiling systems and will enable further work to improve the accuracy of profilers.
- Research Article
- 10.1145/3760257
- Sep 26, 2025
- ACM Transactions on Embedded Computing Systems
- Serhan Gener + 7 more
Efficient memory management in heterogeneous systems is increasingly challenging due to diverse compute architectures (e.g., CPU, GPU, and FPGA) and dynamic task mappings not known at compile time. Existing approaches often require programmers to manage data placement and transfers explicitly, or assume static mappings that limit portability and scalability. This article introduces RIMMS (Runtime Integrated Memory Management System), a lightweight, runtime-managed, hardware-agnostic memory abstraction layer that decouples application development from low-level memory operations. RIMMS transparently tracks data locations, manages consistency, and supports efficient memory allocation across heterogeneous compute elements without requiring platform-specific tuning or code modifications. We integrate RIMMS into a baseline runtime and evaluate with complete radar signal processing applications across CPU+GPU and CPU+FPGA platforms. RIMMS delivers up to 2.43× speedup on GPU-based and 1.82× on FPGA-based systems over the baseline. Compared to IRIS, a recent heterogeneous runtime system, RIMMS achieves up to 3.08× speedup and matches the performance of native CUDA implementations while significantly reducing programming complexity. Despite operating at a higher abstraction level, RIMMS incurs only 1–2 cycles of overhead per memory management call, making it a low-cost solution. These results demonstrate RIMMS’s ability to deliver high performance and enhanced programmer productivity in dynamic, real-world heterogeneous environments.
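The core idea of a runtime-managed memory abstraction — the runtime tracks where each buffer currently resides and triggers a transfer only when another device touches it — can be sketched in a toy model (conceptual only; RIMMS's actual interface is not described in the abstract):

```python
# Toy sketch of runtime-tracked data placement: the runtime records the
# current device of each buffer and copies only on first use elsewhere.
# (Conceptual illustration, not the RIMMS API.)
class MemoryManager:
    def __init__(self):
        self.location = {}   # buffer id -> device it currently resides on
        self.copies = 0      # transfers the runtime had to perform

    def ensure_on(self, buf, device):
        if self.location.get(buf) != device:
            self.copies += 1          # a real runtime would issue a DMA here
            self.location[buf] = device
        return buf

mm = MemoryManager()
mm.ensure_on("tile0", "cpu")    # first touch: placed on the CPU
mm.ensure_on("tile0", "gpu")    # touched by the GPU: one transfer
mm.ensure_on("tile0", "gpu")    # already resident: no transfer
print(mm.copies)
# -> 2
```

The application code never names a transfer explicitly; it only declares which device is about to use the buffer, which is the decoupling the abstract describes.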
- Research Article
- 10.7717/peerj-cs.3041
- Aug 29, 2025
- PeerJ Computer Science
- Oscar Peña-Cáceres + 4 more
In smart environments, autonomous systems often adapt their behavior to the context, and although such adaptations are generally beneficial, they may cause users to struggle to understand or trust them. To address this, we propose an explanation generation system that produces natural language descriptions (explanations) to clarify the adaptive behavior of smart home systems at runtime. These explanations are customized based on user characteristics and the contextual information derived from the user interactions with the system. Our approach leverages a prompt-based strategy using a fine-tuned large language model, guided by a modular template that integrates key data such as the type of explanation to be generated, user profile, runtime system information, interaction history, and the specific nature of the system adaptation. As a preliminary step, we also present a conceptual model that characterizes explanations in the domain of autonomous systems by defining their core concepts. Finally, we evaluate the user experience of the generated explanations through an experiment involving 118 participants. Results show that generated explanations are perceived as positive, with a high level of acceptance.
- Research Article
- 10.1145/3747529
- Aug 5, 2025
- Proceedings of the ACM on Programming Languages
- Serkan Muhcu + 3 more
While enabling use cases such as backtracking search and probabilistic programming, multiple resumptions have the reputation of being incompatible with efficient implementation techniques, such as stack switching. This paper sets out to resolve this conflict and thus bridge the gap between expressiveness and performance. To this end, we present a compilation strategy and runtime system for lexical effect handlers with support for multiple resumptions and stack-allocated mutable state. By building on garbage-free reference counting and associating stacks with stable prompts, our approach enables constant-time continuation capture and resumption when resumed exactly once, as well as constant-time state access. Nevertheless, we also support multiple resumptions by copying stacks when necessary. We practically evaluate our approach by implementing an LLVM backend for the Effekt language. A performance comparison with state-of-the-art systems, including dynamic and lexical effect handler implementations, suggests that our approach achieves competitive performance and the increased expressiveness only comes with limited overhead.
- Research Article
- 10.1177/10943420251363435
- Jul 29, 2025
- The International Journal of High Performance Computing Applications
- Nicolas Nytko + 4 more
Legacy codes are in ubiquitous use in scientific simulations; they are well-tested and there is significant time investment in their use. However, one challenge is the adoption of new, sometimes incompatible computing paradigms, such as GPU hardware. In this paper, we explore using automated code translation to enable execution of legacy multigrid solver code on GPUs without significant time investment and while avoiding intrusive changes to the codebase. We developed a thin, reusable translation layer that parses Fortran 2003 at compile time, interfacing with the existing library Loopy (Klöckner, 2014) to transpile to C++/GPU code, which is then managed by a custom MPI runtime system that we created. With this low-effort approach, we are able to achieve a payoff of an approximately 2–3× speedup over a full CPU socket, and 6× in multi-node settings.
- Research Article
- 10.1145/3750448
- Jul 24, 2025
- ACM Transactions on Architecture and Code Optimization
- Neel Patel + 2 more
Recent chip multiprocessors incorporate several on-chip accelerators, marking the beginning of the Accelerated Chip Multi-Processor (XMP) era in datacenters. Despite the close proximity of accelerators and general-purpose cores, offloading functions to accelerators may not always be beneficial. Offloading to hardware accelerators can introduce several end-to-end overheads that can negate the speedup of the accelerable function. In this paper, we design RACER, a hardware architecture and runtime system that evades the danger of end-to-end slowdowns when using hardware acceleration. RACER leverages a low-overhead interface between general-purpose cores and on-chip accelerators, fine-grained context switching, accelerator-initiated preemption, and seamless data motion between general-purpose cores and accelerators to improve the performance of workloads that use on-chip accelerators. We evaluate RACER on five representative request processing workloads featuring diverse memory access patterns, accelerable functions, and compute intensities. RACER improves the performance of hardware acceleration on a real XMP by an average of 1.31× on a range of diverse workloads and guarantees that accelerator offloads never cause slowdowns.
- Research Article
- 10.7717/peerj-cs.2966
- Jul 11, 2025
- PeerJ Computer Science
- Paul Cardosi + 1 more
Parallelization is needed everywhere, from laptops and mobile phones to supercomputers. Among parallel programming models, task-based programming has demonstrated a powerful potential and is widely used in high-performance scientific computing. Not only does it allow efficient parallelization across distributed heterogeneous computing nodes, but it also allows for elegant source code structuring by describing hardware-independent algorithms. In this article, we present Specx, a task-based runtime system written in modern C++. Specx supports distributed heterogeneous computing by simultaneously exploiting central processing units (CPUs) and graphics processing units (GPUs) (CUDA/HIP) and incorporating communication into the task graph. We describe the specificities of Specx and demonstrate its potential by running parallel applications.
- Research Article
- 10.1109/tc.2025.3558042
- Jul 1, 2025
- IEEE Transactions on Computers
- Guoqing Xiao + 4 more
DCGG: A Dynamically Adaptive and Hardware-Software Coordinated Runtime System for GNN Acceleration on GPUs
- Research Article
- 10.54097/nbxhw030
- Jun 30, 2025
- Journal of Computing and Electronic Information Management
- Xiaojun Li + 1 more
The research aims to integrate cloud computing and Unified Modeling Language (UML) technology to optimize enterprise project information management systems. The present work summarizes some recent literature on cloud computing and UML technology and analyzes the functional requirements of enterprise project information management systems. Then, a small financial company's project information management experience is selected as an example for a case study. Besides, the system load balancing algorithm based on Hadoop and the HBase distributed data storage network are constructed through UML software modeling. The experimental results indicate that the system running time span of the information management system built based on cloud computing and UML technology attains 2,750 ms under 400 tasks, higher than other algorithms. Meanwhile, the query time of the HBase distributed database network can improve the data transmission efficiency to a certain extent. This research has practical reference and application value in enhancing the efficiency of enterprise project information search and management.
- Research Article
- 10.52710/cfs.854
- May 31, 2025
- Computer Fraud and Security
- Balaramakrishna Alti
AI-Powered Governance-as-Code for Secure Linux Operations in FinTech Infrastructure
- Research Article
- 10.1145/3725985
- May 31, 2025
- ACM Transactions on Computer Systems
- Sheng Qi + 3 more
Serverless computing separates function execution from state management. Simple retry-based fault tolerance might corrupt the shared state with duplicate updates. Existing solutions employ log-based fault tolerance to achieve exactly-once semantics, where every single read or write to the external state is associated with a log for deterministic replay. However, logging is not a free lunch: it introduces considerable overhead to stateful serverless applications. We present Halfmoon, a serverless runtime system for fault-tolerant stateful serverless computing. Our key insight is that it is unnecessary to symmetrically log both reads and writes. Instead, it suffices to log either reads or writes, i.e., asymmetrically. We design two logging protocols that enforce exactly-once semantics while providing log-free reads and writes, which are suitable for read- and write-intensive workloads, respectively. We theoretically prove that the two protocols are log-optimal, i.e., no other protocols can achieve lower logging overhead than our protocols. We provide a criterion for choosing the right protocol for a given workload, and a pauseless switching mechanism to switch protocols for dynamic workloads. We implement a prototype of Halfmoon. Experiments show that Halfmoon achieves 20%–40% lower latency and 1.5–4.0× lower logging overhead than the state-of-the-art solution Boki.
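The asymmetric-logging insight — log only reads or only writes, chosen by workload mix — can be illustrated with a toy key-value store that counts log entries (a conceptual model, not Halfmoon's actual protocols):

```python
# Conceptual sketch of asymmetric logging: record EITHER reads OR writes
# for deterministic replay, and pick the mode that matches the workload.
# (Toy model only; not Halfmoon's protocols.)
class Store:
    def __init__(self, log_reads):
        self.data, self.log, self.log_reads = {}, [], log_reads

    def read(self, k):
        v = self.data.get(k)
        if self.log_reads:
            self.log.append(("r", k, v))   # replaying reads fixes their values
        return v

    def write(self, k, v):
        if not self.log_reads:
            self.log.append(("w", k, v))   # replaying writes fixes the state
        self.data[k] = v

def log_cost(log_reads, reads, writes):
    s = Store(log_reads)
    for i in range(writes): s.write(i, i)
    for i in range(reads):  s.read(i % max(writes, 1))
    return len(s.log)

# Match the logged side to the rarer operation and the log stays small.
print(log_cost(log_reads=False, reads=90, writes=10))  # -> 10
print(log_cost(log_reads=True,  reads=10, writes=90))  # -> 10
```

Logging the wrong side of the same workloads would cost 90 entries instead of 10, which is the intuition behind choosing a protocol per workload and switching as the mix changes.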
- Research Article
- 10.63561/jmsc.v2i3.856
- May 30, 2025
- Faculty of Natural and Applied Sciences Journal of Mathematical and Statistical Computing
- Joseph Bamikole Olojido + 1 more
As distributed systems become more prevalent, the frequency of distributed attacks—such as distributed denial-of-service (DDoS) and worms—is increasing. Traditional intrusion detection systems struggle to identify and report threats efficiently and in a timely manner. Consequently, various Distributed Intrusion Detection Systems (DIDS) utilizing machine learning algorithms have been implemented. However, their effectiveness has been limited due to high computational costs and suboptimal accuracy levels. This paper aims to enhance the fusion-based data mining model for intrusion detection in distributed environments. Two well-known distributed attack datasets, NSL-KDD’15 and UNSW-NB’15, were utilized in this study. The Fusion-Based Data Mining Model (FBDMM) was chosen as the evaluation framework due to its widespread use. To minimize computational costs, both Principal Component Analysis (PCA) and Information Gain Ratio (IGR) were employed to extract the five most significant features from each dataset. Classifiers such as Support Vector Machines (SVM), Naïve Bayes (NB), and Multilayer Perceptron (MLP) were hybridized using a voting classification technique to boost accuracy. The hybridized FBDMM comprised six classifiers: PCA+SVM, IGR+SVM, PCA+NB, IGR+NB, PCA+MLP, and IGR+MLP. The evaluation results were compared across these six classifiers based on accuracy (ACC), detection rate (DR), and false alarm rate (FAR). Computational costs, measured in System Running Time (SRT), were compared between five-feature and full-feature sets: forty-one features for NSL-KDD’15 and forty-nine features for UNSW-NB’15. The FBDMM achieved ACC, DR, and FAR values of 77.78, 96.98, and 2.55, respectively, while the best performance among individual classifiers for NSL-KDD’15 was 72.17, 92.29, and 2.71. For UNSW-NB’15, the FBDMM recorded ACC, DR, and FAR values of 85.58, 95.98, and 3.35, respectively, with the best performance from individual classifiers being 82.88, 97.23, and 4.66. The SRT for NSL-KDD’15 was 10 seconds with five features and 5,200 seconds with forty-one features, while for UNSW-NB’15, it was 9 seconds with five features and 68,000 seconds with forty-nine features. The findings indicate that the FBDMM outperforms existing data mining models used in Distributed Intrusion Detection Systems in terms of both accuracy and computational cost. Therefore, the FBDMM is recommended for use in distributed intrusion detection.
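The voting-based fusion step can be sketched as simple majority voting over the base classifiers' labels (illustrative only; the hypothetical predictions below stand in for SVM, NB, and MLP outputs after PCA or IGR feature reduction):

```python
from collections import Counter

# Minimal majority-voting fusion: each base classifier emits one label
# per sample; the fused model takes the most common label.
# (Illustrative sketch; the abstract names SVM, NB, and MLP as bases.)
def vote(predictions):
    return Counter(predictions).most_common(1)[0][0]

sample_preds = [
    ("attack", "attack", "normal"),   # hypothetical SVM, NB, MLP outputs
    ("normal", "attack", "normal"),
]
print([vote(p) for p in sample_preds])
# -> ['attack', 'normal']
```

Fusion of this kind can outvote an individual classifier's mistake on a sample, which is the mechanism behind the accuracy gain the abstract reports.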
- Research Article
- 10.1145/3742939.3742951
- May 28, 2025
- ACM SIGAda Ada Letters
- Sara Royuela + 5 more
The number and diversity of embedded Field-Programmable Gate Array (FPGA) Multi-Processor Systems on Chip (MPSoCs) in modern satellites is increasing, and so is the complexity and cost of using them efficiently (i.e., optimally exploiting the available resources) and safely (i.e., complying with the applicable safety and availability constraints). Programming languages traditionally used in critical real-time systems were not designed to address the extreme parallelism of modern platforms. To address this limitation, OpenMP, the de facto standard for exploiting parallelism in shared-memory systems in the HPC domain, is increasingly considered a suitable solution in critical domains. OpenMP implements a comprehensive set of computation models (e.g., data and task parallelism, host and accelerator support), comes with an extensive set of assets (e.g., tools, libraries), and supports a large set of CPU and accelerator devices (e.g., GR740, MPPA, NVIDIA Jetson and Xilinx Ultrascale+). Despite preliminary analysis proving the productivity and efficiency of OpenMP in the space, automotive, and railway domains, some challenges must be addressed. This paper introduces LIONESS, a project funded by the European Space Agency (ESA) proposing an advanced OpenMP framework that combines enhancements in the parallel programming model with adapted compiler and runtime systems to provide benefits along two axes: (1) resilience, through providing fault-tolerance techniques, and (2) heterogeneity, through enabling the design space exploration of multiple deployment configurations considering multi-cores and accelerator devices.
- Research Article
- 10.1007/s10817-025-09721-0
- May 14, 2025
- Journal of Automated Reasoning
- Sheera Shamsu + 5 more
The OCaml programming language finds application across diverse domains, including systems programming, web development, scientific computing, formal verification, and symbolic mathematics. OCaml is a memory-safe programming language that uses a garbage collector (GC) to free unreachable memory. It features a low-latency, high-performance GC, tuned for functional programming. The GC has two generations—a minor heap collected using a copying collector and a major heap collected using an incremental mark-and-sweep collector. Alongside the intricacies of an efficient GC design, the OCaml compiler uses efficient object representations for some object classes, such as interior pointers for supporting mutually recursive functions, which further complicates the GC design. The GC is a critical component of the OCaml runtime system, and its correctness is essential for the safety of OCaml programs. In this paper, we propose a strategy for crafting a correct, proof-oriented GC from scratch, designed to evolve over time with additional language features. Our approach neatly separates abstract GC correctness from OCaml-specific GC correctness, offering the ability to integrate further GC optimizations while preserving core abstract GC correctness. As an initial step to demonstrate the viability of our approach, we have developed a verified stop-the-world mark-and-sweep GC for OCaml. The approach is fully mechanized in F* and its low-level subset Low*. We use the KaRaMel compiler to compile Low* to C, and integrate the verified GC with the OCaml runtime. Our GC is evaluated against the off-the-shelf OCaml GC and the Boehm–Demers–Weiser conservative GC, and the experimental results show that the verified OCaml GC is competitive with the standard OCaml GC.
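The mark-and-sweep family the abstract refers to can be sketched as a toy stop-the-world collector over an explicit object graph (illustrative of the algorithm class only, not the verified OCaml GC):

```python
# Toy stop-the-world mark-and-sweep over an explicit object graph.
# (Illustrates the algorithm family only, not the verified OCaml GC.)
class Obj:
    def __init__(self, name):
        self.name, self.refs, self.marked = name, [], False

def collect(roots, heap):
    stack = list(roots)                   # mark phase: trace from the roots
    while stack:
        o = stack.pop()
        if not o.marked:
            o.marked = True
            stack.extend(o.refs)
    live = [o for o in heap if o.marked]  # sweep phase: keep marked objects,
    for o in live:                        # everything else is reclaimed
        o.marked = False                  # reset marks for the next cycle
    return live

a, b, c, d = (Obj(n) for n in "abcd")
a.refs = [b]; b.refs = [c]; d.refs = [a]   # d is unreachable from the root
heap = [a, b, c, d]
print(sorted(o.name for o in collect([a], heap)))
# -> ['a', 'b', 'c']
```

Note that `d` is swept even though it points *at* a live object: reachability is traced from the roots outward, which is exactly the invariant a verified GC must prove it preserves.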