- Research Article
- 10.1145/3758321
- Sep 26, 2025
- ACM Transactions on Embedded Computing Systems
- Gaurav Kumar + 4 more
Main memory plays a pivotal role in the storage of computational data across a wide range of applications, including highly sensitive assets such as banking transactions, cryptographic keys, and user credentials. However, memory systems remain vulnerable to advanced physical and side-channel attacks, including cold boot attacks that exploit residual data after power-down. To mitigate such risks, Intel’s DDR3 memory scrambler uses a Linear Feedback Shift Register (LFSR)-based stream cipher to obscure memory contents. Nevertheless, this mechanism has been shown to be susceptible to the stencil attack, a cold boot technique that reconstructs the scrambling key by exploiting the linear and periodic nature of the keystream. This article proposes a novel, lightweight, and secure scrambling architecture based on a generic LFSR, designed to harden DDR3 memory against cold boot attacks. The proposed generic LFSR-based mechanism eliminates differential keystream periodicity by introducing an address- and seed-dependent LFSR structure, thereby rendering differential key recovery techniques computationally infeasible. Furthermore, unlike traditional AES-based memory encryption, which incurs high latency and area overhead, the proposed approach achieves comparable security guarantees with low hardware complexity and zero access latency. Hardware implementation results on the Xilinx VCU118 FPGA show that the proposed scheme consumes only 252 LUTs, 256 registers, and 104 slices, comparable to the Intel DDR3 scrambler, while offering superior resilience against cold boot, warm boot, and probing attacks. These results demonstrate the practicality of the proposed scheme for secure memory systems in resource-constrained environments.
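The XOR-stream structure at the heart of such a scrambler can be sketched in a few lines. The LFSR width, tap positions, and address-mixing constant below are illustrative placeholders, not the parameters of the paper's construction:

```python
def lfsr_keystream(seed: int, taps: tuple, width: int, nbits: int) -> int:
    """Generate nbits of keystream from a Fibonacci LFSR.

    seed -- nonzero initial state (width bits)
    taps -- bit positions (0-indexed) XORed to form the feedback bit
    """
    state = seed & ((1 << width) - 1)
    assert state != 0, "LFSR state must be nonzero"
    out = 0
    for i in range(nbits):
        out |= (state & 1) << i          # emit LSB as the next keystream bit
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1       # feedback = XOR of the tap bits
        state = (state >> 1) | (fb << (width - 1))
    return out


def scramble_word(data: int, addr: int, seed: int) -> int:
    # Address- and seed-dependent keystream: mix the address into the seed
    # (illustrative mixing only; the paper's structure is more involved).
    mixed = (seed ^ (addr * 0x9E3779B1)) & 0xFFFF or 1
    ks = lfsr_keystream(mixed, taps=(0, 2, 3, 5), width=16, nbits=32)
    return data ^ ks                     # XOR stream cipher: scramble == descramble
```

Because scrambling is a pure XOR with a keystream derived from the address and seed, applying it twice recovers the plaintext, which is what gives the design zero access latency on the read path.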
- Research Article
- 10.1145/3762656
- Sep 26, 2025
- ACM Transactions on Embedded Computing Systems
- Gabriele Tombesi + 5 more
Tiled accelerator architectures provide opportunities to optimize the performance of multi-model augmented and virtual reality (AR/VR) applications through intra-layer parallelism and inter-layer pipelining. However, balancing these two strategies is a difficult task that demands a flexible architecture for deploying models and an optimization approach capable of selecting an optimal strategy from an enormous mapping space. This article presents FLIP2M, a holistic solution for mapping multi-model AR/VR workloads on tiled architectures. FLIP2M consists of (1) FLIP, an acceleration fabric that supports a wide variety of optimizations through flexible on-chip communication, and (2) OASIS, an optimization framework based on dynamic and constraint programming that is capable of selecting an efficient strategy for mapping multi-model workloads onto FLIP. We demonstrate FLIP2M on an FPGA prototype of FLIP that features 36 accelerators and 7 DDR4 controllers. Using OASIS-generated mappings for three different multi-model AR/VR workloads, FLIP2M achieves up to 1.94× improvement in latency, 1.37× in energy, and 2.59× in energy-delay product relative to a FLIP baseline without intra-layer resource allocation flexibility and inter-layer pipelining.
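One subproblem any such mapper must solve, cutting a model's layers into contiguous pipeline stages, has a compact dynamic-programming form. The sketch below minimizes the bottleneck stage latency; it is a simplified slice of the mapping space (it ignores intra-layer parallelism and inter-model interactions, which OASIS also optimizes), and the layer latencies are hypothetical:

```python
def partition_layers(latency, stages):
    """Split layers into `stages` contiguous pipeline stages, minimizing
    the bottleneck (max) stage latency.

    dp[k][i] = best achievable bottleneck for the first i layers in k stages.
    """
    n = len(latency)
    prefix = [0] * (n + 1)               # prefix[i] = total latency of layers 0..i-1
    for i, t in enumerate(latency):
        prefix[i + 1] = prefix[i] + t
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(stages + 1)]
    dp[0][0] = 0
    for k in range(1, stages + 1):
        for i in range(1, n + 1):
            for j in range(k - 1, i):    # last stage covers layers j..i-1
                cost = max(dp[k - 1][j], prefix[i] - prefix[j])
                dp[k][i] = min(dp[k][i], cost)
    return dp[stages][n]
```

For example, splitting layers with latencies [4, 2, 6, 3, 5] across two stages yields a bottleneck of 12 ([4, 2, 6] | [3, 5]), while three stages bring it down to 8 ([4] | [2, 6] | [3, 5]).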
- Research Article
- 10.1145/3759918
- Sep 26, 2025
- ACM Transactions on Embedded Computing Systems
- Gaurav Narang + 5 more
Graph Neural Networks (GNNs) are made up of multiple layers, with each layer comprising different compute kernels involving weight vectors and adjacency matrices of the input graph dataset. These layers exhibit varying features such as sparsity, storage requirement, and impact on predictive accuracy. Non-volatile memory (NVM)-based 3D Processing-In-Memory (PIM) architectures offer a promising approach to accelerate GNN inferencing. However, NVM device-based crossbars suffer from various non-idealities that affect the overall predictive accuracy. In this work, we consider the problem of finding a suitable mapping of GNN layers to PIM-based processing elements (PEs) in a 3D manycore architecture such that the impact of crossbar non-idealities on predictive accuracy is minimized. We develop a framework called GINA, which leverages a low-cost, approximate Hessian-based methodology to automatically determine the GNN layers that are critical for accuracy and find a suitable GNN layer to PE mapping. To tackle non-idealities and to exploit sparsity at the crossbar level, a subset of the full crossbar is activated in a cycle, referred to as an Operation Unit (OU). However, OU configurations vary with the above-mentioned GNN layer features, time-dependent conductance drift, and input graph dataset. GINA learns to optimize the OU configuration for unseen datasets as a function of GNN layer features and time-dependent conductance drift. Our experimental results demonstrate that the GINA-enabled 3D PIM architecture reduces the latency and energy by 7.4× and 13× on average, respectively, compared to state-of-the-art PIM architectures without compromising the predictive accuracy. Finally, we demonstrate the applicability of GINA to Convolutional Neural Networks (CNNs) and Vision Transformers.
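In its simplest form, the mapping idea, steering the layers that matter most for accuracy toward the least error-prone crossbars, reduces to a sort-and-pair greedy. The sensitivity scores below stand in for GINA's approximate Hessian-based criticality estimates, and the per-PE error figures are hypothetical:

```python
def map_layers_to_pes(sensitivity, pe_error):
    """Greedy sketch: the most accuracy-critical layers (highest approximate
    Hessian sensitivity) are assigned to the PEs with the smallest crossbar
    non-ideality. Returns a layer_index -> pe_index mapping."""
    layers = sorted(range(len(sensitivity)), key=lambda i: -sensitivity[i])
    pes = sorted(range(len(pe_error)), key=lambda j: pe_error[j])
    return {l: p for l, p in zip(layers, pes)}
```

With sensitivities [0.1, 0.9, 0.5] and PE error levels [0.3, 0.05, 0.2], the most sensitive layer (index 1) lands on the cleanest PE (index 1), and so on down the ranking.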
- Research Article
- 10.1145/3761812
- Sep 26, 2025
- ACM Transactions on Embedded Computing Systems
- Srinivasan Subramaniyan + 1 more
GPUs have recently been adopted in many real-time embedded systems. However, existing GPU scheduling solutions are mostly open-loop and rely on the estimation of worst-case execution time (WCET). Although adaptive solutions, such as feedback control scheduling, have been previously proposed to handle this challenge for CPU-based real-time tasks, they cannot be directly applied to GPUs, because GPUs have different and more complex architectures, and existing schedulable utilization bounds do not yet apply to them. In this article, we propose FC-GPU, the first Feedback Control GPU scheduling framework for real-time embedded systems. To model the GPU resource contention among tasks, we analytically derive a multi-input-multi-output (MIMO) system model that captures the impacts of task rate adaptation on the response times of different tasks. Building on this model, we design a MIMO controller that dynamically adjusts task rates based on measured response times. Our extensive hardware testbed results on an Nvidia RTX 3090 GPU and an AMD MI-100 GPU demonstrate that FC-GPU can provide better real-time performance even when the task execution times significantly increase at runtime.
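The closed-loop idea can be illustrated with a single control step. The proportional, per-task update below is a deliberately simplified stand-in for the paper's MIMO controller (which also captures inter-task contention via off-diagonal model terms); the gain and rate bounds are hypothetical:

```python
def adjust_rates(rates, resp_times, targets, gain=0.5, r_min=1.0, r_max=100.0):
    """One control step: lower a task's rate when its measured response
    time exceeds its target, and raise it when there is slack."""
    new = []
    for r, m, t in zip(rates, resp_times, targets):
        err = (t - m) / t                # positive error => slack, negative => overrun
        r2 = r * (1.0 + gain * err)      # proportional rate adaptation
        new.append(min(r_max, max(r_min, r2)))  # clamp to admissible rates
    return new
```

A task missing its 1.0 s target by 2x has its rate halved, while a task finishing in half its target gets a 25% rate increase, so the loop steers measured response times toward their setpoints without any WCET estimate.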
- Research Article
- 10.1145/3762190
- Sep 26, 2025
- ACM Transactions on Embedded Computing Systems
- Jiyong Kim + 6 more
State Space Model (SSM)-based machine learning architectures have recently gained significant attention for processing sequential data. Mamba, a recent sequence-to-sequence SSM, offers competitive accuracy with superior computational efficiency compared to state-of-the-art transformer models. While this advantage makes Mamba particularly promising for resource-constrained edge devices, no hardware acceleration frameworks are currently optimized for deploying it in such environments. This article presents eMamba, a comprehensive end-to-end hardware acceleration framework explicitly designed for deploying Mamba models on edge platforms. eMamba maximizes computational efficiency by replacing complex normalization layers with lightweight hardware-aware alternatives and approximating expensive operations, such as SiLU activation and exponentiation, considering the target applications. Then, it performs an approximation-aware neural architecture search (NAS) to tune the learnable parameters used during approximation. Evaluations with Fashion-MNIST, CIFAR-10, and MARS, an open-source human pose estimation dataset, show eMamba achieves comparable accuracy to state-of-the-art techniques using 1.63–19.9× fewer parameters. In addition, it generalizes well to large-scale natural language tasks, demonstrating stable perplexity across varying sequence lengths on the WikiText2 dataset. We also quantize and implement the entire eMamba pipeline on an AMD ZCU102 FPGA and ASIC using GlobalFoundries (GF) 22 nm technology. Experimental results show 4.95–5.62× lower latency and 2.22–9.95× higher throughput, with 4.77× smaller area, 9.84× lower power, and 48.6× lower energy consumption than baseline solutions while maintaining competitive accuracy.
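The flavor of approximation involved can be seen by replacing the exact SiLU activation with a piecewise-linear sigmoid. The hard-swish-style variant below is an illustrative stand-in, not eMamba's learned, NAS-tuned approximation:

```python
import math


def silu(x):
    """Exact SiLU: x * sigmoid(x), which needs an exponential."""
    return x / (1.0 + math.exp(-x))


def silu_approx(x):
    """Piecewise-linear sigmoid (hard-swish style): only a shift, a
    multiply, and a clamp, so it maps cheaply to edge hardware."""
    return x * min(1.0, max(0.0, x / 6.0 + 0.5))
```

Over a typical activation range the two stay within roughly 0.15 of each other, which is the kind of bounded error that the framework's approximation-aware NAS can then compensate for during training.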
- Research Article
- 10.1145/3762189
- Sep 26, 2025
- ACM Transactions on Embedded Computing Systems
- Suraj Meshram + 2 more
Cyber-Physical Systems controlling assembly line operations are central to manufacturing processes. Assembly line systems have diversified over time, depending on multiple factors: the products being manufactured, the workstations and resources used, factory layouts, and so on. This diversity in assembly line configurations has added layers of complexity to the Assembly Line Balancing Problem (ALBP). While many powerful meta-heuristic techniques exist, their performance can vary significantly depending on the specific characteristics of the ALBP instance, such as the structure of the precedence graph, the distribution of task times, and the number of workstations. Recognizing the need for a more versatile solution, this article introduces a generic local search strategy called Flexible Meta-Heuristic (FMH), which includes a set of adjustable tuning parameters for adapting to specific scenarios. FMH combines and extends the strengths of Hill Climbing (HC), Simulated Annealing (SA), and Genetic Algorithm (GA) to provide effective solutions across a wide range of problems. Through extensive experiments using standard benchmarks and randomly generated datasets, FMH demonstrates high accuracy, deviating by at most 0.9% from best-known benchmark values. Additionally, FMH is significantly less resource-intensive, solving problems with up to 150 tasks in minutes where exact solvers can take hours, making it more scalable and applicable to large industrial scenarios. Our findings suggest that the algorithm’s flexibility and strategic hyper-parameter tuning contribute significantly to its effectiveness in solving diverse ALBPs.
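The simulated-annealing ingredient can be sketched on a stripped-down balancing instance: assign tasks to stations so as to minimize the cycle time (the maximum station load). Precedence constraints are omitted for brevity, and the temperature schedule and move set are illustrative rather than FMH's tuned parameters:

```python
import math
import random


def anneal_balance(times, n_stations, iters=2000, t0=5.0, seed=0):
    """SA sketch for a simplified line-balancing instance: minimize the
    cycle time (max station load) by reassigning one task per move."""
    rng = random.Random(seed)
    assign = [i % n_stations for i in range(len(times))]  # round-robin start

    def cycle(a):
        loads = [0.0] * n_stations
        for t, s in zip(times, a):
            loads[s] += t
        return max(loads)

    cur = best = cycle(assign)
    best_assign = list(assign)
    for k in range(iters):
        temp = t0 * (1.0 - k / iters) + 1e-9       # linear cooling
        i = rng.randrange(len(times))
        old, assign[i] = assign[i], rng.randrange(n_stations)
        new = cycle(assign)
        # Accept improvements always; accept worse moves with Boltzmann probability.
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            cur = new
            if new < best:
                best, best_assign = new, list(assign)
        else:
            assign[i] = old                         # undo rejected move
    return best, best_assign
```

FMH layers hill-climbing and GA-style recombination moves on top of this kind of annealing loop, with its tuning parameters controlling how the strategies blend per instance.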
- Research Article
- 10.1145/3761810
- Sep 26, 2025
- ACM Transactions on Embedded Computing Systems
- Chi-Chieh Hung + 3 more
In key-value store systems, data security is often prioritized through compression and encryption of stored key-value pairs, ensuring protection against unauthorized access and breaches. However, these security measures introduce significant performance overheads, particularly during read operations, due to the need for decryption and decompression of data packs. This overhead is exacerbated in log-structured merge-tree (LSM-tree) based systems interfaced with NAND flash memory, where read amplification—caused by accessing entire compressed and encrypted units for a small subset of data—degrades performance. To address this challenge, we propose ReLoaD (Repacking Locality Data), a novel locality-based strategy designed to optimize read performance in encrypted key-value systems without compromising security or compression efficiency. ReLoaD leverages dynamic access pattern analysis to reorganize frequently co-accessed key-value pairs into contiguous storage packs, reducing the frequency of costly decryption and decompression operations. By introducing lightweight in-memory data structures—such as the PackInfo and Remapthl mapping tables—and innovative mechanisms like the locality-aware compactor and reloading repacker, ReLoaD enhances data locality within packs, minimizes I/O overhead, and increases the pack read ratio. Experimental evaluations using real-world workloads from X (formerly known as Twitter) and IBM, executed on the RocksDB platform, demonstrate that ReLoaD achieves up to a 38% improvement in read latency compared to state-of-the-art solutions like TinyEnc, while maintaining minimal impact on write performance. With a memory footprint of less than 3 MB, ReLoaD offers a scalable and practical approach to balancing security and performance, making it well-suited for modern secure storage systems deployed in resource-constrained environments.
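The core repacking intuition, placing keys that are read together into the same encrypted pack so that one decrypt-and-decompress serves several reads, can be sketched as a co-access count followed by greedy pairing. Pack size is fixed at two keys here for brevity (real packs hold many pairs), and the windowed counting is an illustrative simplification of the dynamic access-pattern analysis:

```python
from collections import Counter


def repack(trace, window=2):
    """Greedy locality-based repacking sketch: keys appearing within
    `window` accesses of each other in the trace are candidates for
    sharing a pack. Returns a list of packs (lists of keys)."""
    co = Counter()
    for i, a in enumerate(trace):
        for b in trace[i + 1 : i + 1 + window]:
            if a != b:
                co[tuple(sorted((a, b)))] += 1     # count co-accesses
    packs, placed = [], set()
    for (a, b), _ in co.most_common():             # strongest pairs first
        if a in placed or b in placed:
            continue
        packs.append([a, b])
        placed.update((a, b))
    for k in dict.fromkeys(trace):                 # leftovers: singleton packs
        if k not in placed:
            packs.append([k])
            placed.add(k)
    return packs
```

On a trace where "a"/"b" and "c"/"d" alternate, the sketch groups each hot pair into its own pack, which is exactly the situation where the pack read ratio improves.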
- Research Article
- 10.1145/3763236
- Sep 26, 2025
- ACM Transactions on Embedded Computing Systems
- Lars Willemsen + 5 more
We introduce and study transfer schedulability, a novel concept that describes how properties of a reference schedule derived from a scheduling algorithm \(\mathcal {A}\) are transferred onto another scheduling algorithm \(\mathcal {B}\) for a given task system and fixed arrival times. Specifically, we say schedulability is transferred from \(\mathcal {A}\) to \(\mathcal {B}\) if the task set is schedulable under \(\mathcal {B}\) whenever all deadlines are met in the reference schedule produced by \(\mathcal {A}\). We identify a sufficient criterion for schedulability to be transferred on uniprocessor systems, which we verify with the Rocq proof assistant, and based on this criterion develop runtime mechanisms that enforce transfer schedulability. We relate transfer schedulability to prior approaches from the literature and demonstrate how the concept can be utilized to avoid timing anomalies and lower runtime scheduling overheads, including for non-preemptive scheduling, self-suspending tasks, and directed acyclic graph (DAG) tasks whose edges induce delays. Our evaluation on synthesized task sets shows improved schedulability compared to standard scheduling algorithms. We also evaluate the number of interventions necessary to transfer schedulability, and demonstrate that the proposed runtime mechanisms eliminate timing anomalies (like a completely static, fully table-driven approach) while achieving a response-time distribution closely resembling those of classic dynamic, event-driven schedulers such as EDF.
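The precondition of a transfer argument, "all deadlines are met in the reference schedule", can be checked mechanically. The discrete-time preemptive EDF simulator below is a minimal sketch of such a reference-schedule check (jobs are (arrival, execution time, absolute deadline) triples with fixed arrivals, matching the paper's setting); the actual criterion and enforcement mechanisms are far richer:

```python
def edf_meets_deadlines(jobs):
    """Discrete-time preemptive EDF simulation on a uniprocessor.
    Each job is (arrival, wcet, absolute_deadline). Returns True iff
    every job completes by its deadline in the reference schedule."""
    remaining = [c for (_, c, _) in jobs]
    horizon = max(d for (_, _, d) in jobs)
    for t in range(horizon):
        ready = [i for i, (a, _, _) in enumerate(jobs)
                 if a <= t and remaining[i] > 0]
        if ready:
            j = min(ready, key=lambda i: jobs[i][2])  # earliest absolute deadline
            remaining[j] -= 1
            if remaining[j] == 0 and t + 1 > jobs[j][2]:
                return False                          # completed after deadline
    return all(r == 0 for r in remaining)             # unfinished work = miss
```

If this check passes for the reference algorithm, transfer schedulability asks under what conditions a second algorithm is guaranteed to meet the same deadlines without re-running any per-job analysis.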
- Research Article
- 10.1145/3760746
- Sep 26, 2025
- ACM Transactions on Embedded Computing Systems
- Dinesh Joshi + 2 more
In modern multi-processor systems-on-chips (MPSoCs), writebacks from the private caches to the shared cache can introduce significant performance bottlenecks, especially because multiple threads from different co-executing programs contend for the shared cache resources. Intelligent cache bypass decisions for writebacks help mitigate such contention and enhance the utilization of the shared cache. Most prior cache bypass strategies account for contention for shared cache capacity by focusing primarily on data reuse, with only recent research beginning to also consider bandwidth contention in dynamic bypass decisions. However, data sharing, a crucial characteristic of modern multithreaded workloads, remains largely overlooked by state-of-the-art cache bypass decisions. Bypassing highly shared cache lines can increase the volume of main memory accesses, potentially resulting in performance bottlenecks. We introduce SHARP, a novel cache bypass policy that incorporates data sharing, contention, and data reuse into its dynamic bypass decisions for cache writebacks. In addition to prioritizing the caching of data with high reuse, we prioritize the caching of data shared across multiple threads to enhance cache utilization. We dynamically modulate our bypass decisions, employing aggressive bypass for writebacks when shared cache contention is high, while employing conservative bypass when contention is low. Experiments across a diverse set of PARSEC workloads demonstrate that SHARP improves overall system throughput by 12% and 8% compared to the no-bypass baseline and the state-of-the-art bypass baseline, respectively. SHARP also reduces the overall cache energy consumption by 14% over the no-bypass baseline.
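The shape of such a three-factor decision can be sketched as a scoring rule. The weights and thresholds below are illustrative, not SHARP's actual policy:

```python
def should_bypass(reuse, sharers, contention, hi=0.75, lo=0.25):
    """Sketch of a sharing-, reuse-, and contention-aware writeback
    bypass decision. Returns True when the line should bypass the
    shared cache and go straight to memory."""
    # Favor keeping lines with high reuse and lines shared by many threads;
    # sharer count is saturated at 4 so one factor cannot dominate.
    score = reuse + 0.5 * min(sharers, 4) / 4.0
    # Bypass aggressively (high threshold) when contention is high,
    # conservatively (low threshold) when the shared cache has headroom.
    threshold = hi if contention > 0.5 else lo
    return score < threshold
```

A private, never-reused line bypasses under heavy contention, while a widely shared or frequently reused line stays cached even then, which is precisely the case where naive bypassing would inflate main memory traffic.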
- Research Article
- 10.1145/3762654
- Sep 26, 2025
- ACM Transactions on Embedded Computing Systems
- Dean You + 10 more
Runahead execution is a technique to mask memory latency caused by irregular memory accesses. By pre-executing the application code during occurrences of long-latency operations and prefetching anticipated cache-missed data into the cache hierarchy, runahead effectively masks memory latency for subsequent cache misses and achieves high prefetching accuracy; however, this technique has been limited to superscalar out-of-order and superscalar in-order cores. For implementation in scalar in-order cores, the challenges of area/energy constraints and severe cache contention remain. Here, we build the first full-stack system featuring runahead, MERE, spanning the SoC and a dedicated ISA to the OS and programming model. Through this deployment, we show that enabling runahead in scalar in-order cores is possible, with minimal area and power overheads, while still achieving high performance. By reconstructing sequential runahead with a hardware/software co-design approach, the system can be implemented on a mature processor and SoC. Building on this, an adaptive runahead mechanism is proposed to mitigate the severe cache contention in scalar in-order cores. Together, these form a comprehensive solution for embedded processors managing irregular workloads. Our evaluation demonstrates that the proposed MERE attains 93.5% of a 2-wide out-of-order core’s performance while constraining area and power overheads below 5%, with the adaptive runahead mechanism delivering an additional 20.1% performance gain by mitigating the severe cache contention issues.
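The benefit of runahead can be made concrete with a toy timing model: a scalar in-order core stalls for a fixed miss penalty, and with runahead enabled it spends that stall pre-executing ahead in the access stream and prefetching the addresses it finds. Capacity limits and the cache contention that MERE's adaptive mechanism targets are ignored here; the costs and runahead depth are hypothetical:

```python
def run_trace(trace, runahead=False, miss_cost=100, depth=4):
    """Toy cycle count for an access trace on a scalar in-order core.
    Hits cost 1 cycle; misses cost `miss_cost`. With runahead, the
    stall on each miss is used to prefetch up to `depth` upcoming
    addresses, converting their later misses into hits."""
    cache, cycles = set(), 0
    for i, addr in enumerate(trace):
        if addr in cache:
            cycles += 1
        else:
            cycles += miss_cost
            cache.add(addr)
            if runahead:
                # Pre-execute during the stall: future addresses are
                # visible in the trace, so prefetch the next `depth`.
                cache.update(trace[i + 1 : i + 1 + depth])
    return cycles
```

On a cold streaming trace of five distinct addresses, the baseline pays five full miss penalties, while runahead pays one and turns the remaining four accesses into hits.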