  • New
  • Research Article
  • 10.1145/3793677
SHiELD: Functional Obfuscation of DSP Cores Using HLS Based One-Way Random Function and Reconfigurable Composite Switching Obfuscation Cells
  • Jan 24, 2026
  • ACM Transactions on Embedded Computing Systems
  • Anirban Sengupta + 2 more

Successful reverse engineering (RE) of digital signal processing (DSP) integrated circuits (ICs) gives an attacker the opportunity to pirate the DSP-based intellectual property (IP) and insert malicious logic. It is thus essential to devise low-cost, sturdy functional obfuscation techniques for DSP cores that hinder RE attempts (or increase the attacker's effort manifold). Little effort has been made to devise a robust, low-cost and low-power functional obfuscation methodology based on high-level synthesis (HLS). This paper presents 'SHiELD', a novel Secure High-Level synthesis based functional obfuscation methodology for Enhanced security of DSP cores, driven by an HLS-based one-way random (OWR) function and reconfigurable composite switching obfuscation (CSO) cells and integrated with a design space exploration process. The proposed approach offers security against several relevant attacks and, overall, effectively thwarts RE attempts with the aid of the proposed multi-key-bit CSO cells and custom OWR function. Compared with prior approaches, the proposed approach yields several orders of magnitude higher security (robust obfuscation strength and lower probability of key retrieval) up to ∼10^154 (for the FIR-2 benchmark, calculated using eqn. (1)), ∼10.6% lower power, and a 0.91% reduction in design cost.
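The key-controlled switching idea behind such cells can be illustrated in a few lines. The sketch below is a generic illustration, not the paper's CSO design: the correct 3-bit key value (here, hypothetically, 0b101) and the decoy operations are invented for the example.

```python
# Generic sketch of a multi-key-bit switching obfuscation cell (illustrative
# only -- not the paper's CSO design). Key 0b101 and the decoys are hypothetical.
def obfuscation_cell(a, b, key_bits):
    """Route (a, b) through the true operation only for the correct key."""
    ops = {
        0b101: lambda x, y: x + y,   # true operation, unlocked by the key
        0b010: lambda x, y: x - y,   # decoy datapaths selected by wrong keys
        0b110: lambda x, y: x * y,
    }
    return ops.get(key_bits, lambda x, y: x ^ y)(a, b)
```

With the right key the cell behaves as an adder; any other key corrupts the output. Each 3-key-bit cell forces a brute-force attacker to try 8 combinations, and chaining many such cells multiplies the key space, which is where the very large obfuscation-strength figures come from.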

  • New
  • Research Article
  • 10.1145/3793672
CADAS: Communication-Aware Dynamic Scheduler on CGRAs for Large-Volume and Real-Time Processing
  • Jan 24, 2026
  • ACM Transactions on Embedded Computing Systems
  • Jiahao Lin + 4 more

Modern data-intensive applications demand accelerators that can adapt to dynamic and high-throughput workloads. Coarse-Grained Reconfigurable Arrays (CGRAs) have emerged as promising candidates for such workloads due to their spatial architecture and run-time reconfigurability. However, ad-hoc hardware configurations and traditional static compilation techniques struggle to cope with the run-time irregularity and control-flow dynamism. This paper first presents a systematic design space exploration (DSE) to identify the optimized hardware configurations tailored to application-specific constraints, such as area budget, throughput requirement, and throughput efficiency. Then, it proposes a communication-aware dynamic scheduling approach built on a hardware/software co-design that combines preloading and scoreboard mechanisms to minimize reconfiguration overhead while maximizing interconnect bandwidth utilization. Evaluated on the optimized configurations and the respective spectrum sensing benchmarks, the proposed scheduling method achieves up to 1.6× performance improvement over a baseline and 1.3× over an adapted state-of-the-art (SOTA) dynamic scheduling strategy.

  • New
  • Research Article
  • 10.1145/3779218
A Design of Network Reconfigurable Universal CNN Accelerator Based on FPGA
  • Jan 23, 2026
  • ACM Transactions on Embedded Computing Systems
  • Wenhua Ye + 4 more

Convolutional Neural Networks (CNNs) have achieved great success in various fields of machine vision, such as image classification and recognition, image segmentation, and video analysis. In specific applications, it is often necessary to customize the network structure by altering image size, convolution kernel size, pooling size, network architecture, and the number of network layers. These customizations pose significant challenges to the architecture of a CNN accelerator, especially in real-time systems, and as the number of these network variables increases, the internal convolution calculations become extremely demanding. The inference speed and energy consumption of CNNs are becoming more and more important, which requires the accelerator design to adapt to different network architectures and run efficiently. Field-programmable gate arrays (FPGAs) are an ideal choice for a CNN accelerator due to their high programmability and low power consumption. To address the aforementioned challenges, we present NRUCA, a novel network reconfigurable universal CNN accelerator based on FPGA. On one hand, we have designed a flexible architecture that can dynamically configure a CNN’s network structure parameters. These parameters can be sent to the FPGA via a configuration file, enabling the same design to run different CNN networks and allowing modification on the fly. This forms the basis of the architecture’s adaptability presented in this paper. On the other hand, we employ a multi-channel parallel and efficient pipelined matrix multiplication architecture to implement the convolution and fully-connected layers, which constitute the majority of the computational load in CNNs. By utilizing an innovative inverted matrix multiplication algorithm and an interleaved cache data method, we reduce the internal cache required for multi-channel data.
Furthermore, most of the intermediate calculation data does not need to be output to DDR, significantly improving the operating efficiency of the accelerator. We also fully leverage the FPGA chip architecture to compile multiple calculation kernels, which can be flexibly scheduled and combined to cater to various application scenarios. Based on the flexible and efficient overall design architecture, the layout of the various FPGA resources is balanced, which enables the designed FPGA project to compile four acceleration kernels with running clocks up to 300 MHz in Xilinx Vitis and to test them on a Xilinx Alveo U250. We verified the adaptability and acceleration capabilities of our FPGA project using the LeNet, AlexNet, VGG11, VGG13, VGG16, and VGG19 networks. Experiments demonstrated that our FPGA architecture achieves 33X and 35X energy savings compared to the Intel Xeon 5220R CPU, and 1.08X and 1.05X energy savings compared to the Nvidia Tesla P100 GPU, when accelerating AlexNet and VGG16, respectively.
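The "convolution and fully-connected layers as matrix multiplication" lowering that such accelerators exploit can be sketched in NumPy. This is the generic im2col formulation, not NRUCA's inverted matrix multiplication algorithm:

```python
import numpy as np

def im2col_conv(image, kernel):
    """2-D convolution (valid padding, stride 1) expressed as one matrix
    multiply -- the standard lowering used by matmul-based accelerators."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    # Gather each receptive field into one row -> an (oh*ow, kh*kw) matrix.
    cols = np.array([image[i:i + kh, j:j + kw].ravel()
                     for i in range(oh) for j in range(ow)])
    # One matrix-vector product replaces the whole sliding-window loop nest.
    return (cols @ kernel.ravel()).reshape(oh, ow)
```

Because the whole layer collapses into a single large matrix product, one well-optimized pipelined multiplier array can serve every convolution and fully-connected layer regardless of kernel size, which is what makes the parameters runtime-configurable.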

  • Research Article
  • 10.1145/3788870
Sometimes Painful but Promising: Feasibility and Trade-Offs of On-Device Language Model Inference
  • Jan 12, 2026
  • ACM Transactions on Embedded Computing Systems
  • Maximilian Abstreiter + 2 more

The rapid rise of Language Models (LMs) has expanded the capabilities of natural language processing, powering applications from text generation to complex decision-making. While state-of-the-art LMs often boast hundreds of billions of parameters and are primarily deployed in data centers, recent trends show a growing focus on compact models—typically under 10 billion parameters—enabled by quantization and other model compression techniques. This shift paves the way for LMs on edge devices, offering potential benefits such as enhanced privacy, reduced latency, and improved data sovereignty. However, the inherent complexity of even these smaller models, combined with the limited computing resources of edge hardware, raises critical questions about the practical trade-offs of executing LM inference outside the cloud. To address these challenges, we present a comprehensive evaluation of generative LM inference on representative CPU-based and GPU-accelerated edge devices. Our study measures key performance indicators—including memory usage, inference speed, and energy consumption—across various device configurations. Additionally, we examine throughput-energy trade-offs, cost considerations, and usability, alongside an assessment of qualitative model performance. While quantization helps mitigate memory overhead, it does not fully eliminate resource bottlenecks, especially for larger models. Our findings quantify the memory and energy constraints that must be considered for practical real-world deployments, offering concrete insights into the trade-offs between model size, inference performance, and efficiency. The exploration of LMs at the edge is still in its early stages. We hope this study provides a foundation for future research, guiding the refinement of models, the enhancement of inference efficiency, and the advancement of edge-centric AI systems.
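A back-of-the-envelope sketch shows why quantization matters for edge memory budgets. This counts weight storage only (KV cache and activations, which the study also measures, are ignored), and the 7B model size is a hypothetical example:

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate weight-only memory footprint of a model, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# A hypothetical 7B-parameter model at two quantization levels:
fp16 = weight_memory_gib(7e9, 16)   # ~13.0 GiB -- beyond most edge boards
int4 = weight_memory_gib(7e9, 4)    # ~3.3 GiB  -- fits an 8 GiB device
```

The 4x reduction from 16-bit to 4-bit weights is what moves a "compact" model from impossible to feasible on edge hardware, yet as the abstract notes, the remaining footprint still dominates many devices' RAM.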

  • Research Article
  • 10.1145/3786342
Global Scheduling of Weakly-Hard Real-Time Tasks using Job-Level Priority Classes
  • Jan 7, 2026
  • ACM Transactions on Embedded Computing Systems
  • Victor Gabriel Moyano + 3 more

Real-time systems are intrinsic components of many pivotal applications, such as self-driving vehicles and aerospace and defense systems. The trend in these applications is to consolidate multiple tasks onto fewer, more powerful hardware platforms, e.g., multi-core systems, mainly to reduce cost and power consumption. Many real-time tasks, like control tasks, can tolerate occasional deadline misses thanks to robust algorithms. These tasks can be modeled using the weakly-hard model. The literature shows that leveraging the weakly-hard model can relax the over-provisioning associated with designed real-time systems. However, a wide range of the research focuses on single-core platforms. Therefore, we strive to extend the state of the art of scheduling weakly-hard real-time tasks to multi-core platforms. We present a global job-level fixed-priority scheduling algorithm together with its schedulability analysis. The scheduling algorithm leverages the tolerable continuous deadline misses to assign priorities to jobs. The proposed analysis extends the Response Time Analysis (RTA) for global scheduling to test the schedulability of tasks. Hence, our analysis scales with the number of tasks and the number of cores because, unlike the literature, it depends neither on Integer Linear Programming nor on reachability trees. Schedulability analyses show that the schedulability ratio is improved by 40% compared to global Rate Monotonic (RM) scheduling and by up to 60% over global EDF scheduling, which are the state-of-the-art schedulers on the RTEMS real-time operating system. Our evaluation on an industrial embedded multi-core platform running RTEMS shows that the scheduling overhead of our proposal does not exceed 60 nanoseconds.
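The weakly-hard model referenced above is usually written as an (m, k) constraint: at most m deadline misses in any k consecutive jobs. A minimal checker for the textbook definition (not the paper's schedulability analysis) looks like:

```python
from collections import deque

def satisfies_weakly_hard(miss_history, m, k):
    """Check the (m, k) weakly-hard constraint: in every window of k
    consecutive jobs, at most m deadlines are missed.
    miss_history: iterable of booleans, True = deadline missed."""
    window = deque(maxlen=k)          # sliding window over the last k jobs
    for missed in miss_history:
        window.append(missed)
        if sum(window) > m:           # too many misses in this window
            return False
    return True
```

A scheduler that knows each task's (m, k) bound can deprioritize jobs whose recent windows still have slack, which is the intuition behind assigning priorities at the job level rather than the task level.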

  • Research Article
  • 10.1145/3777551
Cost-Effective Optimization and Implementation of the CRT-Paillier Decryption Algorithm for Enhanced Performance
  • Dec 8, 2025
  • ACM Transactions on Embedded Computing Systems
  • Z.w Huang + 4 more

To address the information leakage problem in cloud computing, privacy protection techniques are receiving widespread attention. Among them, the Paillier homomorphic algorithm is an effective one, since it allows addition and scalar multiplication operations while information remains in an encrypted state. However, its computational efficiency is limited by the complex modulo operations caused by ciphertext expansion after encryption. To accelerate its decryption, the Chinese Remainder Theorem (CRT) is often used to optimize these modulo operations, which in turn makes the decryption chain undesirably long. To address this issue, we propose an eCRT-Paillier decryption algorithm that shortens the decryption computation chain by combining precomputed parameters and eliminating the extra judgment operations introduced by Montgomery modular multiplications. These two improvements reduce modular multiplications by 50% and judgment operations by 60% in the postprocessing of the CRT-Paillier decryption algorithm. Building on these improvements, we propose a highly parallel full-pipeline architecture that removes stalls caused by multiplier reuse in traditional modular exponentiation operations. This architecture also adopts optimization methods such as simplifying the modular exponentiation units by dividing the exponent into segments and parallelizing the data flow by multi-core instantiation. Finally, a high-throughput and efficient Paillier accelerator named MESA is implemented on the Xilinx Virtex-7 FPGA for evaluation. Experimental results show that it can complete a decryption within 0.577 ms under a 100 MHz clock when using a 2048-bit key. Compared with previous works under identical conditions, MESA achieves a 1.16× to 313.21× increase in throughput, as well as a 2.59% to 96.04% improvement in Area-Time Product (ATP).
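For reference, plain textbook CRT-Paillier decryption — the baseline whose chain the eCRT variant shortens — can be sketched with toy primes. The primes and generator choice below are illustrative (real keys use 1024-bit p and q):

```python
# Textbook CRT-Paillier decryption with toy primes (illustrative baseline,
# not the paper's optimized eCRT variant or its Montgomery arithmetic).
def L(u, d):
    return (u - 1) // d              # Paillier's L function

p, q = 293, 433                      # toy primes; real keys are 1024-bit
n = p * q
g = n + 1                            # common generator choice
lam_p, lam_q = p - 1, q - 1

# Precomputed CRT constants (this precomputation is what CRT variants exploit)
hp = pow(L(pow(g, lam_p, p * p), p), -1, p)
hq = pow(L(pow(g, lam_q, q * q), q), -1, q)
q_inv = pow(q, -1, p)

def encrypt(m, r):
    """Standard Paillier encryption; r must be coprime to n."""
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt_crt(c):
    """Decrypt with two half-size exponentiations mod p^2 and q^2,
    then recombine the halves with the CRT (Garner's formula)."""
    mp = (L(pow(c, lam_p, p * p), p) * hp) % p
    mq = (L(pow(c, lam_q, q * q), q) * hq) % q
    return (mq + q * (((mp - mq) * q_inv) % p)) % n
```

The CRT splits one 2048-bit exponentiation into two independent ~1024-bit ones, which is the parallelism the full-pipeline architecture exploits; the additive homomorphism (multiplying ciphertexts adds plaintexts) is what makes the scheme useful for cloud privacy.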

  • Research Article
  • 10.1145/3779131
Layer-Reused Collaborative Scheduling for Container-Dependent Tasks in Industrial Real-Time Systems
  • Dec 1, 2025
  • ACM Transactions on Embedded Computing Systems
  • Haotong Zhang + 5 more

Renowned for their light weight and portability, containers are increasingly deployed in edge computing to deliver low-latency and privacy-preserving computing services for industrial real-time systems. While layer reuse can potentially improve the execution efficiency of container-dependent tasks, existing work suffers from two issues: 1) inefficient disk space utilization, which compromises image pull efficiency and reuse utility, and 2) inadequate consideration of resource capacity, leading to practical limitations. This paper addresses the container-dependent task scheduling problem in edge computing-assisted industrial real-time systems by introducing a novel Layer Reuse-based Collaborative Scheduling (LRCS) framework. First, to ensure practical applicability, we comprehensively consider resource capacities and the cost terms of the task completion time. Second, a framework that collaborates scheduling and execution nodes in a weakly coupled manner is proposed. At the execution nodes, LRCS maximizes disk utilization by evicting layers only when disk space is insufficient. The eviction is modeled as a 0-1 knapsack problem, where dynamic layer value assessment enhances the reuse rate. At the scheduling node, LRCS formulates container-dependent task scheduling as a multidimensional online bin-packing problem. A value-based learning algorithm optimizes long-term processing efficiency by adjusting the scheduling order of online tasks. Experimental validation on a testbed using real device data and comparisons with state-of-the-art layer reuse-based algorithms demonstrate that the proposed algorithm outperforms all baselines.
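The knapsack formulation of eviction can be illustrated with a small sketch. This is a generic 0-1 knapsack over image layers with static values; LRCS's dynamic layer-value assessment is not modeled here:

```python
def choose_layers_to_keep(layers, capacity):
    """0-1 knapsack over container image layers: keep the most valuable set
    that fits in `capacity`; everything else is evicted.
    layers: list of (name, size, value) with integer sizes."""
    # dp[c] = (best total kept value, set of kept layer names) at capacity c
    dp = [(0, frozenset())] * (capacity + 1)
    for name, size, value in layers:
        # Iterate capacity downward so each layer is used at most once.
        for c in range(capacity, size - 1, -1):
            cand = dp[c - size][0] + value
            if cand > dp[c][0]:
                dp[c] = (cand, dp[c - size][1] | {name})
    keep = dp[capacity][1]
    return keep, [n for n, _, _ in layers if n not in keep]
```

Here "value" stands in for how much future pull time a cached layer is expected to save; evicting the complement of the optimal kept set frees exactly the space needed while sacrificing the least reuse benefit.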

  • Research Article
  • 10.1145/3778862
Directed Acyclic Graph Topology Generators: A Survey
  • Nov 28, 2025
  • ACM Transactions on Embedded Computing Systems
  • Yinjie Fang + 2 more

Directed Acyclic Graphs (DAGs) are essential for modelling task dependencies across various domains, including industrial automation, autonomous systems, and many-core processors. DAG generators are widely used to evaluate the performance of scheduling algorithms. This survey systematically analyzes existing DAG generation algorithms, evaluating their search space, efficiency, and uniformity. We categorize existing DAG generators into three primary methodologies: Triangular Matrix-Based (TMB), Layer-by-Layer (LBL), and Poset-Based (POB) methods. Furthermore, we introduce a dedicated testing tool that quantitatively assesses various DAG generators’ performance. Experimental results show each DAG generator’s strengths and limitations, offering insights into their applicability in real-world scheduling and resource allocation problems. These findings provide a foundation for selecting suitable DAG generation methods for benchmarking scheduling heuristics and improving DAG task modelling.
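The Layer-by-Layer family surveyed above can be illustrated with a minimal generator. This is a generic sketch of the LBL idea, not any particular surveyed tool:

```python
import random

def layer_by_layer_dag(layer_sizes, edge_prob, seed=0):
    """Minimal Layer-by-Layer (LBL) DAG generator: nodes are grouped into
    layers and edges only run from an earlier layer to a later one, which
    guarantees acyclicity by construction."""
    rng = random.Random(seed)
    layers, nid = [], 0
    for size in layer_sizes:
        layers.append(list(range(nid, nid + size)))
        nid += size
    edges = []
    for i, src_layer in enumerate(layers[:-1]):
        for dst_layer in layers[i + 1:]:       # only forward, never backward
            for u in src_layer:
                for v in dst_layer:
                    if rng.random() < edge_prob:
                        edges.append((u, v))
    return nid, edges
```

The construction makes acyclicity trivial to guarantee, but it also restricts the search space: a DAG whose longest path does not match the chosen layering can never be produced, which is exactly the kind of uniformity limitation the survey's testing tool quantifies.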

  • Research Article
  • 10.1145/3777373
Coinf: QoS-aware DRL-based Inference Task Scheduling Framework with Batching Processing
  • Nov 17, 2025
  • ACM Transactions on Embedded Computing Systems
  • Guanglin Zhang + 3 more

The emergence of deploying deep neural network (DNN) services on edge servers has spurred research into efficiently provisioning inference services. However, previous studies have neglected the implications of different DNN types and varying quality-of-service (QoS) requirements on QoS violation rates. In this paper, we propose a novel framework, named Coinf, for scheduling heterogeneous DNN inference tasks on edge servers. Coinf has the following four advantages for effectively handling attribute analysis, performance balancing, parallel execution, and model accuracy: 1) It enables efficient profiling of domain-specific attributes of various DNN tasks during the offline stage, achieved by constructing a regression model to predict the end-to-end latency of each task. 2) By utilizing the predicted execution time, Coinf achieves a commendable balance among inference latency, system throughput, and QoS violation rate. 3) It employs deep reinforcement learning (DRL) to aggregate individual DNN tasks into batches, enabling concurrent parallel execution. 4) Coinf preserves the accuracy of the provided DNN models by not modifying them. Numerical experiments validate the reliability and efficiency of Coinf in handling heterogeneous inference tasks.
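The tension such a scheduler balances — larger batches raise throughput but risk QoS deadlines — can be shown with a greedy sketch. This is illustrative only: Coinf uses a learned DRL policy and a latency regression model, not this heuristic, and `batch_latency` stands in for that predictor.

```python
def greedy_batch(tasks, batch_latency, now=0.0):
    """Greedy sketch of deadline-aware batching: admit a task into the batch
    only while the predicted batch completion time still meets every
    member's QoS deadline.
    tasks: list of (name, deadline); batch_latency(k): predicted end-to-end
    latency of a batch of k tasks (a stand-in for a learned latency model)."""
    batch = []
    for name, deadline in sorted(tasks, key=lambda t: t[1]):  # earliest first
        k = len(batch) + 1
        finish = now + batch_latency(k)       # batch grows -> latency grows
        if all(finish <= d for _, d in batch + [(name, deadline)]):
            batch.append((name, deadline))
    return [n for n, _ in batch]
```

Each admitted task enlarges the batch and pushes the predicted finish time later, so the tightest deadline in the batch caps how much throughput batching can buy — the trade-off a learned policy can navigate more adaptively than this greedy rule.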

  • Research Article
  • 10.1145/3776744
Sayram: A Hardware-software Co-design to Accelerate Wireless Baseband Processing
  • Nov 13, 2025
  • ACM Transactions on Embedded Computing Systems
  • Xinbing Zhou + 7 more

Micro base stations, with limited antennas and extensive deployment, require scaled-down hardware. Software-defined radio solutions (e.g., CPU, many-core systems, GPU) offer flexibility but incur high area and power costs, while traditional DSPs lack efficient acceleration for smaller configurations. The key challenge for micro base stations is achieving minimal area and power overhead while meeting 5G requirements. This paper presents a hardware-software co-designed architecture, Sayram, which minimizes overhead for 5G physical layer processing. Sayram integrates an instruction fusion mechanism, along with the compiler for simplified programming, a Vector Indirect Addressing Memory (VIAM) to minimize memory access cycles, and an improved vector register design to accelerate small-scale matrix computation, thereby improving overall processor efficiency. Operating at 1 GHz, Sayram achieves 158 GOPS with a 1.18 mm² area, supporting 2T2R and 4T4R Physical Uplink Shared Channel (PUSCH) processing in single-core and dual-core modes, respectively. Evaluations show that Sayram’s area efficiency is 3× and 9× higher than traditional DSP and CGRA architectures, respectively, with power efficiency improvements of 44× and 6×. Sayram’s energy and area efficiency surpass CPU solutions by orders of magnitude.