  • Research Article
  • 10.1145/3766894
SATGuard: SAT-driven Countermeasures for Protecting Approximate Circuits from Hardware Trojan
  • Oct 9, 2025
  • ACM Transactions on Embedded Computing Systems
  • Vishesh Mishra + 3 more

Approximate arithmetic circuits have gained prominence in modern computing systems due to their ability to trade accuracy for improved performance and energy efficiency. However, their susceptibility to stealthy Trojan attacks poses a significant security concern. This work analyzes Trojan attacks on approximate circuits, focusing specifically on approximate adders and multipliers. We propose SATGuard, a Boolean satisfiability (SAT)-based methodology to identify Trojan-activating inputs (TAIs) for all approximate adder and multiplier families. We also claim that TAIs for approximate circuits are analogous to test input patterns for accurate circuits. Subsequently, we propose design-specific countermeasures to safeguard approximate circuits. The proposed countermeasures nullify the Hardware Trojan Horse (HTH)-based accuracy degradation, thus upholding the application-level accuracy requirements. We conduct experiments where potential Trojans are implanted into various approximate adders and multipliers. We evaluate their impact on the error metrics and the quality of results in real-world applications such as image processing and deep neural networks (DNNs). Our findings demonstrate that the proposed methodology successfully reverses the HTH-based accuracy degradation by 99.4% and 99.8% in approximate adders and multipliers, respectively. This improvement is achieved with average area and power-delay-product overheads of 5.3% and 7.6% in approximate adders and 1.7% and 1.9% in multipliers, respectively.

  • Research Article
  • 10.1145/3768154
Boosting Cryptographic ICs’ Side-Channel Resistance: A Formal Framework for Automatic Identification and Protection of Leaky Paths
  • Oct 8, 2025
  • ACM Transactions on Embedded Computing Systems
  • Qizhi Zhang + 5 more

Side-channel analysis (SCA) attacks pose a significant threat to cryptographic integrated circuits (ICs). While designers have endeavored to introduce various countermeasures during the IC development phase, many of these solutions incur substantial overheads in terms of area, power, and performance. Additionally, they often necessitate a full-custom circuit design for effective deployment. This issue arises due to the absence of systematic methodologies and analytical tools for circuit designers to accurately identify the sources of side-channel leakage within the hardware design. In this article, we propose the concept of side-channel tracking logic and, building upon this foundation, introduce a novel framework that seamlessly integrates with commercial design flows to automatically identify and safeguard leaky paths. Our approach begins by pinpointing partial logic cells that exhibit the highest information leakage using dynamic correlation analysis. Subsequently, formal-based leakage property checking constructs comprehensive leaky paths centered on these cells. In this process, side-channel tracking logic is applied for the first time to trace and extract side-channel leakage paths. Based on this, we design an automated formal modeling and leakage property verification tool. Once these paths are discerned, we deploy apt hardware countermeasures, encompassing Boolean masking and random precharge, to eradicate information leakage along these routes. This framework has been experimentally validated across different encryption circuits, and the efficacy of our methodology is corroborated through both simulated and real-world measurements on FPGA implementations. Empirical results showcase an enhancement of over 1000× in side-channel resistance, incurring a modest overhead of less than 6.53% across power, area, and performance metrics.

  • Research Article
  • 10.1145/3763793
VoxDepth: Rectification of Depth Images on Edge Devices
  • Oct 8, 2025
  • ACM Transactions on Embedded Computing Systems
  • Yashashwee Chakrabarty + 2 more

Autonomous mobile robots like self-flying drones and industrial robots heavily depend on depth images to perform tasks such as 3D reconstruction and visual SLAM. However, the presence of inaccuracies in these depth images can greatly hinder the effectiveness of these applications, resulting in sub-optimal results. Depth images produced by commercially available cameras frequently exhibit noise, which manifests as flickering pixels and erroneous patches. Machine Learning (ML)-based methods to rectify these images are unsuitable for edge devices that have very limited computational resources. Non-ML methods are much faster but have limited accuracy, especially for correcting errors that are a result of occlusion and camera movement. We propose a scheme called VoxDepth that is fast, accurate, and runs very well on edge devices such as the NVIDIA Jetson Nano board. It relies on a host of novel techniques: 3D point cloud construction and fusion, and using it to create a 2D template to fix erroneous depth images. VoxDepth shows superior results on both synthetic and real-world datasets. We specifically demonstrate a 31% improvement in quality as compared with state-of-the-art methods on real-world depth datasets, while maintaining a competitive frame rate of 27 FPS (frames per second).

  • Research Article
  • 10.1145/3769679
An Extensible Thread Throttling Method for Multiple OpenMP Parallel Programs
  • Sep 30, 2025
  • ACM Transactions on Embedded Computing Systems
  • Xiaoxuan Luo + 5 more

OpenMP is one of the most popular parallel frameworks in the HPC area. Many researchers have proposed OpenMP thread throttling techniques for searching the optimal configuration of parallelism to improve computational efficiency. However, existing research mainly focuses on the optimal solution and ignores the average performance of the program during the search process. In addition, there are various types of workloads in HPC production environments. The OpenMP configuration needs to be adjusted according to the real-time running status of programs. Otherwise, it may lead to a deviation of the actual improvement in the real-time environment from the theory. In this paper, we propose an OpenMP thread throttling method. The method uses the search results of historical workloads to train the performance vertex prediction model, quickly identifies the approximate range of the optimal number of threads for unknown workloads, and searches in a small range with a neighborhood-sampling-based bidirectional hill-climbing search algorithm. The method improves real-time optimization efficiency in HPC systems with multiple unknown loads. Through experiments, we demonstrate the advantages of our method compared to a variety of commonly used thread throttling methods. With minor differences in the optimal solutions, the average performance and convergence speed of our method during the search can be improved by up to 10.6% and 22.7% compared to the best method.
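The search step described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the performance vertex prediction model is replaced by a plain `start` argument, the neighborhood consists of just the two adjacent thread counts, and `measure`, `min_threads`, `max_threads`, and `step` are hypothetical names chosen for illustration. The sketch assumes `measure(n)` returns a runtime for `n` threads, where lower is better.

```python
from typing import Callable, Dict


def bidirectional_hill_climb(
    measure: Callable[[int], float],
    start: int,
    min_threads: int = 1,
    max_threads: int = 64,
    step: int = 1,
) -> int:
    """Return a locally optimal thread count near `start` (lower cost is better).

    `start` stands in for the output of a prediction model (an assumption);
    the search then probes both directions around the current candidate.
    """
    cache: Dict[int, float] = {}

    def cost(n: int) -> float:
        # Measure each thread count at most once; real measurements are expensive.
        if n not in cache:
            cache[n] = measure(n)
        return cache[n]

    # Clamp the predicted starting point into the allowed range.
    current = max(min_threads, min(max_threads, start))
    while True:
        # Sample the neighbours on both sides of the current candidate.
        neighbours = [
            n
            for n in (current - step, current + step)
            if min_threads <= n <= max_threads
        ]
        best = min(neighbours, key=cost, default=current)
        if cost(best) >= cost(current):
            # Neither direction improves: current is a local optimum.
            return current
        current = best


if __name__ == "__main__":
    # Synthetic, convex runtime curve with its minimum at 12 threads (made-up data).
    synthetic = lambda n: (n - 12) ** 2 + 5.0
    print(bidirectional_hill_climb(synthetic, start=9))
```

A better starting point shortens the walk, which is why a prediction model that narrows the range before searching can improve average performance during the search as well as convergence speed.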

  • Research Article
  • 10.1145/3762650
Developing Deadlock-Free Routing Algorithms in Torus NoC: A Formal Approach
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Surajit Das + 2 more

Torus is a symmetric Network-on-Chip (NoC) topology with uniform node degree providing very high path diversity between a pair of source and destination. Moreover, the Wraparound Channels (WCs) in the torus can significantly reduce the hop count, thereby reducing overall communication latency. However, the WCs also create cyclic paths that may lead to a NoC deadlock. As a consequence, very few deadlock-free routing algorithms for torus-based NoC exist that do not have significant implementation overhead. Furthermore, the existing routing algorithms do not unlock the full potential of the torus-based NoC topology. In this work, we present a formal modeling-based technique for developing deadlock-free routing algorithms for torus-based NoC. This method systematically combines routing algorithms of mesh with WCs of torus to develop deadlock-free routing algorithms for torus. Using the proposed technique, we develop three novel routing algorithms and verify their deadlock-freedom using a Directional Dependency Graph (DDG). We then evaluate the proposed routing algorithms using both synthetic and real traffic patterns. The primary objective of this work is to present a technique that can generate multiple routing algorithms and not the single best routing algorithm. Hence, we do not claim that the three proposed algorithms are the best-performing ones. Nevertheless, we show that they can save hop counts by more than 10% and latency by 8% compared to the competitive methods. The performance of our algorithms is comparable even with state-of-the-art table-based routing and deadlock-recovery-based techniques.
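The link between wraparound channels and deadlock can be made concrete with a small sketch. It does not reproduce the paper's DDG construction. It only shows the general idea of checking a dependency graph between channels for cycles, since an acyclic channel dependency graph is a classical sufficient condition for deadlock freedom. The channel names (`c0` to `c3`) and the two dependency maps are hypothetical.

```python
from typing import Dict, Hashable, Iterable


def has_cycle(deps: Dict[Hashable, Iterable[Hashable]]) -> bool:
    """Return True if the directed dependency graph contains a cycle.

    `deps[a]` lists the channels a packet holding channel `a` may request next.
    Uses an iterative depth-first search with three node colours.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    colour: Dict[Hashable, int] = {}

    for root in deps:
        if colour.get(root, WHITE) != WHITE:
            continue
        colour[root] = GREY
        stack = [(root, iter(deps.get(root, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                state = colour.get(nxt, WHITE)
                if state == GREY:
                    # Back edge to a node still on the stack: cycle found.
                    return True
                if state == WHITE:
                    colour[nxt] = GREY
                    stack.append((nxt, iter(deps.get(nxt, ()))))
                    break
            else:
                # All successors explored: this node cannot reach a cycle.
                colour[node] = BLACK
                stack.pop()
    return False


if __name__ == "__main__":
    # Four channels of a ring; c3 is the wraparound channel (hypothetical).
    unrestricted = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
    # Same ring, but routing forbids continuing from c3 onto c0.
    restricted = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": []}
    print(has_cycle(unrestricted), has_cycle(restricted))
```

In the unrestricted map the wraparound channel closes the loop, so the check reports a cycle. Removing that single dependency breaks it, which is the kind of restriction a deadlock-free routing algorithm has to impose.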

  • Research Article
  • 10.1145/3762655
THERMOS: Thermally-Aware Multi-Objective Scheduling of AI Workloads on Heterogeneous Multi-Chiplet PIM Architectures
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Alish Kanani + 5 more

Chiplet-based integration enables large-scale systems that combine diverse technologies, enabling higher yield, lower costs, and scalability, making them well-suited to AI workloads. Processing-in-Memory (PIM) has emerged as a promising solution for AI inference, leveraging technologies such as ReRAM, SRAM, and FeFET, each offering unique advantages and tradeoffs. A heterogeneous chiplet-based PIM architecture can harness the complementary strengths of these technologies to enable higher performance and energy efficiency. However, scheduling AI workloads across such a heterogeneous system is challenging due to competing performance objectives, dynamic workload characteristics, and power and thermal constraints. To address this need, we propose THERMOS, a thermally-aware, multi-objective scheduling framework for AI workloads on heterogeneous multi-chiplet PIM architectures. THERMOS trains a single multi-objective reinforcement learning (MORL) policy that is capable of achieving Pareto-optimal execution time, energy, or a balanced objective at runtime, depending on the target preferences. Comprehensive evaluations show that THERMOS achieves up to 89% faster average execution time and 57% lower average energy consumption than baseline AI workload scheduling algorithms with only 0.14% runtime and 0.022% energy overhead.

  • Research Article
  • 10.1145/3762188
A Discrete Partial Charging Enabled Dynamic Programming Strategy for Optimal Fixed-Route Electric Vehicle Charging
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Dipankar Mandal + 2 more

The rapid adoption of Electric Vehicles (EVs), driven by stringent environmental regulations and rising fuel costs, is reshaping the landscape of Vehicle Routing Problems (VRP). This shift has led to the Electric Vehicle Routing Problem (EVRP), which incorporates EV-specific operational constraints such as limited driving range, energy consumption, recharging strategies, and detour-related charging costs. The challenge becomes even more critical in modern mixed fleets, where Electric and Internal Combustion Engine Vehicles (ICEVs) coexist and must be co-routed efficiently. A widely adopted two-step strategy first uses Capacitated VRP (CVRP) algorithms to generate energy-oblivious routes, then makes EV routes energy-feasible via charging station insertion. While VRP and CVRP are extensively studied, methods for efficiently ensuring energy feasibility for EVs on fixed routes remain limited. This article introduces the Fixed Route Vehicle Charging Problem with Discrete Partial Charging (FRVCP-DPC), extending FRVCP by allowing partial recharging up to predefined discrete levels. We develop a scalable optimal Dynamic Programming algorithm, Best Energy Feasible Route Generator (BEFRG), to select detour points, charging stations, and charge levels that minimize total route time while maintaining energy feasibility. To evaluate BEFRG in dynamic traffic conditions, we introduce EFRGen, a traffic-aware EVRP simulator built on Simulation of Urban Mobility (SUMO) and OpenStreetMap (OSM). Experiments on the Montoya benchmark (120 instances with up to 320 demand points and 38 charging stations) show that BEFRG computes optimal solutions for all cases within one minute.

  • Research Article
  • 10.1145/3760386
Re-thinking Memory-Bound Limitations in CGRAs
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Xiangfeng Liu + 6 more

Coarse-Grained Reconfigurable Arrays (CGRAs) are specialized accelerators commonly employed to boost performance in workloads with iterative structures. Existing research typically focuses on compiler or architecture optimizations aimed at improving CGRA performance, energy efficiency, flexibility, and area utilization, under the idealistic assumption that kernels can access all data from Scratchpad Memory (SPM). However, certain complex workloads, particularly in fields like graph analytics, irregular database operations, and specialized forms of high-performance computing (e.g., unstructured mesh simulations), exhibit irregular memory access patterns that hinder CGRA utilization, sometimes dropping below 1.5%, making the CGRA memory-bound. To address this challenge, we conduct a thorough analysis of the underlying causes of performance degradation, then propose a redesigned memory subsystem and refine the memory model. With both microarchitectural and theoretical optimization, our solution can effectively manage irregular memory accesses through a CGRA-specific runahead execution mechanism and cache reconfiguration techniques. Our results demonstrate that we can achieve performance comparable to the original SPM-only system while requiring only 1.27% of the storage size. The runahead execution mechanism achieves an average 3.04× speedup (up to 6.91×), with the cache reconfiguration technique providing an additional 6.02% improvement, significantly enhancing CGRA performance for irregular memory access patterns.

  • Research Article
  • 10.1145/3762994
Efficient Video Redaction at the Edge: Human Motion Tracking for Privacy Protection
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Haotian Qiao + 3 more

Computationally efficient, camera-based, real-time human position tracking on low-end edge devices would enable numerous applications, including privacy-preserving video redaction and analysis. Unfortunately, running most deep neural network based models in real time requires expensive hardware, making widespread deployment difficult, particularly on edge devices. Shifting inference to the cloud increases the attack surface, generally requiring that users trust cloud servers, and increases demands on wireless networks in deployment venues. Our goal is to determine the extreme to which edge video redaction efficiency can be taken, with a particular interest in enabling, for the first time, low-cost, real-time deployments with inexpensive commodity hardware. We present an efficient solution to the human detection (and redaction) problem based on singular value decomposition (SVD) background removal and describe a novel time-efficient and energy-efficient sensor-fusion algorithm that leverages human position information in real-world coordinates to enable real-time visual human detection and tracking at the edge. These ideas are evaluated using a prototype built from (resource-constrained) commodity hardware representative of commonly used low-cost IoT edge devices. The speed and accuracy of the system are evaluated via a deployment study, and it is compared with the most advanced relevant alternatives. The multi-modal system operates at a frame rate ranging from 20 FPS to 60 FPS, achieves a wIoU 0.3 score (see Section 5.4) ranging from 0.71 to 0.79, and successfully performs complete redaction of privacy-sensitive pixels with a success rate of 91%–99% in human head regions and 77%–91% in upper body regions, depending on the number of individuals present in the field of view. These results demonstrate that it is possible to achieve adequate efficiency to enable real-time redaction on inexpensive, commodity edge hardware.

  • Research Article
  • 10.1145/3762648
A Load-Balanced Collaborative Repair Algorithm for Single-Disk Failures in Erasure Coded Storage Systems
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Zhijie Huang + 6 more

In large-scale cloud data centers and distributed storage systems, erasure coding is usually employed to enhance data availability and storage efficiency. However, with the explosive growth of data volume and the continuous expansion of storage system scale, traditional erasure coding techniques face significant challenges in handling single-disk failures. These challenges are primarily reflected in low data recovery efficiency and imbalanced system load distribution, which ultimately result in excessive I/O load and network bandwidth consumption, severely limiting the overall performance of the system. To address these issues, this article proposes a load-balanced data repair algorithm for single-disk failures in erasure-coded storage systems, called MNCR (Multi-Node Cooperative Repair). This algorithm improves data recovery efficiency in single-disk failure scenarios by minimizing data reading and inter-disk data transmission, using a cooperative repair strategy among disks. In addition, the algorithm incorporates a dynamic load balancing mechanism, which effectively resolves the issue of imbalanced data load distribution among disks during the repair process, thus avoiding performance bottlenecks caused by overloaded disks. Experimental results show that the MNCR algorithm significantly outperforms traditional methods in terms of repair efficiency and load balancing, providing an effective solution for single-disk failure recovery in erasure-coding-based large-scale storage systems.