  • Research Article
  • 10.1145/3762994
Efficient Video Redaction at the Edge: Human Motion Tracking for Privacy Protection
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Haotian Qiao + 3 more

Computationally efficient, camera-based, real-time human position tracking on low-end edge devices would enable numerous applications, including privacy-preserving video redaction and analysis. Unfortunately, running most deep neural network based models in real time requires expensive hardware, making widespread deployment difficult, particularly on edge devices. Shifting inference to the cloud increases the attack surface, generally requiring that users trust cloud servers, and increases demands on wireless networks in deployment venues. Our goal is to determine the extent to which edge video redaction efficiency can be taken, with a particular interest in enabling, for the first time, low-cost, real-time deployments with inexpensive commodity hardware. We present an efficient solution to the human detection (and redaction) problem based on singular value decomposition (SVD) background removal and describe a novel time-efficient and energy-efficient sensor-fusion algorithm that leverages human position information in real-world coordinates to enable real-time visual human detection and tracking at the edge. These ideas are evaluated using a prototype built from (resource-constrained) commodity hardware representative of commonly used low-cost IoT edge devices. The speed and accuracy of the system are evaluated via a deployment study, and it is compared with the most advanced relevant alternatives. The multi-modal system operates at a frame rate ranging from 20 FPS to 60 FPS, achieves a wIoU 0.3 score (see Section 5.4) ranging from 0.71 to 0.79, and successfully performs complete redaction of privacy-sensitive pixels with a success rate of 91%–99% in human head regions and 77%–91% in upper body regions, depending on the number of individuals present in the field of view. These results demonstrate that it is possible to achieve adequate efficiency to enable real-time redaction on inexpensive, commodity edge hardware.
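The core idea behind SVD background removal, modeling the static background as a low-rank component of the stacked frames and flagging large residuals as moving humans, can be sketched as follows. This is a minimal illustration under assumed shapes and a simple global threshold, not the authors' implementation:

```python
import numpy as np

def svd_background_removal(frames, rank=1):
    """Estimate a low-rank background from a stack of grayscale frames
    and return per-frame boolean masks of moving (foreground) pixels.

    frames: array of shape (T, H, W); rank: singular components kept
    as 'background'. Threshold choice here is illustrative only.
    """
    T, H, W = frames.shape
    X = frames.reshape(T, H * W).astype(np.float64).T  # pixels x time
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Background = dominant low-rank component(s) of the pixel-time matrix
    background = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    residual = np.abs(X - background)
    masks = residual > residual.mean() + 2.0 * residual.std()
    return masks.T.reshape(T, H, W)
```

In a redaction pipeline, the returned masks would select the pixel regions to blur or blank before the video leaves the device.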

  • Research Article
  • 10.1145/3762648
A Load-Balanced Collaborative Repair Algorithm for Single-Disk Failures in Erasure Coded Storage Systems
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Zhijie Huang + 6 more

In large-scale cloud data centers and distributed storage systems, erasure coding is usually employed to enhance data availability and storage efficiency. However, with the explosive growth of data volume and the continuous expansion of storage system scale, traditional erasure coding techniques face significant challenges in handling single-disk failures. These challenges are primarily reflected in low data recovery efficiency and imbalanced system load distribution, which ultimately result in excessive I/O load and network bandwidth consumption, severely limiting the overall performance of the system. To address these issues, this article proposes a load-balanced data repair algorithm for single-disk failures in erasure coded storage systems, called MNCR (Multi-Node Cooperative Repair). This algorithm improves data recovery efficiency in single-disk failure scenarios by minimizing data reading and inter-disk data transmission, using a cooperative repair strategy among disks. In addition, the algorithm introduces a dynamic load balancing mechanism, which effectively resolves the imbalanced distribution of data load among disks during the repair process, thus avoiding performance bottlenecks caused by overloaded disks. Experimental results show that the MNCR algorithm significantly outperforms traditional methods in terms of repair efficiency and load balancing, providing an effective solution for single-disk failure recovery in erasure-coding-based large-scale storage systems.
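MNCR itself is not described in enough detail in the abstract to reproduce, but the single-failure repair primitive it builds on is simple in XOR-based erasure coding: the lost block is the XOR of all surviving data and parity blocks. A minimal sketch with illustrative function names:

```python
from functools import reduce

def make_parity(blocks):
    """Compute one XOR parity block over equal-length data blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def repair(surviving_blocks):
    """Recover a single lost block from the survivors (data + parity).

    In XOR coding the missing block equals the XOR of everything else,
    so repair is the same fold as parity generation.
    """
    return make_parity(surviving_blocks)
```

Schemes like MNCR improve on this baseline by deciding *which* disks read and transmit which chunks, so no single disk becomes the repair bottleneck.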

  • Research Article
  • 10.1145/3762652
TimelyNet: Adaptive Neural Architecture for Autonomous Driving with Dynamic Deadline
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Jiale Chen + 4 more

To maintain driving safety, the execution of neural network-based autonomous driving pipelines must meet the dynamic deadlines in response to the changing environment and vehicle’s velocity. To this end, this article proposes a real-time neural architecture adaptation approach, called TimelyNet, which uses a supernet to replace the most compute-intensive neural network module in an existing end-to-end autonomous driving pipeline. From the supernet, TimelyNet samples subnets with varying inference latency levels to meet the dynamic deadlines during run-time driving without fine-tuning. Specifically, TimelyNet employs a one-shot prediction method that jointly uses a lookup table and an invertible neural network to periodically determine the optimal hyperparameters of a subnet to meet its execution deadline while achieving the highest possible accuracy. The lookup table stores multiple subnet architectures with different latencies, while the invertible neural network models the distribution of the optimal subnet architecture given the latency. Extensive evaluation based on hardware-in-the-loop CARLA simulations shows that TimelyNet-integrated driving pipelines achieve the best driving safety, characterized by the lowest wrong-lane driving rate and zero collisions, compared with several baselines, including the state-of-the-art driving pipelines.
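The lookup-table step, choosing at run time the most accurate subnet whose latency still meets the current deadline, can be sketched as follows. The table format and field names are assumptions for illustration, not TimelyNet's actual data structures:

```python
def pick_subnet(table, deadline_ms):
    """From a list of {'latency_ms', 'accuracy', 'arch'} entries, pick
    the most accurate subnet whose latency meets the deadline.
    Falls back to the fastest subnet if no entry is feasible."""
    feasible = [e for e in table if e["latency_ms"] <= deadline_ms]
    if not feasible:
        return min(table, key=lambda e: e["latency_ms"])  # fail-safe
    return max(feasible, key=lambda e: e["accuracy"])
```

TimelyNet replaces the exhaustive table lookup with an invertible neural network that predicts the optimal architecture directly from the latency target, which scales better than enumerating every subnet.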

  • Research Article
  • 10.1145/3760530
GNNmap: A Scalable Framework for GNN Deployment through Co-Optimized Graph Partitioning and Mapping
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Zimeng Fan + 1 more

Graph Neural Networks (GNNs) have become pivotal for analyzing relational data in embedded intelligent systems such as IoT devices. However, their deployment on resource-constrained devices faces critical barriers: traditional graph partitioning methods induce unbalanced computational loads due to rigid granularity, while hardware mapping strategies cause inefficient resource utilization under dynamic graph structures. These limitations conflict with the requirements of embedded systems for resource efficiency and scalability. To address this, we present GNNmap, a hardware-software co-design framework that synergizes multi-granular graph partitioning with topology-aware GNN mapping. The framework first reconstructs input graphs into balanced kernel groups comprising cohesive supernodes (corresponding to parallelizable subgraphs). By combining coarse-grained partitioning with fine-grained optimization, GNNmap ensures load balance while dramatically reducing cross-subgraph communication. Concurrently, a subgraph-PE mapping based on coarse-grained reconfigurable architectures (CGRAs) enables efficient graph-to-hardware matching through the joint modeling of graph topological features and hardware resource constraints. By dynamically coordinating graph reorganization and hardware resource allocation, GNNmap resolves the intrinsic mismatch between irregular graph computations and static hardware configurations. Experimental results demonstrate that GNNmap achieves improvements over existing works, improving inference performance by 1.47× to 62.8×, resource efficiency by 1.15× to 3.06×, and energy efficiency by 1.34× to 3.50×.

  • Research Article
  • 10.1145/3760781
FT-DAG: An Efficient Full-Topology DAG Generator with Controllable Parameters
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Yinjie Fang + 8 more

Directed Acyclic Graph (DAG) models are extensively utilized across fields such as automotive, wireless communication, and deep learning to capture inherent functional dependencies. The topology of a DAG has a significant impact on the performance of scheduling and resource management algorithms applied to it. Hence, it is imperative to generate all DAG topologies within the parameter ranges pertinent to an application domain for impartial evaluation of such algorithms. Unfortunately, existing DAG generators capable of offering full topology coverage have limited scalability and few controllable parameters. This work presents open-source FT-DAG, an efficient and formally verified full-topology DAG generator that can control all major parameters, including the longest length, shortest length, width, jump layer, jump level, in-degree, out-degree, and shape value, as well as the number of nodes and edges. Experiments show that when the number of nodes is larger than 20, FT-DAG provides at least two orders of magnitude speedup over the state of the art, and even larger speedups over other generators. FT-DAG scales to 100 nodes in a typical industrial case study within hours.
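FT-DAG's verified full-topology enumeration is not reproduced here, but the simpler baseline idea of generating acyclic graphs with controllable layer count, width, and edge density can be sketched as a layered random generator (names and parameters are illustrative; real evaluations need FT-DAG's coverage guarantees):

```python
import random

def layered_dag(num_layers, width, edge_prob, seed=0):
    """Generate a random layered DAG: nodes are grouped into layers and
    edges only go from one layer to the next, guaranteeing acyclicity.
    Returns (nodes, edges) with nodes numbered 0..num_layers*width-1."""
    rng = random.Random(seed)
    layers = [list(range(i * width, (i + 1) * width))
              for i in range(num_layers)]
    edges = []
    for i in range(num_layers - 1):
        for u in layers[i]:
            for v in layers[i + 1]:
                if rng.random() < edge_prob:
                    edges.append((u, v))
    nodes = [n for layer in layers for n in layer]
    return nodes, edges
```

The weakness FT-DAG addresses is visible here: a random generator like this samples only a biased slice of the topology space, whereas fair algorithm evaluation needs every topology within the given parameter ranges.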

  • Research Article
  • Cited by 1
  • 10.1145/3761813
SAPar: A Surrogate-Assisted DNN Partitioner for Efficient Inferences on Edge TPU Pipelines
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Binqi Sun + 6 more

Pipelining deep neural networks (DNNs) across multiple Edge Tensor Processing Units (TPUs) can enhance on-device performance by increasing the capacity for DNN parameter caching and enabling pipeline parallelism. Effective deployment on pipelined Edge TPUs requires a partitioning tool to divide the DNN into segments, each assigned to a different Edge TPU in the pipeline. Achieving balanced workload distribution across these segments is crucial for optimal timing performance. However, workload balancing across Edge TPUs is challenging, as DNN execution time is influenced by proprietary hardware architecture and compiler internals, forming a black-box function inaccessible to partitioning tools. To address this challenge, this article introduces SAPar, a new surrogate-assisted DNN partitioner that integrates a neighborhood search engine with a surrogate-assisted evaluator for effective and efficient DNN partitioning. The neighborhood search engine systematically explores the decision space, guided by knowledge obtained from empirical insights and neighborhood evaluation feedback provided by the surrogate-assisted evaluator. The evaluator cooperatively applies an accurate yet time-consuming latency profiler and an efficient graph transformer-based surrogate model, achieving both precision and scalability. Experiments on real Edge TPU hardware demonstrate that SAPar achieves significantly better pipeline performance than Google’s current profiling-based partitioner with an 8.82× to 110× speedup in partitioning time. Moreover, SAPar reduces the bottleneck latency by 8.93% to 44.15% across five classic DNN models compared with a state-of-the-art reinforcement learning-based partitioner.
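The balancing objective SAPar targets, splitting a chain of DNN layers into k contiguous segments so the slowest segment (the pipeline bottleneck) is as fast as possible, has a classic exact solution via binary search on the bottleneck value. The sketch below assumes per-layer latencies are known and additive, which is exactly the assumption SAPar drops by learning latencies with a surrogate model:

```python
def balanced_partition(latencies, k):
    """Split per-layer latencies into k contiguous segments, minimizing
    the bottleneck (maximum segment sum). Binary-searches the smallest
    cap for which a greedy left-to-right packing needs <= k segments."""
    def feasible(cap):
        segments, current = 1, 0
        for t in latencies:
            if t > cap:
                return False  # a single layer already exceeds the cap
            if current + t > cap:
                segments, current = segments + 1, t
            else:
                current += t
        return segments <= k

    lo, hi = max(latencies), sum(latencies)
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo  # the minimal achievable bottleneck latency
```

On real Edge TPUs the per-segment latency is not a sum of per-layer costs (compiler and cache effects make it a black box), which is why SAPar pairs its search with a learned latency evaluator instead.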

  • Research Article
  • 10.1145/3761807
Lemonade: Learning-based Heterogeneous Metadata Offloading for Disaggregated Memory
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Zeming Ma + 5 more

Direct Access (DA) in Disaggregated Memory (DM) is a promising solution that meets the high-performance requirements of AI applications. However, it lacks effective support for metadata management, making metadata operations the major bottleneck. To address this, we propose Lemonade, a learning-based heterogeneous metadata offloading scheme for disaggregated memory. Lemonade splits the metadata into highly regular and irregular parts, offloading the former to the client to avoid remote queries, and enabling request redirection in the SmartNIC for the latter to ensure cost-effective correction and updates. Evaluations under microbenchmark and YCSB workloads indicate that Lemonade reduces latency by 72.8% and achieves a 1.43× increase in throughput compared to state-of-the-art systems.

  • Research Article
  • 10.1145/3762649
System Scenario-Based Design of the Last-Level Cache in Advanced Interconnect-Dominant Technology Nodes
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Mahta Mayahinia + 7 more

Feature-size reduction of the Front End of Line (FEoL) and Back End of Line (BEoL) elements, i.e., transistors and interconnects, has been the main enabler of next-generation computing systems. The decreasing cross-sectional area of interconnects in advanced technology nodes, however, comes with a drastic increase in parasitic resistance, substantially impacting the overall energy efficiency and performance of the computer system. Mitigating this high parasitic resistance within an advanced-node static RAM (SRAM)-based last-level cache (LLC) is the main target of this article. To achieve this target, we augment the LLC interconnect with some degree of reconfiguration by utilizing a dynamic segmented bus (DSB). With DSB, the interconnect segments that are most actively used for a given workload can be shortened, on average, contributing to a smaller capacitive load. Hence, the efficient reconfiguration of an LLC interconnect strongly depends on the LLC demands of the application. To account for this workload dependency, we design the required microarchitectural support in an end-to-end application-to-technology flow. By optimizing the overhead of DSB switches and additional hardware modules, the SRAM-based LLC with DSB-augmented intra-macro interconnect achieves 33% energy savings and a 16% reduction in total access time across eight representative workloads, with a negligible area overhead of less than 0.4%.

  • Research Article
  • 10.1145/3758323
Towards Efficient Multi-Frame Clustering in Response Time Analysis for Large Object Communication
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Jonas Peeck + 2 more

In autonomous systems, growing volumes of application data, primarily related to perception tasks, have to be transmitted over communication infrastructures that provide higher data rates. Knowledge and exploitation of the clustered structure of multi-frame application data have been shown to reduce the interference between different real-time-critical communication streams, as recently demonstrated for synchronous systems. However, the accompanying increase in frame counts will foreseeably make the corresponding analysis impractical due to frame-dependent processing times. As a solution, building on the synchronous example mentioned above, we demonstrate how to compose multi-frame transmissions into larger frame clusters in the analysis and thereby decouple the analysis complexity from the application data sizes. Based on an automotive and industrial TSN use case, our results show comparable analytical response times and very low computation times at arbitrary data rates and sizes. The approach thus enables the deployment of efficient configuration methods that arbitrate large data samples transmitted through the network as one whole frame cluster. In addition, we have made the developed analysis openly accessible.

  • Research Article
  • 10.1145/3759251
Deductive Verification of Cooperative RTOS Applications
  • Sep 26, 2025
  • ACM Transactions on Embedded Computing Systems
  • Philip Tasche + 2 more

Embedded systems are used in many safety-critical domains, including in medicine, traffic, and critical infrastructure. Due to the strict timing requirements such systems usually have to fulfill, they often run on real-time operating systems (RTOS). As the RTOS influences the function and the timing behavior of the system, it becomes important to rigorously ensure the correctness and safety of applications running on them while taking into account the semantics of the operating system. Existing verification approaches are either limited to specific RTOS components or based on explicit state space exploration techniques such as model checking, which do not scale well for concurrent or timed applications. In this article, we propose a deductive approach to verify crucial safety properties about applications written for the widely-used RTOS FreeRTOS using the VerCors verifier. Our key ideas are threefold: (1) We provide a formalization of a wide variety of FreeRTOS features and an automatic encoding of FreeRTOS applications for verification with VerCors. (2) We adapt and enhance an existing approach for automatic invariant generation to largely automate the typically high-effort verification process. (3) We present a systematic technique to verify both functional and timing-related properties of cooperative RTOS applications. We demonstrate the applicability of our approach on a FreeRTOS demo application as well as an adaptive cruise control system.