- Research Article
- 10.1145/3778862
- Nov 28, 2025
- ACM Transactions on Embedded Computing Systems
- Yinjie Fang + 2 more
Directed Acyclic Graphs (DAGs) are essential for modelling task dependencies across various domains, including industrial automation, autonomous systems, and many-core processors. DAG generators are widely used to evaluate the performance of scheduling algorithms. This survey systematically analyzes existing DAG generation algorithms, evaluating their search space, efficiency, and uniformity. We categorize existing DAG generators into three primary methodologies: Triangular Matrix-Based (TMB), Layer-by-Layer (LBL), and Poset-Based (POB) methods. Furthermore, we introduce a dedicated testing tool that quantitatively assesses various DAG generators’ performance. Experimental results show each DAG generator’s strengths and limitations, offering insights into their applicability in real-world scheduling and resource allocation problems. These findings provide a foundation for selecting suitable DAG generation methods for benchmarking scheduling heuristics and improving DAG task modelling.
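The Triangular Matrix-Based (TMB) method named in the abstract can be sketched as follows — a minimal illustration (function and parameter names are ours, not from the survey): each entry of a strictly upper-triangular adjacency matrix is drawn as an independent Bernoulli trial, so edges only run from lower to higher node indices and the result is acyclic by construction.

```python
import random

def tmb_random_dag(n: int, p: float, seed=None):
    """Sketch of a Triangular Matrix-Based (TMB) DAG generator: sample each
    strictly upper-triangular adjacency entry with probability p. Since every
    edge (i, j) satisfies i < j, no cycle can form."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

edges = tmb_random_dag(6, 0.5, seed=42)
assert all(i < j for i, j in edges)  # acyclic by construction
```

The edge probability p controls expected density; the node ordering fixed by the triangular matrix is what bounds the method's search space relative to LBL and POB generators.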
- Research Article
- 10.1145/3777373
- Nov 17, 2025
- ACM Transactions on Embedded Computing Systems
- Guanglin Zhang + 3 more
The emergence of deep neural network (DNN) services deployed on edge servers has spurred research into efficiently provisioning inference services. However, previous studies have neglected the implications of different types of DNNs and varying quality of service (QoS) requirements on QoS violation rates. In this paper, we propose a novel framework, named Coinf, for scheduling heterogeneous DNN inference tasks on edge servers. Coinf has the following four advantages to effectively handle attribute analysis, performance balancing, parallel execution, and model accuracy: 1) It enables efficient profiling of domain-specific attributes of various DNN tasks during the offline stage, achieved by constructing a regression model to predict the end-to-end latency of each task. 2) By utilizing the predicted execution time, Coinf achieves a commendable balance among inference latency, system throughput, and QoS violation rate. 3) It employs emerging deep reinforcement learning (DRL) to aggregate individual DNN tasks into batches, enabling concurrent parallel execution. 4) Coinf preserves the accuracy of the provided DNN models by not modifying them. Numerical experiments are conducted to validate the reliability and efficiency of Coinf in handling heterogeneous inference tasks.
- Research Article
- 10.1145/3776744
- Nov 13, 2025
- ACM Transactions on Embedded Computing Systems
- Xinbing Zhou + 7 more
Micro base stations, with limited antennas and extensive deployment, require scaled-down hardware. Software-defined radio solutions (e.g., CPU, many-core systems, GPU) offer flexibility but incur high area and power costs, while traditional DSP lacks efficient acceleration for smaller configurations. The key challenge for micro base stations is achieving minimal area and power overhead while meeting 5G requirements. This paper presents a hardware-software co-designed architecture, Sayram, which minimizes overhead for 5G physical layer processing. Sayram integrates an instruction fusion mechanism, along with the compiler for simplified programming, a Vector Indirect Addressing Memory (VIAM) to minimize memory access cycles, and an improved vector register design to accelerate small-scale matrix computation, thereby improving overall processor efficiency. Operating at 1 GHz, Sayram achieves 158 GOPS with a 1.18 mm² area, supporting 2T2R and 4T4R Physical Uplink Shared Channel (PUSCH) processing in single-core and dual-core modes, respectively. Evaluations show that Sayram’s area efficiency is 3× and 9× higher than traditional DSP and CGRA architectures, respectively, with power efficiency improvements of 44× and 6×. Sayram’s energy and area efficiency surpass CPU solutions by orders of magnitude.
- Research Article
- 10.1145/3774649
- Nov 11, 2025
- ACM Transactions on Embedded Computing Systems
- Muhammad Danish Tehseen + 3 more
In this work, we present our intelligent SSD, SeeSSD, an energy-efficient computational SSD for a real-time object detection system. SeeSSD embeds an FPGA-based CNN processing engine and firmware that performs the convolutional operation on the target image. SeeSSD processes the image data at the storage before sending it to the host. This reduces the amount of data transferred to the host and lowers the data movement overhead, thus reducing transfer time and saving power. By using our SeeSSD system and YOLO_Embed, an object detection neural network model, we outperform the fastest YOLO model for an embedded controller, YOLO-Lite, in terms of performance, accuracy, and energy efficiency. YOLO (You Only Look Once) models are a series of one-stage object detection neural models that have become very popular due to their fast speed and high accuracy. The contributions of this work include designing and implementing our SeeSSD system with a lightweight object detection model, YOLO_Embed, to reduce the data movement overhead, perform real-time inference, and lower the overall power consumption. We implemented the entire software stack associated with the SeeSSD system: an on-device CNN acceleration engine implemented on the FPGA, an object identification interface for SeeSSD using YOLO_Embed, and an embedded software layer in SeeSSD for on-device convolutional processing. We evaluated our YOLO_Embed model on object detection dataset benchmarks such as PASCAL VOC 2012, where it achieves 38.1% mAP (mean Average Precision). Our system performs inference in 0.21 seconds while reducing power consumption by approximately 1.2× and 1.4× compared with CPU-only and CPU+GPU systems, respectively. We also reduce the data movement overhead by 24× for a single target image.
- Research Article
- 10.1145/3776742
- Nov 11, 2025
- ACM Transactions on Embedded Computing Systems
- Yu-Zheng Su + 2 more
Mobile applications have been seamlessly integrated into our daily lives. When using mobile devices, the energy efficiency of these applications plays a pivotal role in enhancing the user experience. However, incorporating power conservation strategies into the toolkit of user-interface (UI) developers for mobile applications has received almost no research attention. To address the unique requirements of UI developers, this manuscript studies the fusion of power conservation techniques and UI guidance principles to formulate an innovative framework aimed at reducing power consumption within the UI. The framework begins with the extraction of the displayed component configuration during the development phase, drawing from UI previews without depending on any development environment or deployment equipment. Subsequently, we evaluate the UI guidance of the displayed components, taking into consideration the human visual system. To recommend a power-saving configuration to developers, the final step generates a configuration that not only curtails power consumption but also preserves the global and local guidance. To validate the efficacy of our framework, we conducted evaluations using eight distinct UI previews, including light and dark modes, on a commercial smartphone. The results obtained from these evaluations are very promising.
- Research Article
- 10.1145/3773032
- Nov 11, 2025
- ACM Transactions on Embedded Computing Systems
- Anuj Justus Rajappa + 7 more
Hyperdimensional Computing (HDC) is an emerging AI algorithm, touted to be an efficient, neuro-inspired, and reliable alternative to neural networks for Edge AI. HDC utilizes hypervectors with several thousand elements; the number of elements in these hypervectors denotes the HDC dimension. This dimension can be optimized for improving the efficiency and reliability of HDC inference against errors such as bit-flips, which can be caused by environmental radiation-induced soft errors. We hypothesize that, by reducing the runtime chip area and execution time utilized by HDC inference through lowering dimensionality, both efficiency and reliability against soft error-induced bit-flips can be simultaneously improved while trading off a negligible amount of accuracy and error threshold. We tested our hypothesis by executing an HDC inference algorithm with two different dimension values, 10000 (10k) and 1024, on a commercially available, low-power, bare-metal ARM platform with a Cortex-M4 processor. We conducted the efficiency analysis by measuring the CPU cycles and energy required for executing the algorithm, and the reliability analysis using real-world atmospheric-like neutron radiation from the ChipIr facility in Oxfordshire, UK. Analyses revealed that, by lowering the HDC dimension from 10k to 1024, the reliability of HDC inference against soft error-induced bit-flips was 3.5 times better and efficiency improved by more than 16 times. This innovative observation contrasts with the prevailing understanding in the community that increasing the HDC dimension always improves robustness or reliability. To the best of our knowledge, our work is the first to study the reliability of HDC inference using real-world radiation.
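The role of the HDC dimension can be seen in a minimal nearest-class-hypervector inference sketch (binary hypervectors with Hamming similarity; function names are ours, and the paper's actual encoding and platform code will differ):

```python
import random

def random_hv(dim: int, rng: random.Random) -> list:
    """A random binary hypervector; `dim` is the HDC dimension."""
    return [rng.randrange(2) for _ in range(dim)]

def hamming_similarity(a: list, b: list) -> float:
    """Fraction of matching elements; 1.0 for identical hypervectors."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def classify(query: list, class_hvs: dict) -> str:
    """Nearest-class-hypervector inference. Both the memory touched and the
    per-query comparison work scale linearly with `dim`, which is why
    lowering it (e.g., 10k -> 1024) cuts cycles, energy, and the exposed
    bit-flip surface."""
    return max(class_hvs, key=lambda label: hamming_similarity(query, class_hvs[label]))
```

Unrelated random hypervectors have similarity near 0.5, so classification margins stay usable even at the lower dimension — consistent with the small accuracy trade-off the abstract reports.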
- Research Article
- 10.1145/3774891
- Nov 5, 2025
- ACM Transactions on Embedded Computing Systems
- Jiajie Wang + 2 more
Correct synchronisation in a distributed system is difficult. One effective approach to the problem is to employ a logical clock in the high-level design, which ensures deterministic concurrency. However, most real-time network protocols only provide the means for physical time synchronisation. Therefore, in the end, the inherent logical clock has to be compiled away and mapped to physical time, losing many of its benefits. We propose a new middleware called softtide, which aims to facilitate the implementation and deployment of systems with an inherent logical clock. The idea is to provide a global logical clock through an API, as the basis for scheduling task executions and message transmissions. At the same time, softtide maintains a relatively stable relation between the logical clock and physical time, to limit the jitter between devices. The synchronisation mechanism is inspired by a recent protocol called bittide, which features a decentralised architecture. Softtide has the following mathematical properties: (1) Logical synchrony, where the transmission delays between devices are constant in logical time. (2) Its behaviour is deterministic even in the presence of network delays, differing clock frequencies, and faults. (3) Finally, softtide is decentralised in nature, allowing devices to dynamically join and leave. The synchronised logical clock provided by softtide simplifies the design, compilation, and validation of real-time distributed systems. Empirically, we show that softtide always produces deterministic results in real-world deployments.
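The core idea — a logical tick counter kept in a stable relation to physical time — can be sketched on a single device (the class and method names are illustrative, not softtide's API, and the real middleware synchronises ticks across devices rather than against a local timer):

```python
import time

class LogicalClock:
    """Minimal single-device sketch: logical ticks pinned to a nominal
    physical period, so the logical-to-physical mapping stays roughly
    affine and per-tick jitter does not accumulate."""

    def __init__(self, period_s: float):
        self.period_s = period_s
        self.epoch = time.monotonic()
        self.tick = 0

    def next_tick(self) -> int:
        """Advance one logical tick and wait until its nominal physical
        time. Sleeping to an absolute deadline (epoch + tick * period)
        absorbs scheduling jitter instead of compounding it."""
        self.tick += 1
        deadline = self.epoch + self.tick * self.period_s
        delay = deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        return self.tick
```

Scheduling task releases and message sends on tick boundaries is what makes transmission delays constant when measured in logical time, independent of physical network delay.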
- Research Article
- 10.1145/3774886
- Nov 5, 2025
- ACM Transactions on Embedded Computing Systems
- Gareth Callanan + 1 more
Streaming applications are often described using dataflow actor models with a fixed network structure, allowing for static analysis and efficient hardware implementation. However, this fixed structure hinders scalability and design space exploration. This paper investigates a representative dataflow toolchain, the StreamBlock compiler for the CAL actor language, along with its Actor Machine (AM) Intermediate Representation (IR), identifying limitations in handling parametric application specifications. To address these limitations, we extend CAL to support parametric actor and network specifications, allowing a single description to capture multiple problem sizes. We demonstrate these extensions with a parametric QR Decomposition application and benchmarks from the Savina Actor Benchmark Suite. When compiling actor specifications to software or hardware, the AM IR is used for optimisation purposes. The AM defines a controller specifying how actors should behave at runtime. We show that as the complexity of the actor increases, the AM model scales poorly, causing compilation to fail. In this work, we improve the AM model, enabling the compilation of actors up to six times larger than previously possible. For specifications targeting FPGAs, we offer an alternative to the AM designed to take better advantage of available hardware parallelism. Our results show that this controller scales better with the size of the actor compared to the AM controller, reducing latency significantly for a slight increase in resources used. These contributions extend CAL’s applicability, making it easier to specify and scale a broader range of streaming applications.
- Research Article
- 10.1145/3772371
- Oct 21, 2025
- ACM Transactions on Embedded Computing Systems
- Lukas Liedtke + 3 more
The number of Internet of Things (IoT) devices is increasing exponentially, and it is environmentally and economically unsustainable to power all these devices with batteries. The key alternative is energy harvesting, but battery-less IoT systems require extensive evaluation to demonstrate that they are sufficiently performant across the full range of expected operating conditions. IoT developers thus need an evaluation platform that (i) ensures that each evaluated application and configuration is exposed to exactly the same energy environment and events, and (ii) provides a detailed account of what the application spends the harvested energy on. We therefore developed the EStacker evaluation platform which (i) enables fair and repeatable evaluation, and (ii) generates energy stacks. Energy stacks break down the total energy consumption of an application across hardware components and application activities, thereby explaining what the application specifically uses energy on. We augment EStacker with the ST-SP optimization which, in our experiments, reduces evaluation time by 6.3× on average while retaining the temporal behavior of the battery-less IoT system (average throughput error of 7.7%) by proportionally scaling time and power. We demonstrate the utility of EStacker through two case studies. In the first case study, we use energy stack profiles to identify a performance problem that, once addressed, improves performance by 3.3×. The second case study focuses on ST-SP, and we use it to explore the design space required to dimension the harvester and energy storage sizes of a smart parking application in roughly one week (7.7 days). Without ST-SP, sweeping this design space would have taken well over one month (41.7 days).
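One plausible reading of "proportionally scaling time and power" — our assumption for illustration, not the paper's stated formulation — is that compressing each interval's duration by a factor k while scaling its power by k preserves the energy of every interval, so the energy environment is unchanged while the evaluation finishes k times faster:

```python
def st_sp_scale(trace, k: float):
    """Illustrative sketch of scaled-time/scaled-power evaluation (the
    function name and trace format are ours). `trace` is a list of
    (duration_s, power_w) intervals; dividing duration by k and multiplying
    power by k leaves each interval's energy (power * time) unchanged."""
    return [(d / k, p * k) for d, p in trace]

trace = [(10.0, 0.2), (5.0, 0.05)]       # 2.0 J and 0.25 J
scaled = st_sp_scale(trace, 6.3)          # 6.3x shorter, same energy
assert all(abs(d * p - sd * sp) < 1e-9
           for (d, p), (sd, sp) in zip(trace, scaled))
```

Under this reading, the relative shape of the harvesting and consumption profile is preserved, which is consistent with the low throughput error (7.7%) the abstract reports for the accelerated runs.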
- Research Article
- 10.1145/3772281
- Oct 18, 2025
- ACM Transactions on Embedded Computing Systems
- Xiaojie Zhang + 4 more
Recent research highlights a critical gap between the computing capabilities of modern IoT devices and the computational demands of Artificial Intelligence (AI) applications. The edge computing paradigm offers a promising solution by providing reliable and fast computing services close to the data source. Due to their inherent resource constraints, a single edge server often cannot handle the heavy computational load from nearby IoT devices, requiring multi-server collaboration. However, achieving efficient cooperation is challenging due to dynamic workload fluctuations and uneven data distribution. To address these issues, this paper presents a novel solution that involves optimizing task execution paths and resource management to enhance the performance of edge servers, particularly in scenarios with unbalanced data or uneven distribution of IoT devices. Our approach not only deploys multiple edge servers, but also focuses on the intelligent allocation and management of computing tasks. Specifically, we propose c2mec, a cooperative multi-split and multi-hop edge computing framework. The proposed c2mec framework uses problem decomposition to efficiently decouple the variables that need to be optimized. In addition, c2mec employs a multi-agent Deep Reinforcement Learning (DRL) based algorithm to mitigate the negative impact of decoupling and provides a flexible data splitting strategy. Finally, c2mec designs an energy-aware training method for IoT devices to reduce their long-term training cost. Through our comprehensive experimental results, we demonstrate how c2mec achieves notable improvement in energy saving across different scenarios compared to existing solutions, such as full and partial task offloading.