- Research Article
- 10.1145/3812550
- Apr 25, 2026
- ACM Transactions on Design Automation of Electronic Systems
- Swati Upadhyay + 1 more
Non-Volatile Memories (NVMs) offer new opportunities for main memory due to their great scalability and other favorable attributes. However, their inherent write drawbacks (long latency, high energy consumption), combined with their limited write endurance, mean that their inclusion in the memory hierarchy requires careful handling. Left unaddressed, these drawbacks not only shorten the lifespan of NVMs but also make them an expensive option. In this work, we propose a technique called EnVector that exploits the intrinsic uniformity in memory data to encode the cacheline data before the memory write takes place. By using well-designed fixed vectors, it minimizes the variation between the incoming and the stored data. This avoids many bit flips, reducing the likelihood of memory failure from an excessive number of writes and, consequently, improving performance while lowering latency and energy requirements. We empirically demonstrate that EnVector extends NVM lifetime by 37% and improves performance by 15% over the baseline.
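The general idea of encoding a write against fixed vectors to reduce bit flips can be illustrated with a minimal sketch. This is not EnVector's actual encoding: the vector set `FIXED_VECTORS`, the 8-bit word size, and the XOR-based scheme are assumptions chosen for illustration, and the sketch ignores the cost of storing the selected vector index alongside the data.

```python
# Hypothetical sketch of flip-minimizing write encoding (not the paper's scheme).
# Each candidate fixed vector is XORed with the incoming word; the candidate
# whose result differs least from the currently stored word is written.

FIXED_VECTORS = [0x00, 0xFF, 0x0F, 0xF0]  # illustrative values only


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def encode_write(stored: int, incoming: int, vectors=FIXED_VECTORS):
    """Return (vector_index, encoded_word, bit_flips) with the fewest flips."""
    best = None
    for idx, vec in enumerate(vectors):
        encoded = incoming ^ vec
        flips = hamming(stored, encoded)
        if best is None or flips < best[2]:
            best = (idx, encoded, flips)
    return best


def decode_read(encoded: int, idx: int, vectors=FIXED_VECTORS) -> int:
    """Recover the original word by XORing with the same fixed vector."""
    return encoded ^ vectors[idx]
```

In this toy setting, writing `0x0F` over a stored `0xF0` would flip all eight bits if written directly, while encoding with the vector `0xFF` leaves the stored cells unchanged; a real design would also have to account for the bits that record which vector was used.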
- Research Article
- 10.1145/3812548
- Apr 24, 2026
- ACM Transactions on Design Automation of Electronic Systems
- Jiawei Liu + 6 more
Understanding the functionality of Boolean networks is crucial for processes such as functional equivalence checking, logic synthesis, and malicious logic identification. With the proliferation of deep learning in electronic design automation (EDA), graph neural networks (GNNs) are widely used to embed and-inverter graphs (AIGs), a standard form of Boolean networks, into vectorized representations. A key challenge in applying GNNs for Boolean representation is that although GNNs can effectively encapsulate the structural properties of AIGs, they struggle to efficiently capture Boolean logic functionality. In this work, we focus on breaking this bottleneck by enhancing the functional representation capability of GNNs, proposing PolarGate, an efficient solution that not only aligns message passing with AIG logical functionality but also effectively integrates global information. Leveraging the intrinsic ambipolar states (0 and 1) of AIG nodes, PolarGate maps gate behavior into an ambipolar state space, customizes differentiable logical operators, and designs a functionality-aware message passing strategy. To further capture global circuit information, PolarGate integrates a structure-aware preprocessing module and a global linear attention module, transcending the locality constraint of message passing. Experimental results on two functionality-related basic tasks (signal probability prediction and truth-table distance prediction) and a downstream task (logic equivalence prediction) show that PolarGate outperforms state-of-the-art GNN-based methods.
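To make the notion of differentiable logic over a two-pole state space concrete, here is a minimal sketch. It is an assumption for illustration, not PolarGate's operators: logic 0 and 1 are mapped to -1 and +1, NOT is negation, and AND uses the standard multilinear extension. The function names and the gate-list format are hypothetical.

```python
# Hypothetical sketch: evaluating an AIG with smooth operators in a
# {-1, +1} encoding (0 -> -1, 1 -> +1). Not the paper's operators.


def to_polar(bit: int) -> float:
    """Map a Boolean 0/1 to the polar values -1.0/+1.0."""
    return 2.0 * bit - 1.0


def polar_not(x: float) -> float:
    """Inversion is a sign flip in the polar encoding."""
    return -x


def polar_and(a: float, b: float) -> float:
    """Multilinear extension of AND; exact on {-1, +1} and differentiable."""
    return (a + b + a * b - 1.0) / 2.0


def eval_aig(inputs, gates, output):
    """Evaluate AND gates given in topological order.

    inputs: dict mapping input name -> polar value
    gates:  list of (name, (src0, inverted0), (src1, inverted1))
    output: (name, inverted)
    """
    values = dict(inputs)
    for name, (s0, inv0), (s1, inv1) in gates:
        x = polar_not(values[s0]) if inv0 else values[s0]
        y = polar_not(values[s1]) if inv1 else values[s1]
        values[name] = polar_and(x, y)
    src, inv = output
    return polar_not(values[src]) if inv else values[src]


# XOR built from AND gates with inverted edges:
# n1 = a AND (NOT b), n2 = (NOT a) AND b, out = NOT((NOT n1) AND (NOT n2))
XOR_GATES = [
    ("n1", ("a", False), ("b", True)),
    ("n2", ("a", True), ("b", False)),
    ("n3", ("n1", True), ("n2", True)),
]
XOR_OUT = ("n3", True)
```

Because the operators are polynomials, they also accept values strictly between the poles, which is what allows gradients to flow through gate evaluations in a learned model.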
- Research Article
- 10.1145/3810248
- Apr 20, 2026
- ACM Transactions on Design Automation of Electronic Systems
- Zhenlin Pei + 5 more
The emerging graphene interconnect technology is expected to be a promising alternative to traditional interconnect technologies due to its superior conductivity. Because of the dominant impact of global routing on the overall performance of a widely used application, the Field-Programmable Gate Array (FPGA), this work investigates the potential advantage of using graphene-based interconnects to replace conventional copper (Cu) for global routing. Furthermore, a scalability analysis is performed, and the effects of technology node scaling from 7 nm to 1.5 nm are evaluated using lateral gate-all-around field-effect transistors (LGAAFETs) within the proposed system-technology co-design (STCO) framework. Key material-level parameters, including the mean free path (MFP), contact resistance, and the number of graphene layers, are systematically analyzed. Benchmark simulations demonstrate that a 32% improvement in the energy-delay product (EDP) is achieved with graphene-based interconnects compared to Cu counterparts at the 7 nm technology node, and an additional 46% reduction is observed at the 1.5 nm technology node. It is important to note that this work is an exploratory STCO study incorporating cross-layer design considerations, with a focus on long-term trends rather than short-term manufacturability.
- Research Article
- 10.1145/3798107
- Apr 20, 2026
- ACM Transactions on Design Automation of Electronic Systems
- Hyuksoo Kim + 2 more
Recently, there has been a growing demand for real-time intelligent systems that can execute multiple deep neural network (DNN) models simultaneously for tasks such as object recognition, detection, and tracking. However, running multiple DNNs simultaneously in resource-constrained embedded environments can lead to resource contention due to limited system resources. This can result in execution delays that cause critical issues in latency-sensitive processing. This paper proposes a dynamic scheduling technique that divides DNN models into functional units called blocks, which are then configured as execution units. Additionally, when running different models in parallel, it identifies blocks that actually increase execution time and controls them to run sequentially. Furthermore, to minimize execution delays while maintaining accuracy, we propose a dynamic lightweight replacement technique that replaces blocks with highly anticipated execution delays with lightweight blocks at runtime. This technique uses LAG, a metric that quantifies the degree of execution delay for each block, to dynamically adjust the balance between execution delays and accuracy. Experimental results show that when running multiple heterogeneous DNNs simultaneously on a commercial off-the-shelf board, the proposed technique improves latency by up to 29.3%, while maintaining 90% of baseline accuracy.
- Research Article
- 10.1145/3802924
- Apr 13, 2026
- ACM Transactions on Design Automation of Electronic Systems
- Jingyi Wang + 7 more
As the next generation of continuous-flow lab-on-a-chip platforms, Fully Programmable Valve Array (FPVA) biochips are revolutionizing traditional biochemical experiments with their remarkable flexibility and programmability. Due to the intricate interplay between chip architecture and bioassay protocols, architectural synthesis has emerged as a critical stage in the design of such chips, encompassing three core phases: high-level synthesis, component placement, and flow channel routing. Over the past decade, design automation for FPVA biochips has attracted significant interest and achieved notable progress. However, existing architectural synthesis algorithms for FPVA biochips typically treat these phases in isolation rather than as an integrated whole, leading to increased conflicts, resource redundancy, and potential design failures. To address these challenges, this paper proposes a high-quality and efficient one-pass architectural synthesis algorithm called OneSyn for FPVA biochips. By unifying all design phases into an “organic whole”, OneSyn eliminates gaps among these phases, yielding a more efficient and cost-effective biochip architecture. First, the one-pass synthesis for FPVA biochips is formulated as an Integer Linear Programming (ILP) model with a resource-aware objective and constraints on scheduling, placement, and routing, establishing a unified optimization framework that effectively prevents inter-phase conflicts. Second, various graph-based pruning strategies are proposed to eliminate redundant constraints in the ILP model based on component–reagent relationships in the sequencing graph. Consequently, the solution-space complexity is reduced, CPU time decreases, and overall efficiency improves. Experimental results demonstrate that, compared with related algorithms, OneSyn effectively optimizes both the total completion time of bioassays and the total path length of fluid transport.
- Research Article
- 10.1145/3806395
- Apr 13, 2026
- ACM Transactions on Design Automation of Electronic Systems
- Dewan Saiham + 2 more
Fully Homomorphic Encryption (FHE) enables secure computation directly over encrypted data, making it highly valuable for domains such as healthcare, finance, and cloud services. However, its deployment is still constrained by immense computational overheads and critical memory bandwidth limitations, particularly in CKKS bootstrapping and the Number Theoretic Transform (NTT). While recent hardware accelerators have improved arithmetic throughput, they remain constrained by inefficient memory transfers and off-chip communication overhead. To overcome these limitations, we introduce OptoLink, a photonic interconnect architecture at the chiplet scale. By exploiting Wavelength Division Multiplexing (WDM) and Space Division Multiplexing (SDM), OptoLink delivers ultra-high throughput and low-latency data transfer. Our design achieves up to 1.6 TB/s of bandwidth across 128 optical channels, representing a 300× latency reduction compared with a conventional electronic network. In addition, OptoLink provides efficient broadcast and multicast capabilities, significantly reducing redundant data movement. Using an extended FHE simulation framework, we show that OptoLink improves CKKS bootstrapping throughput by up to 11× on HEAX and 1.6× on F1 and ARK, while reducing memory transfer delays by orders of magnitude. Furthermore, encrypted machine learning workloads, including logistic regression training and ResNet-20 inference, benefit from higher throughput and alleviated bandwidth pressure, demonstrating the potential of OptoLink to enable practical large-scale FHE acceleration.
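As a back-of-the-envelope reading of the headline bandwidth figure, the sketch below divides the stated aggregate bandwidth across the stated channel count. It assumes decimal units (1 TB = 10^12 bytes) and an even split across channels, neither of which the abstract confirms; the 1 GB payload is a hypothetical size used only to show the calculation.

```python
# Rough arithmetic on the figures quoted in the abstract. The even per-channel
# split and decimal units are assumptions, not statements from the paper.

TOTAL_BANDWIDTH_BYTES_PER_S = 1.6e12  # "up to 1.6 TB/s"
OPTICAL_CHANNELS = 128                # "128 optical channels"

# Per-channel bandwidth if the aggregate were shared evenly.
per_channel_bytes_per_s = TOTAL_BANDWIDTH_BYTES_PER_S / OPTICAL_CHANNELS


def transfer_time_s(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical 1 GB payload moved at the peak aggregate rate.
example_time_s = transfer_time_s(1e9, TOTAL_BANDWIDTH_BYTES_PER_S)
```

Under these assumptions each channel would carry 12.5 GB/s, and a 1 GB transfer at the peak aggregate rate would take about 0.6 ms before any latency or overhead is counted.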
- Research Article
- 10.1145/3786351
- Apr 13, 2026
- ACM Transactions on Design Automation of Electronic Systems
- Shunyang Bi + 5 more
As the scale and complexity of designs increase, functional verification becomes a critical part of the very-large-scale integration (VLSI) design flow. However, existing processor-based emulation systems suffer from inefficiencies due to misaligned objectives between partitioning and scheduling, which are traditionally treated as separate and independent stages during compilation. To address this issue, we propose ParSCo, a partitioning and scheduling co-optimization framework that explicitly aligns the objectives of both stages by jointly considering cut minimization and topological order balancing (TOB) under multiple constraints. To integrate these objectives and constraints into our framework, we incorporate them into all partitioning and scheduling stages and further develop a set of novel techniques, including TOB-aware coarsening with multiple constraints, global growing initial partitioning with fixed nodes, TopoRefinement, and partitioning-aware scheduling, which collectively enhance the co-optimization process in emulation compilation. Furthermore, we establish theorems that reduce the time complexity of gain calculation and update to O(1), significantly improving the computational efficiency of the whole process. We evaluate the proposed method on public, open-source chip design benchmarks with up to nearly 10 million cells. ParSCo significantly extends ideas and algorithms that first appeared in our previous work TopoOrderPart and achieves a 15% improvement. Extensive experimental results demonstrate the effectiveness of ParSCo, achieving an average improvement of 22.5% in time step reduction, 72% enhancement in TOB, and 55% acceleration in CPU time compared to the state-of-the-art (SOTA) two-stage partitioning and scheduling approach.
- Research Article
- 10.1145/3806057
- Apr 13, 2026
- ACM Transactions on Design Automation of Electronic Systems
- Yuxin Liu + 3 more
Optical and photonic computing systems offer a high-performance, energy-efficient paradigm for next-generation AI hardware, but their scaling hinges on cross-layer hardware-algorithm co-design and advanced automation tools, where CAD modeling is indispensable. Reconstructing parametric CAD models from generic 3D data, such as point clouds, is a problem of practical significance, as it allows non-editable geometry to be modified and reused. However, the current mainstream approach for CAD reconstruction faces two principal challenges: (1) inherent structural defects in sequential command representations, and (2) incomplete feature extraction in deep learning, caused by neglecting the complementary information across multiple modalities. To overcome these limitations, we employ a multi-modal reconstruction network. It integrates the point cloud with its rendered multi-view images as the input information, with a lightweight similarity gating module dynamically fusing features of these two modalities. To address the structural defects, we propose a novel grouped entity structure, with a decoder that separately decodes extrusion entities and corresponding sketches in two stages. Experiments demonstrate that our method achieves a reconstruction Chamfer Distance of 0.002 and reduces the inefficiency rate to 5.49% on about 8,000 test samples of the standard dataset, which are 1/4 and 2/5 of those achieved by the baseline method, respectively. More importantly, we develop an end-to-end practical pipeline that automatically translates the network’s output into fully editable parametric models within industrial CAD software (CATIA V5). This bridge from deep learning to application demonstrates the strong practical value of our work. The code is available at https://gitlink.org.cn/fzhe/GEDNet.
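For readers unfamiliar with the Chamfer Distance metric quoted above, the following minimal sketch computes one common symmetric form: the mean squared nearest-neighbor distance from each cloud to the other, summed over both directions. The abstract does not state which variant (squared or unsquared, summed or averaged, normalized or not) the paper uses, so this definition is an assumption, and the brute-force search is for illustration only.

```python
# One common definition of Chamfer Distance between two point clouds.
# Which variant the paper uses is not stated; this form is an assumption.
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_nearest_sq(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean over src of the squared distance to the nearest point in dst."""
    return sum(min(squared_distance(a, b) for b in dst) for a in src) / len(src)


def chamfer_distance(p: Sequence[Point], q: Sequence[Point]) -> float:
    """Symmetric Chamfer Distance; brute force, O(len(p) * len(q))."""
    return mean_nearest_sq(p, q) + mean_nearest_sq(q, p)
```

A value of zero means every point in each cloud coincides with some point in the other; for large clouds a spatial index such as a k-d tree would normally replace the brute-force nearest-neighbor search.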
- Research Article
- 10.1145/3803547
- Apr 13, 2026
- ACM Transactions on Design Automation of Electronic Systems
- Mohaddeseh Sharei + 3 more
This paper presents the GEMA+ sequence mapping technique, characterized by its minimal sensitivity to the length of input read sequences. Unlike previously proposed methods, including GEMA and other similar learned index mapping techniques, GEMA+ effectively addresses the challenge of non-uniformity in the ascending trend of the mapping speed with the increasing read length. This challenge is tackled by our innovative process that eliminates the need for padding, which has traditionally been a source of performance variability. GEMA+ retains the efficient data structures of the original GEMA, ensuring no additional memory overhead. Additionally, we propose modifications to the hardware architecture of GEMA, resulting in an efficient implementation of GEMA+ on FPGA devices. The performance of the proposed technique is evaluated through post place-and-route simulations, which reveal that GEMA+ achieves a consistent and uniform increase in mapping speed as read length increases. By reducing performance sensitivity to input sequence length, GEMA+ demonstrates an average performance improvement of 42.5% for short reads and 11.8% for long reads, compared to the original GEMA.
- Research Article
- 10.1145/3809134
- Apr 13, 2026
- ACM Transactions on Design Automation of Electronic Systems
- Xiaoyu Song + 4 more
The performance enhancements of coarse-grain reconfigurable cryptographic arrays (CGRCAs) through technology upgrades and increasing chip size are approaching their limits. Enhancing parallel processing capabilities can relieve the computational burden on CGRCAs in high-density computing scenarios. To this end, this paper introduces a novel parallel processing approach named virtual heterogeneous multi-core pipelining (VHMP). VHMP supports virtual heterogeneous multi-core and intra-core multitask pipelining. Informed by hardware virtualization principles and analysis of CGRCA architectures, VHMP constructs virtual computing cores (VCCs) on CGRCA using multi-launch pipelines and implements multitask interleave pipelines within each VCC, enabling parallel processing at pipeline, task, and instruction levels. Furthermore, a hierarchical control mechanism is embedded within VHMP, integrating virtual computing path level and task-level management. This control mechanism allows diverse cryptographic algorithms to run concurrently without being bound to specific types or modes. Finally, applying VHMP to a 32 × 4 CGRCA, 32 heterogeneous VCCs are instantiated, with each core handling up to 16 interleaved tasks. Compared to treating CGRCA as a homogeneous multi-core processor, the VHMP approach achieves an average acceleration of 4.33× (up to 6.34×) with quadrupled instruction execution capability. Additionally, the control mechanism reduces context volume to 18.7% and boosts configuration speed by 3×. Compared with related architectures, VHMP improves CGRCA throughput by an average of 5.3×.