Related Topics

  • Ripple Carry Adder
  • Carry Save Adder

Articles published on Critical path delay

602 Search results
  • Research Article
  • 10.1145/3805806
AI4DSE: Leveraging Dynamic Graph Neural Networks and Large Language Models for Optimizing High-Level Synthesis Design Space Exploration
  • Mar 28, 2026
  • ACM Transactions on Design Automation of Electronic Systems
  • Lei Xu + 3 more

High-Level Synthesis (HLS) Design Space Exploration (DSE) is essential for generating hardware designs that balance performance, power, and area (PPA). To optimize this process, existing work often employs message-passing neural networks (MPNNs) to predict quality of results (QoR). These predictors serve as evaluators in the DSE process, effectively bypassing the time-consuming estimations traditionally required by HLS tools. However, existing models based on MPNNs struggle with over-smoothing and limited expressiveness. Additionally, while meta-heuristic algorithms are widely used in DSE, they typically require extensive domain-specific knowledge to design operators and time-consuming tuning. To address these limitations, we propose ECoGNNs-LLMMHs, a framework that integrates graph neural networks with task-adaptive message passing and large language model-enhanced meta-heuristic algorithms. Compared with state-of-the-art works, ECoGNN exhibits lower prediction error in the post-HLS prediction task, with the error reduced by 57.27%. For post-implementation prediction tasks, ECoGNN demonstrates the lowest prediction errors, with average reductions of 17.6% for flip-flop (FF) usage, 33.7% for critical path (CP) delay, 26.3% for power consumption, 38.3% for digital signal processor (DSP) utilization, and 40.8% for BRAM usage. LLMMH variants can generate superior Pareto fronts compared to meta-heuristic algorithms in terms of average distance from the reference set (ADRS), with an average improvement of 87.77%. Compared with the SOTA DSE approaches GNN-DSE and IRONMAN-PRO, LLMMH can reduce the ADRS by 68.17% and 61.53%, respectively. Code and models are available at https://github.com/wslcccc/CoGNNs_LLMMH.
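
The ADRS metric mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the formulation commonly used in the DSE literature (all objectives minimized, distance measured as the worst relative degradation across objectives); the abstract does not state which variant the authors use, and the function name and sample points are hypothetical.

```python
def adrs(reference, approx):
    """Average distance from reference set (lower is better).

    reference: list of objective tuples forming the reference Pareto set.
    approx:    list of objective tuples found by the explorer under test.
    Assumes every objective is minimized and strictly positive.
    """
    def dist(g, w):
        # Worst relative degradation of w with respect to g, clamped at 0.
        return max(0.0, *((wi - gi) / gi for gi, wi in zip(g, w)))

    # For each reference point, take the closest approximate point.
    return sum(min(dist(g, w) for w in approx) for g in reference) / len(reference)


if __name__ == "__main__":
    ref = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]     # e.g. (latency, area)
    found = [(1.0, 5.0), (2.0, 2.0), (5.0, 1.0)]
    print(adrs(ref, found))
```

An ADRS of 0 means every reference point is matched or dominated by some explored point.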

  • Research Article
  • 10.1038/s41598-026-40524-4
FPGA-based imprecise signed multiplier designs for high-performance image processing applications.
  • Feb 21, 2026
  • Scientific Reports
  • Jaiza Hassan + 3 more

Multiplication is a fundamental mathematical operation that finds extensive applications across various disciplines, particularly in computation-intensive and error-resilient applications, such as image processing. As hardware circuits become more complex, there is a growing demand for approximate circuit methods. Implementation of approximate multipliers has the potential to yield substantial reductions in hardware costs while maintaining acceptable performance levels. Most current designs for approximate multipliers are optimized for ASIC-based circuits, which may not produce similar performance improvements when adapted for FPGA-based circuits. Additionally, many of these existing multiplier designs are limited to unsigned numbers. This paper proposes a novel approach for designing signed approximate multipliers tailored specifically for FPGAs. Two architectures are introduced that efficiently utilize key FPGA components, such as LUTs and Carry4 primitives, by designing the optimal LUT-Carry4 netlists. A Pareto-based analysis is also performed to balance trade-offs and achieve a low mean error distance (MED). Simulation results confirm that the proposed architectures offer superior performance compared to existing signed approximate multipliers, delivering improved power efficiency, reduced resource usage, shorter critical path delay (CPD), and enhanced computational accuracy. The practical applicability of these approximate multipliers is further validated through their use in image processing applications.

  • Research Article
  • 10.1038/s41598-026-38147-w
Efficient computation and design of high speed double precision Vedic multiplier architecture.
  • Feb 5, 2026
  • Scientific Reports
  • Aruru Sai Kumar + 5 more

Efficient multiplication and addition of floating-point numbers play a crucial role in digital signal processing applications. To achieve high computational performance with minimal resource utilization, an optimized multiplication approach is essential. Vedic mathematics encompasses the utilization of 16 sutras or algorithms. This paper presents a double-precision floating-point multiplier of 53-bit mantissa based on Vedic mathematics. The proposed architecture performs multiplication in three stages: sign generation, exponent generation, and mantissa multiplication. The Urdhva Tiryakbhyam sutra is employed for mantissa computation owing to its high efficiency and reduced hardware complexity compared to conventional techniques. The proposed multiplier design is implemented using Verilog HDL on Vivado 2022.2. Experimental results demonstrate a significant reduction in critical path delay and logic utilization compared to existing floating-point and Vedic-based multipliers, while maintaining a favorable power consumption trend. The CNN implementation employing the proposed Vedic double-precision floating-point multiplier achieves the lowest inference latency and power consumption while maintaining identical classification accuracy compared to conventional IEEE-754 and existing Vedic-based multiplier designs. Hardware realization on a Zynq FPGA device further confirms the superiority of the proposed architecture in terms of power, delay and on-board component utilization.
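
The three stages named in the abstract above (sign generation, exponent generation, mantissa multiplication) can be sketched in software. This is only an illustration of the standard IEEE-754 double-precision decomposition, not the paper's hardware: it handles normal numbers only, uses Python's big-integer product in place of a Vedic mantissa multiplier, and relies on Python's integer-to-float conversion for rounding. The function names are hypothetical.

```python
import math
import struct


def decompose(x):
    """Split a double into (sign, biased exponent, 52-bit fraction)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


def fp_multiply(a, b):
    """Multiply two normal doubles in three explicit stages."""
    sa, ea, fa = decompose(a)
    sb, eb, fb = decompose(b)
    if not (0 < ea < 0x7FF and 0 < eb < 0x7FF):
        raise ValueError("sketch handles normal numbers only")

    # Stage 1: sign generation.
    sign = sa ^ sb
    # Stage 2: exponent generation (remove the bias of 1023 from each operand).
    exponent = (ea - 1023) + (eb - 1023)
    # Stage 3: mantissa multiplication with the implicit leading 1 restored
    # (53-bit x 53-bit -> up to 106-bit product).
    product = ((1 << 52) | fa) * ((1 << 52) | fb)

    # Each mantissa carries a 2**-52 scale, hence the extra 2**-104.
    # Assumes the result stays within the normal range.
    magnitude = math.ldexp(float(product), exponent - 104)
    return -magnitude if sign else magnitude


if __name__ == "__main__":
    print(fp_multiply(1.5, 2.25), fp_multiply(-3.0, 0.1))
```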

  • Research Article
  • 10.1088/2631-8695/ae3a3d
Implementation of a high-performance multiply-accumulate unit using DRPPE-enhanced vedic multiplier and flag-driven SPS adder
  • Feb 1, 2026
  • Engineering Research Express
  • Dyana Christilda V + 1 more

This paper proposes a high-performance, low-power Multiply-Accumulate (MAC) architecture leveraging a Dynamic Reconfiguration with Parallel Partial Evaluation (DRPPE) framework integrated with a 16×16 Vedic multiplier and a flag-driven Selective Partial Sum (SPS) adder. The design addresses the critical need for speed, area efficiency, and dynamic power reduction in digital signal processing (DSP), artificial intelligence (AI), and FPGA-based systems. By incorporating Urdhva-Tiryakbhyam Sutra-based Vedic multiplication, the architecture enables fast partial product generation and low-depth logic realization. The novel flag control mechanism intelligently suppresses redundant carry propagation during accumulation, thereby minimizing switching activity and improving energy efficiency. The proposed design was implemented in Verilog HDL and synthesized on a Virtex-5 FPGA using Xilinx ISE 14.7. Simulation results demonstrate a critical path delay of 2.05 ns, maximum operating frequency of 465.1 MHz, and total power consumption of 52.5 mW, significantly outperforming conventional MAC units in speed and resource utilization. Comparative analysis with recent literature confirms the design's effectiveness in delivering accuracy, configurability, and high-throughput performance suitable for next-generation embedded computing platforms.
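
As a rough illustration of the Urdhva-Tiryakbhyam ("vertically and crosswise") idea referenced above, the sketch below forms each output column from all bit products whose indices sum to that column, then propagates a carry to the next column. It is a behavioural model under that textbook description, not the DRPPE architecture from the paper; the function name and the bit-serial column loop are illustrative assumptions (hardware would evaluate the columns in parallel).

```python
def urdhva_multiply(a, b, width=16):
    """Unsigned multiply using column-wise (crosswise) partial products."""
    if not (0 <= a < (1 << width) and 0 <= b < (1 << width)):
        raise ValueError("operands must fit in `width` bits")
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]

    result = 0
    carry = 0
    for k in range(2 * width - 1):
        # Column k collects every a_i & b_j with i + j == k.
        column = carry
        for i in range(max(0, k - width + 1), min(k, width - 1) + 1):
            column += a_bits[i] & b_bits[k - i]
        result |= (column & 1) << k   # low bit stays in this column
        carry = column >> 1           # remainder moves to the next column
    # The final carry becomes the most significant bit(s) of the product.
    return result + (carry << (2 * width - 1))


if __name__ == "__main__":
    print(urdhva_multiply(1234, 5678))
```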

  • Research Article
  • 10.71465/fair632
Graph Neural Networks for Timing Optimization in Advanced Node Placement
  • Jan 31, 2026
  • Frontiers in Artificial Intelligence Research
  • Jiahao Liu + 2 more

The escalating complexity of modern integrated circuit design demands innovative approaches to address timing optimization challenges in advanced technology nodes. Graph Neural Networks (GNNs) have emerged as a transformative paradigm for modeling circuit representations and optimizing placement decisions. This paper presents a comprehensive investigation of GNN applications in timing-driven placement optimization for sub-10nm process technologies. We propose a novel framework that leverages GNN architectures to encode circuit connectivity patterns, predict timing metrics, and guide placement algorithms toward solutions that minimize critical path delays while maintaining acceptable wirelength overhead. Our methodology employs a two-stage GNN model integrating global placement refinement with local timing optimization subroutines. The framework captures spatial dependencies between circuit components through message passing mechanisms while incorporating timing constraints directly into the optimization objective. Experimental evaluations on industry benchmark circuits demonstrate that GNN-based timing optimization achieves 18-24% reduction in worst negative slack compared to conventional analytical placement methods, with runtime improvements of 3-5x over traditional static timing analysis iterations. The proposed approach maintains placement quality metrics including wirelength increase below 7% and demonstrates robust convergence across diverse circuit topologies ranging from processor cores to memory controllers. This research establishes GNNs as viable alternatives to conventional timing-driven placement algorithms and opens new directions for machine learning integration in electronic design automation workflows.

  • Research Article
  • 10.1080/00207217.2026.2616830
Design of an adaptive finite impulse response filter using low error efficient approximate adder and two-stage operand trimming approximate logarithmic multiplier for WSN in IoT
  • Jan 31, 2026
  • International Journal of Electronics
  • Arun Antony V + 1 more

This work concerns digital signal processing, specifically Finite Impulse Response (FIR) filters and their application in Wireless Sensor Networks (WSN) for IoT. FIR filters are essential components in signal processing systems, used for tasks such as noise reduction, signal enhancement and data analysis. Optimising FIR filter designs becomes essential in WSNs, where energy efficiency, precision and real-time processing are important. This paper proposes an adaptive finite impulse response filter using a low error efficient approximate adder and a two-stage operand trimming approximate logarithmic multiplier for WSN in an IoT environment (FIR-LEAA-TOTAM-WSN). The FIR filter leverages the Low Error Efficient Approximate Adder (LEAA) to significantly reduce critical path delay compared to traditional 1-bit full adders. This design effectively reduces noise and interference affecting WSNs, thereby reducing energy consumption and increasing the operational lifespan of WSNs used in IoT applications. The design is implemented in Verilog and synthesized for FPGA with the Xilinx ISE suite. The experimental outcomes show that the HDP-FIR-LEAA-TOTAM-WSN approach attains low delay, higher dynamic power and low energy consumption when compared with existing methods.

  • Research Article
  • Cited by 1
  • 10.3390/electronics15010213
Design and Analysis of a Configurable Dual-Path Huffman-Arithmetic Encoder with Frequency-Based Sorting
  • Jan 2, 2026
  • Electronics
  • Hemanth Chowdary Penumarthi + 2 more

The designs of lossless data compression architectures create a natural trade-off between throughput, power consumption, and compression efficiency, making it difficult for designers to identify an optimal configuration that satisfies all three criteria. This paper proposes a Configurable Dual-Path Huffman/Arithmetic Encoder (CDP-HAE), which offers an architecture that supports shared preprocessing, parallel Huffman and Arithmetic encoding paths, and selectable output. The CDP-HAE’s design prevents the waste of excess bandwidth by sending only one selected bit stream at a time. This also enables adaptation to the dynamically changing statistical characteristics of the input data. CDP-HAE’s architecture underwent ASIC synthesis in 90 nm CMOS technology and is implemented on an Artix-7 (A7-100T) using the Vivado EDA tool, confirming the scalability of the architecture to both devices. Synthesis results show that CDP-HAE improves operating frequency by 28.6% and reduces critical path delay by 27.2% compared to reference designs. Although the dual-path design slightly increases area, the area utilization remains within reasonable limits. Power analysis indicates that optimizing logic sharing and minimizing switching activity reduces total power consumption by 34.4%. Compression tests on application-specific datasets show that the CDP-HAE delivers performance comparable to that of a conventional Huffman encoder, while providing up to 10% improvement in compression ratio over Huffman-only encoding.

  • Research Article
  • 10.1109/tcad.2026.3678497
Critical Path Aware Timing-Driven Global Placement for Large-Scale Heterogeneous FPGAs
  • Jan 1, 2026
  • IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
  • He Jiang + 6 more

Timing optimization during global placement is critical for achieving optimal circuit performance and remains a key challenge in modern Field Programmable Gate Array (FPGA) design. As FPGA designs scale and heterogeneous resources increase, dense interconnects introduce significant resistive and capacitive effects, making timing closure increasingly difficult. Existing methods face challenges in constructing accurate timing models due to multi-factor nonlinear constraints as well as load and crosstalk coupling effects arising in multi-pin driving scenarios. To address these challenges, we propose TD-Placer, a critical path aware, timing-driven global placement framework. It leverages graph-based representations to capture global net interactions and employs a nonlinear model to integrate diverse timing-related features for precise delay prediction, thereby improving the overall placement quality for FPGAs. TD-Placer adopts a quadratic placement objective that minimizes wirelength while incorporating a timing term constructed by a lightweight algorithm, enabling efficient and high-quality timing optimization. Regarding net-level timing contention, it also employs a finer-grained weighting scheme to facilitate smooth reduction of the Critical Path Delay (CPD). Extensive experiments were carried out on seven real-world open-source FPGA projects with LUT counts ranging from 60K to 400K. The results demonstrate that TD-Placer achieves an average ∼10% improvement in Worst Negative Slack (WNS) and a ∼5% reduction in CPD compared to the state-of-the-art method, with an average CPD comparable (×1.01) to the commercial AMD Vivado across five versions (2020.2–2024.2). Its code and dataset are publicly available.
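
For readers unfamiliar with the CPD and WNS metrics quoted above, the following minimal sketch shows how they relate on a toy timing graph. It assumes a heavily simplified model (one clock, a single required time equal to the clock period at every endpoint, setup checks only, fixed edge delays); real static timing analysis and the TD-Placer delay model are far more detailed. All names and the sample graph are hypothetical.

```python
from collections import defaultdict, deque


def arrival_times(edges):
    """Longest-path arrival time per node; edges are (src, dst, delay)."""
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst, delay in edges:
        succ[src].append((dst, delay))
        indegree[dst] += 1
        nodes.update((src, dst))

    arrival = {n: 0.0 for n in nodes}
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:  # process nodes in topological order
        node = queue.popleft()
        visited += 1
        for nxt, delay in succ[node]:
            arrival[nxt] = max(arrival[nxt], arrival[node] + delay)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if visited != len(nodes):
        raise ValueError("timing graph contains a cycle")
    return arrival


def critical_path_delay(edges):
    """CPD: the largest arrival time anywhere in the graph."""
    return max(arrival_times(edges).values())


def worst_slack(edges, clock_period):
    """Worst slack under the single-required-time assumption; negative means a violation."""
    return clock_period - critical_path_delay(edges)


if __name__ == "__main__":
    graph = [("in", "a", 1.0), ("in", "b", 2.0), ("a", "out", 2.5), ("b", "out", 1.0)]
    print(critical_path_delay(graph), worst_slack(graph, clock_period=3.0))
```

Under this model, reducing CPD directly improves the worst slack, which is why the two metrics are usually reported together.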

  • Research Article
  • 10.3103/s1060992x25601691
Leveraging Graph Representations to Enhance Critical Path Delay Prediction in Digital Complex Functional Blocks Using Neural Networks
  • Dec 1, 2025
  • Optical Memory and Neural Networks
  • M Dashiev + 2 more

Accurate critical path delay estimation plays a vital role in reducing unnecessary routing iterations and identifying potentially unsuccessful design runs early in the flow. This study proposes an architecture that integrates graph representations derived from the netlists and design constraints of digital complex functional blocks, leveraging a multi-head cross-attention mechanism. This architecture significantly improves the accuracy of critical path delay estimation compared to the standard tools provided by the OpenROAD EDA flow. The mean absolute percentage error (MAPE) of the standard OpenROAD tool, OpenSTA, is 12.60%, whereas our algorithm achieves a substantially lower error of 7.57%. A comparison of various architectures was conducted, along with an investigation into the impact of incorporating netlist-derived information.

  • Research Article
  • 10.13052/jicts2245-800x.1322
Design of a Low-latency Multi-source Data Scheduling Algorithm for a 5G Environment
  • Nov 25, 2025
  • Journal of ICT Standardization
  • Jiali Zhou + 1 more

To address the high delay and low resource allocation efficiency of multi-source heterogeneous data task scheduling in 5G edge computing environments, this paper designs a multi-source data scheduling algorithm framework for low-latency optimization. An end-edge-cloud cooperative system model is constructed, and a set of dynamic priority scheduling strategies is proposed that uses the task’s directed acyclic graph (DAG) to express inter-task data dependencies; the task scheduling order is adjusted in real time by fusing task tightness and urgency, resource pressure, and network state changes. To improve the stability of the system under high load, a multi-dimensional load evaluation mechanism and a granularity-adaptive task partitioning and merging method are introduced, and a cache hit-aware resource allocation function and an edge node cache replacement strategy are designed. In addition, a QoS guarantee mechanism and a network state-aware feedback module are constructed to realize dynamic correction of task scheduling accuracy under delay constraints. Multiple rounds of comparison experiments are carried out on a simulation platform, and the results show that the proposed algorithm can control the average task completion delay within 45 ms under medium-high load conditions, significantly reducing the critical path delay, stabilizing the QoS compliance rate to more than 94%, increasing the resource utilization rate to 87.5%, and achieving a scheduling hit rate of 92.4%. These results verify the algorithm’s low-latency control capability and system resource synergy in dynamic task environments, and indicate good engineering adaptability for edge intelligent application deployments with high real-time requirements in 5G scenarios.

  • Research Article
  • Cited by 1
  • 10.1038/s41598-025-25239-2
Design of high-performance, accurate, and approximate Dadda-tree multipliers for image processing applications
  • Nov 21, 2025
  • Scientific Reports
  • Aqib Amin Rather + 3 more

Approximate computing comes to the fore as an alternative paradigm to enhance efficiency in computing systems by trading off the system’s accuracy for better performance. This paper seeks to leverage the principles of approximate computing to design efficient multiplier architectures for FPGA platforms. Specifically, this work presents FPGA implementations of one accurate and two approximate multiplier units based on the Dadda algorithm. The multipliers employ a novel partial product reduction technique that minimizes the utilized resources and the critical path delay, offering a more resource-efficient alternative than traditional multipliers. Our accurate and best-performing approximate 8 × 8 multiplier shows an improvement of 28% and 37% in PDAP over the Xilinx exact multiplier and the most performance-efficient existing approximate multiplier, respectively. Further evaluation based on the processing of images with different modalities shows a substantial improvement in PSNR over the existing approximate multipliers, especially in the healthcare domain, thereby highlighting the possible application of the proposed multipliers in error-resilient medical imaging tasks.
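
As background for the Dadda-tree multipliers above, the sketch below computes the per-stage target heights of the classic Dadda reduction schedule (d_1 = 2, d_{j+1} = floor(1.5 · d_j)). It describes the textbook algorithm only; the abstract's "novel partial product reduction technique" is not specified there and is not modelled here. The function name is hypothetical.

```python
def dadda_targets(n):
    """Target column heights for each reduction stage of an n x n Dadda tree.

    The partial-product matrix starts with a maximum column height of n and
    is reduced stage by stage until every column holds at most 2 bits.
    """
    if n < 1:
        raise ValueError("operand width must be positive")
    sequence = [2]
    while sequence[-1] * 3 // 2 < n:
        sequence.append(sequence[-1] * 3 // 2)
    # Only heights strictly below n are useful targets; apply the largest first.
    return [d for d in reversed(sequence) if d < n]


if __name__ == "__main__":
    print(dadda_targets(8))
```

For an 8 × 8 multiplier this gives four reduction stages before the final carry-propagate addition.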

  • Research Article
  • 10.1145/3769303
DPTM: An Adaptive Scheduler Design Utilizing Timeslot Matching and Release Methods for Concurrent and Multi-task Interleaved Pipelining-oriented CGRA
  • Nov 11, 2025
  • ACM Transactions on Design Automation of Electronic Systems
  • Danping Jiang + 4 more

Coarse-grained reconfigurable architectures (CGRAs) are increasingly employed as domain-specific accelerators due to their efficiency and flexibility. However, the existing CGRA architectures suffer from low hardware resource utilization and performance due to the limitations of the scheduling scheme. In this article, an adaptive scheduler (denoted as DPTM) for concurrent and multi-task interleaved pipelining-oriented CGRA is introduced, which exploits timeslot matching and release methods to avoid the pipeline conflicts and improve the scheduling performance. The characteristics of task scheduling based on directed acyclic graph (DAG) are analyzed, and several performance-influencing factors are extracted to build a scheduling performance model for reducing the time cost of scheduling and guiding the design of scheduling schemes. Moreover, the scoreboard method of dynamic instruction schedulers is optimized to control the entry time of multiple tasks into the pipeline, and then a timeslot matching method is proposed to provide non-conflict pipelining for the multiple tasks. Further, a timeslot release method is presented to release the timeslots for unscheduled sub-tasks dynamically, which can adapt the parallel processing of multiple tasks and decrease the scheduling time. Then, an adaptive scheduling scheme combines the dynamic priority-based task assignment method, timeslot matching method, and timeslot release method to schedule massive tasks for CGRA. Finally, the overall architecture of DPTM is introduced and designed to validate the efficacy of the proposed scheduling scheme. 
Experimental results show that the proposed timeslot matching/release approach reduces total scheduling time by up to 84% and average scheduling time by up to 40% compared to non-timeslot-matching scheduling schemes; the proposed task assignment approach decreases total scheduling time by 8% and average scheduling time by 3% compared to existing approaches; and the proposed scheduler decreases critical path delay by up to 51%, area overhead by up to 35%, and power consumption by up to 12% compared with existing schedulers.

  • Research Article
  • 10.48084/etasr.12069
Pipelined Diagonal Matrix Codes for Error Correction in Embedded Memories
  • Oct 6, 2025
  • Engineering, Technology & Applied Science Research
  • C H Kavya + 1 more

Semiconductor memories are the basic storage elements for advanced FPGAs. However, with technology scaling due to high packing densities, the temperature of the device rises drastically, creating temporary or permanent faults that manifest as errors in the stored data. Permanent errors cannot be corrected, but temporary errors can. If the data in the memory is critical, such as data used during satellite or missile launch, patient data, etc., there is a need for Error-Detecting and Correcting (EDAC) code. Memories are represented as a matrix that stores data in rows. EDAC codes correct random (errors at various distributed locations) and burst errors (a sequence of erroneous bits within a row). The Hamming code represents the basis for any EDAC code. This work focuses on a single code used to identify 8-bit erroneous data and correct for 6 and 7 random bit errors and 8 burst errors. The matrix code utilizes a memory representation and Hamming code to detect and correct errors, taking care to increase the code rate with less area and delay. In addition, a pipelining technique is used to reduce power dissipation, which also helps to increase the speed of the design. The codes were modeled in Verilog HDL and verified for the Zynq 7000 series FPGA using Xilinx Vivado 2023.2. The results were verified for technological parameters, such as area in terms of LUTs, critical path delay, power dissipation, etc., and for non-technological parameters such as code rate, bit overhead, detection capability, correction capability, etc. The proposed pipelined matrix code was better in most aspects compared to existing designs.
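
The abstract above builds on the Hamming code. As a reminder of how that building block works, here is a minimal sketch of the textbook Hamming(7,4) code, which corrects a single flipped bit per codeword. It is not the pipelined diagonal matrix code proposed in the paper, and the function names are illustrative.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into 7 bits (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data bits, syndrome); a non-zero syndrome is the 1-based error position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                       # inject a single-bit error at position 5
    print(hamming74_correct(word))
```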

  • Research Article
  • Cited by 1
  • 10.1088/2631-8695/ade6c9
High-performance approximate multiplier design for FPGA platforms
  • Jul 1, 2025
  • Engineering Research Express
  • Mohsin Shah + 2 more

Approximate computing represents a computational paradigm that trades off a slight reduction in accuracy for significant performance improvements. One of the fundamental operations that can leverage approximate techniques is multiplication, which is used substantially in applications like image/video processing and machine learning. This work proposes an approximate 8-bit multiplier design for FPGA-based circuits. This multiplier, by exploiting the FPGA primitives, demonstrates excellent performance regarding error metrics, critical path delay, and power dissipation with minimal LUT utilization. More precisely, the proposed design reduces LUT usage by 43% and PDP by 59% compared to the exact multiplier while incurring a mean error distance of only 102.57. The proposed approximate multiplier is used in two image processing applications to assess the actual advantages in real-world applications. The proposed design achieves a reasonable PSNR in the image processing flow, demonstrating high-quality results with a low error rate.
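
The mean error distance (MED) quoted above is the average absolute difference between exact and approximate products over all input pairs. The sketch below evaluates it exhaustively for an 8-bit unsigned multiplier. The `truncated_mul` function is a hypothetical stand-in (it simply zeroes the low operand bits) and is not the design from the paper, so its MED will not match the reported 102.57.

```python
def mean_error_distance(approx_mul, bits=8):
    """Exhaustive MED of an unsigned `bits` x `bits` approximate multiplier."""
    n = 1 << bits
    total = sum(abs(a * b - approx_mul(a, b)) for a in range(n) for b in range(n))
    return total / (n * n)


def truncated_mul(a, b, drop=2):
    """Hypothetical approximate multiplier: ignore the `drop` low bits of each operand."""
    mask = ~((1 << drop) - 1)
    return (a & mask) * (b & mask)


if __name__ == "__main__":
    print(mean_error_distance(truncated_mul))
```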

  • Research Article
  • 10.11591/ijra.v14i2.pp204-213
Designing high power efficient finite impulse response filters with three-four inexact adder-integrated Booth multiplier
  • Jun 1, 2025
  • IAES International Journal of Robotics and Automation (IJRA)
  • Manju Inasu Kollannur + 1 more

Finite impulse response (FIR) filters are widely utilized in several applications in digital signal processing, including data transmission, photography, digital audio, and biomedicine. It is necessary to use high sample rates for FIR filters, while moderate sample rates are needed for low-power circuits. To solve these problems, a Booth multiplier based on three-four inexact adder-based multiplication (TFIE-BM) was proposed. The goal of the proposed TFIE-based FIR Booth multiplier is to lower area usage, latency, and power consumption. The proposed method utilizes the spotted hyena optimizer (SHO) to find the optimal filter coefficient (FC) by minimizing the pass power consumption and transition bandwidth. Moreover, a high-performance three-four inexact adder (TFIE adder) has been introduced, which uses fewer XOR gates for sum and carry generation, simplifying the logic to reduce hardware complexity. A modified Booth multiplier that uses a 5:2 compressor is introduced to increase speed and decrease the FIR filter's critical path delay. The overall delay of the proposed approach is 23.4%, 18.7%, 12.3%, and 5.7% lower than that of the Radix-4 Booth multiplier, CSA Booth multiplier, hybrid multiplier, and traditional Booth multiplier, respectively.

  • Research Article
  • 10.1145/3737291
R-LUT: A Reduced LUT Architecture with Fine-Grained Scalability and its Automatic Design Flow for Large Frequent Functions
  • May 23, 2025
  • ACM Transactions on Reconfigurable Technology and Systems
  • Moucheng Yang + 3 more

As technology scaling exacerbates interconnect resistance in advanced nodes, FPGA architectures demand enhanced programmable logic blocks (PLBs) to minimize global metal routing. However, it is expensive to raise the functionality of LUTs due to exponential area growth with the number of inputs, resulting in poor scalability. Moreover, LUTs are redundant since practical functions in real-world benchmarks only account for an extremely small proportion of all the functions. For example, only 16424 out of more than 100 trillion NPN classes of 6-input functions are used in the mapped netlists of the VTR8 and KOIOS benchmarks. Therefore, we propose a reduced LUT architecture, named RLUT, to efficiently implement most of the frequent functions. The compact structure of the MUX tree in LUTs is preserved and reduced, while the reduced programmable bits are connected to the MUX tree according to the bit assignment generated automatically by the proposed algorithms. Results of evaluations by a full EDA flow show that, compared with the modified Stratix10 baseline, the proposed 8-input PLB with 75 SRAM bits, named Dual-RLUT6, reduces the maximum logic levels significantly by 20.85%, while the critical path delay is improved by 10.11% at the cost of 4.65% area overhead.

  • Research Article
  • Cited by 1
  • 10.3390/electronics14112122
Accelerating CRYSTALS-Kyber: High-Speed NTT Design with Optimized Pipelining and Modular Reduction
  • May 23, 2025
  • Electronics
  • Omar S Sonbul + 2 more

The Number Theoretic Transform (NTT) is a cornerstone for efficient polynomial multiplication, which is fundamental to lattice-based cryptographic algorithms such as CRYSTALS-Kyber, a leading candidate in post-quantum cryptography (PQC). However, existing NTT accelerators often rely on integer multiplier-based modular reduction techniques, such as Barrett or Montgomery reduction, which introduce significant computational overhead and hardware resource consumption. These accelerators also lack optimization in unified architectures for forward (FNTT) and inverse (INTT) transformations. Addressing these research gaps, this paper introduces a novel, high-speed NTT accelerator tailored specifically for CRYSTALS-Kyber. The proposed design employs an innovative shift-add modular reduction mechanism, eliminating the need for integer multipliers, thereby reducing critical path delay and enhancing circuit frequency. A unified pipelined butterfly unit, capable of performing FNTT and INTT operations through Cooley–Tukey and Gentleman–Sande configurations, is integrated into the architecture. Additionally, a highly efficient data handling mechanism based on register banks supports seamless memory access, ensuring continuous and parallel processing. The complete architecture, implemented in Verilog HDL, has been evaluated on FPGA platforms (Virtex-5, Virtex-6, and Virtex-7). Post place-and-route results demonstrate a maximum operating frequency of 261 MHz on Virtex-7, achieving a throughput of 290.69 Kbps, which is 1.45× and 1.24× higher than its performance on Virtex-5 and Virtex-6, respectively. Furthermore, the design boasts an impressive throughput-per-slice metric of 111.63, underscoring its resource efficiency. With a 1.27× reduction in computation time compared to state-of-the-art single butterfly unit-based NTT accelerators, this work establishes a new benchmark in advancing secure and scalable cryptographic hardware solutions.
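
To make the idea of multiplier-free modular reduction concrete, the sketch below reduces values modulo the Kyber prime q = 3329 using only shifts, additions, and subtractions, based on the identity 2^12 ≡ 2^9 + 2^8 − 1 (mod 3329). It is one generic way to do this and is an assumption on my part; the abstract does not describe the paper's actual reduction circuit. The two butterfly helpers follow the standard Cooley–Tukey and Gentleman–Sande forms, and all function names are hypothetical.

```python
Q = 3329  # CRYSTALS-Kyber modulus


def shift_add_reduce(x):
    """Reduce a non-negative integer modulo Q without multiplication or division."""
    if x < 0:
        raise ValueError("expects a non-negative value")
    # Fold the bits above 2**12 back in: hi * 4096 == hi * 767 (mod Q),
    # and 767 = 2**9 + 2**8 - 1. Hardware would unroll this into a few fixed steps.
    while x >= 4096:
        hi, lo = x >> 12, x & 0xFFF
        x = (hi << 9) + (hi << 8) - hi + lo
    return x - Q if x >= Q else x


def ct_butterfly(a, b, w):
    """Cooley-Tukey butterfly (forward NTT): (a + w*b, a - w*b) mod Q."""
    t = shift_add_reduce(w * b)
    return shift_add_reduce(a + t), shift_add_reduce(a - t + Q)


def gs_butterfly(a, b, w):
    """Gentleman-Sande butterfly (inverse NTT): (a + b, (a - b)*w) mod Q."""
    return shift_add_reduce(a + b), shift_add_reduce((a - b + Q) * w)


if __name__ == "__main__":
    print(shift_add_reduce(3328 * 3328), ct_butterfly(1234, 567, 17))
```

The `w * b` products in the butterflies still use a multiplier; only the reduction step is multiplier-free in this sketch.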

  • Research Article
  • Cite count: 13
  • 10.1109/tnnls.2024.3425569
Digit-Serial DA-Based Fixed-Point RNNs: A Unified Approach for Enhancing Architectural Efficiency.
  • May 1, 2025
  • IEEE transactions on neural networks and learning systems
  • Mohd Tasleem Khan + 1 more

The next crucial step in artificial intelligence involves integrating neural network models into embedded and mobile systems. This requires designing compact and energy-efficient neural network models in silicon for optimized performance. This article introduces a unified approach for enhancing the architectural efficiency of long short-term memory (LSTM) recurrent neural networks (RNNs). Specifically, two new structures (I and II) based on the two's complement (TC) digit-serial distributed arithmetic (DSDA) technique are presented. The block-circulant matrix-vector multiplications (MVMs) and element-wise multiplications (EWMs) are formulated using TC DSDA. In addition, a fixed-point (FxP) training procedure for quantized LSTM RNNs is considered and validated for speech recognition tasks. Both structures leverage the circular rotation of weights and generate partial products with input digit slices. A new partial-product generator (PPG) and partial-product selector (PPS), designed to work with both unsigned and signed digits, are introduced. In Structure I, a nonpipelined MVM is realized with a few PPGs and PPSs, followed by a shift-accumulate unit (SAU). Conversely, in Structure II, a suitably chosen depth-pipelined MVM is achieved with multiple PPGs and PPSs, followed by a shift-to-add tree (SAT). A critical path delay (CPD) analysis for both the proposed structures is also presented. Compared with previous works, post-synthesis results on 28-nm fully depleted silicon-on-insulator (FDSOI) technology reveal that for a model size of $128 \times 128$, Structures I and II provide 39.87%, 95.63%, and 30.95%, 91.18% more area and energy efficiencies, respectively.
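For readers unfamiliar with distributed arithmetic, the following is a minimal software sketch of the general technique for a two's-complement dot product. It is a simplification under stated assumptions, not the structures proposed in the article: it uses a digit size of one bit (plain bit-serial DA), a full lookup table rather than PPG/PPS units, and the hypothetical name `da_dot`.

```python
def da_dot(weights, xs, bits=8):
    """Bit-serial distributed-arithmetic dot product.

    Assumes every x fits in a signed `bits`-bit two's-complement word.
    """
    n = len(weights)
    # LUT entry at address `addr` is the sum of the weights whose bit is set in addr.
    lut = [sum(w for k, w in enumerate(weights) if (addr >> k) & 1)
           for addr in range(1 << n)]
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):
        # Gather bit b of every input into a single LUT address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= (((x & mask) >> b) & 1) << k
        term = lut[addr] << b
        # In two's complement the most significant bit has weight -2**(bits-1).
        acc += -term if b == bits - 1 else term
    return acc


if __name__ == "__main__":
    w = [3, -5, 7, 2]
    x = [10, -20, 127, -128]
    print(da_dot(w, x), sum(wi * xi for wi, xi in zip(w, x)))
```

The multiplications are replaced by table lookups, shifts and accumulations, one input bit position per iteration; a digit-serial variant would process several bit positions per step.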

  • Research Article
  • Cite count: 1
  • 10.1109/tvlsi.2025.3529690
An Area-Efficient VLSI Architecture for High-Throughput Computation of the 2-D DWT
  • May 1, 2025
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  • Yuzhou Dai + 5 more

In this article, an area-efficient VLSI architecture for high-throughput computation of the 2-D discrete wavelet transform (DWT) is proposed and applied to aircraft cargo hold scenes. The proposed architecture aims to reduce computation and storage resources while maintaining the DWT-IDWT reconstructed image quality for the 9/7 discrete wavelet. The hardware implementation formulae based on the flipping architecture have been modified to reduce RAM storage bit width. By transforming the coefficients of the formula into hardware-friendly values, the required multiplication operations are split into two stages of addition. On this basis, a pipelined architecture is constructed whose critical path delay (CPD) is close to the delay of a single adder, $T_{a}$, thereby achieving high throughput. Compared to existing architectures in the research field, the proposed single-level 2-D DWT architecture achieves resource savings on the field-programmable gate array (FPGA) platform while ensuring good image reconstruction quality. The advantages of the multilevel 2-D DWT are even more pronounced. In the simulation results on the application-specific integrated circuit (ASIC) platform, the proposed architecture reduces computation time by at least 35.54% while achieving a higher level of decomposition, decreases the area-delay product (ADP) by at least 25.41%, and saves a significant amount of energy per image (EPI). Furthermore, the proposed folded architecture achieves close to 100% hardware utilization efficiency (HUE) in multilevel 2-D DWT computations.

  • Research Article
  • Cite count: 1
  • 10.1142/s0218126625503232
Design, Verification and Hardware Implementation of 4kB Synchronous FIFO for Automatic Vehicle Communication Applications
  • Apr 30, 2025
  • Journal of Circuits, Systems and Computers
  • Ribu Mathew + 2 more

Sustainability in in-vehicle communication comprises the integration of wireless technology for faster and more reliable connectivity. Driverless vehicles need to schedule various tasks on a priority basis. The first-in-first-out (FIFO) data structure is critical for handling sensor data and communication information, and a memory queue is important for controlling data flow through organized read and write processes. This study focuses on the design, verification and synthesis of a FIFO with a size of 4k bytes. By operating on a single clock signal, synchronous FIFO enables smooth data transmission between a source and destination within the same clock domain. The designed FIFO resulted in an area of 41.401 mm², a power dissipation of 0.506 mW and a critical path delay of 0.03 ps when a 15 nm open-cell library from Nangate was used. The design is also implemented using an Arria II field programmable gate array (FPGA), which results in a maximum clock frequency of 33.33 GHz.
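The read and write behaviour of a synchronous FIFO can be sketched with a small cycle-level software model. This is an illustrative model only, not the study's RTL: the class name `SyncFifo`, the 8-bit data width and the 4096-entry default depth (one way to reach 4 kB) are assumptions, as is the choice to evaluate the full and empty flags on the state before each clock edge.

```python
class SyncFifo:
    """Cycle-level model of a single-clock FIFO with full/empty flags."""

    def __init__(self, depth: int = 4096):
        # Assumed organisation: `depth` entries of 8 bits each.
        self.depth = depth
        self.mem = [0] * depth
        self.wr_ptr = 0
        self.rd_ptr = 0
        self.count = 0

    @property
    def empty(self) -> bool:
        return self.count == 0

    @property
    def full(self) -> bool:
        return self.count == self.depth

    def tick(self, wr_en: bool = False, din: int = 0, rd_en: bool = False):
        """Advance one clock edge; return the byte read, or None if no read occurred."""
        # Flags are sampled before the edge, as registered flags would be.
        do_rd = rd_en and not self.empty
        do_wr = wr_en and not self.full
        dout = None
        if do_rd:
            dout = self.mem[self.rd_ptr]
            self.rd_ptr = (self.rd_ptr + 1) % self.depth
        if do_wr:
            self.mem[self.wr_ptr] = din & 0xFF
            self.wr_ptr = (self.wr_ptr + 1) % self.depth
        self.count += int(do_wr) - int(do_rd)
        return dout


if __name__ == "__main__":
    fifo = SyncFifo(depth=4)
    for value in (1, 2, 3, 4):
        fifo.tick(wr_en=True, din=value)
    print(fifo.full, [fifo.tick(rd_en=True) for _ in range(4)], fifo.empty)
```

Because both pointers advance on the same clock, no synchronizers are needed between the read and write sides, which is the property the abstract attributes to operating within a single clock domain.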
