Related Topics

  • Loop Tiling
  • Loop Transformations
  • Compiler Optimizations
  • Loop Nests

Articles published on Loop Unrolling

216 Search results
  • Research Article
  • 10.47760/ijcsmc.2026.v15i03.022
Optimization Techniques for Matrix Multiplication Kernels in Linear Algebra Libraries: A CPU-Focused Approach
  • Mar 30, 2026
  • International Journal of Computer Science and Mobile Computing
  • Rajalakshmi Srinivasaraghavan

Matrix multiplication is a fundamental operation in linear algebra libraries, serving as the computational backbone for scientific computing, machine learning, and data analytics applications. This paper presents a comprehensive analysis of optimization techniques for General Matrix Multiply (GEMM) operations on modern CPU architectures. We examine six core optimization strategies: loop unrolling, vectorization using SIMD instructions, cache blocking, matrix packing, hardware-specific acceleration (including Intel AMX and IBM Power10 MMA), and prefix caching. Additionally, we discuss specialized optimizations for matrix-vector multiplication (GEMV) when matrix dimensions reduce to vectors. Our experimental results demonstrate performance improvements ranging from 2x to 24x over naive implementations, achieving 73-78% of theoretical peak performance across IBM Power architectures. The paper provides detailed implementation examples, particularly highlighting IBM Power10 Matrix-Multiply Assist (MMA) instructions based on OpenBLAS implementations.
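
As an illustration of the first technique named above, the following is a minimal sketch of loop unrolling applied to the inner loop of a matrix multiply. It is not taken from the paper or from OpenBLAS; the function names `gemm_naive` and `gemm_unrolled4` and the unroll factor of 4 are assumptions chosen for the example.

```c
#include <stddef.h>

/* Naive reference: C = A * B for n x n row-major matrices. */
void gemm_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Same computation with the inner k-loop unrolled by 4 (hypothetical
 * factor). Four independent accumulators shorten the chain of dependent
 * additions; a remainder loop handles n not divisible by 4. */
void gemm_unrolled4(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            size_t k = 0;
            for (; k + 4 <= n; k += 4) {
                s0 += a[i * n + k]     * b[k * n + j];
                s1 += a[i * n + k + 1] * b[(k + 1) * n + j];
                s2 += a[i * n + k + 2] * b[(k + 2) * n + j];
                s3 += a[i * n + k + 3] * b[(k + 3) * n + j];
            }
            double sum = (s0 + s1) + (s2 + s3);
            for (; k < n; k++)          /* remainder iterations */
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

Because the unrolled version adds the partial sums in a different order, its results can differ from the naive version in the last bits for general floating-point inputs; for small integer-valued inputs the two agree exactly.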

  • Research Article
  • 10.15276/imms.v16.no1.43
Problems of automatic code optimization by the compiler
  • Dec 31, 2025
  • Informatics and mathematical methods in simulation
  • I Zhulkovska + 4 more

Rational use of modern compiler capabilities, in particular automatic SIMD vectorization, enables significant improvements in computational performance for tasks involving data array processing and computer modeling of complex processes and systems. The growing demand for software performance in scientific computing, big-data analysis, artificial intelligence, and machine learning emphasizes the importance of exploiting hardware-level data parallelism. This study investigates the efficiency of automatic SIMD vectorization provided by the Microsoft Visual C++ compiler in comparison with manual optimization implemented through AVX2 instructions. To evaluate performance, three implementations were developed: a scalar baseline version, a compiler-optimized automatic SIMD code, and a manually vectorized SIMD version using intrinsic functions. Computational experiments were conducted using the SAXPY operation for arrays sized from 10^5 to 10^9. The results demonstrated that automatic SIMD vectorization provides up to a 7.5x speedup with an efficiency of 0.94 for small- and medium-scale problems, effectively utilizing processor resources through aggressive optimizations such as loop unrolling and efficient use of FMA pipelines. Manual SIMD optimization showed stable acceleration of up to 3. for large arrays but with lower efficiency (0.28-0.49) due to memory-bandwidth limitations and less aggressive compiler-level transformations. The comparison revealed that automatic methods are more convenient for developers, significantly reducing the effort required for writing SIMD code, while manual optimizations remain relevant when scaling to large data volumes. The findings indicate that the optimal strategy is a combined use of automatic and manual SIMD transformations, allowing a balance between performance, accuracy, and development effort, thus ensuring both efficiency and scalability of software solutions in high-performance computing and computer modeling. Future research will focus on expanding the experimental base across various processor architectures, analyzing the interaction of SIMD vectorization with other compiler transformations, and applying ML-based methods for adaptive optimization-strategy selection.
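
For orientation, a minimal sketch of the scalar SAXPY baseline (y = a*x + y) that such a comparison starts from is given below. The abstract does not define "efficiency"; the helper `simd_efficiency` assumes it means speedup divided by the number of SIMD lanes, which is an assumption of this sketch, as are both function names.

```c
#include <stddef.h>

/* Scalar SAXPY baseline: y[i] = a * x[i] + y[i].
 * The restrict qualifiers tell the compiler that x and y do not alias,
 * which makes the loop easier to auto-vectorize. */
void saxpy_scalar(size_t n, float a, const float *restrict x,
                  float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Assumed definition (not stated in the abstract): efficiency is the
 * measured speedup divided by the number of SIMD lanes. */
double simd_efficiency(double speedup, int lanes)
{
    return speedup / (double)lanes;
}
```

Under that assumed definition, a 256-bit AVX2 register holding 8 single-precision values gives 7.5 / 8 = 0.9375, which rounds to the reported efficiency of 0.94.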

  • Research Article
  • 10.3390/computers15010003
Design of an Energy-Efficient SHA-3 Accelerator on Artix-7 FPGA for Secure Network Applications
  • Dec 21, 2025
  • Computers
  • Abdulmunem A Abdulsamad + 1 more

As the demand for secure communication and data integrity in embedded and networked systems continues to grow, there is an increasing need for cryptographic solutions that provide robust security while efficiently using energy and hardware resources. Although software-based implementations of SHA-3 provide design flexibility, they often struggle to meet the performance and power limitations of constrained environments. This study introduces a hardware-accelerated SHA-3 solution tailored for the Xilinx Artix-7 FPGA. The architecture includes a fully pipelined Keccak-f[1600] core and incorporates design strategies such as selective loop unrolling, clock gating, and pipeline balancing to enhance overall efficiency. Developed in VHDL and synthesised using Vivado 2024.2.2, the design achieves a throughput of 1.35 Gbps at 210 MHz, with a power consumption of 0.94 W, yielding an energy efficiency of 1.44 Gbps/W. Validation using NIST SHA-3 vectors confirms its reliable performance, making it a promising candidate for secure embedded systems, including IoT platforms, edge devices, and real-time authentication applications.

  • Research Article
  • 10.48084/etasr.11931
High Clarity, Low Power: Achieving a 43× Speed Boost in Image Defogging Using FPGA Acceleration
  • Oct 6, 2025
  • Engineering, Technology & Applied Science Research
  • Van Khoa Pham + 1 more

This study presents a Field-Programmable Gate Array (FPGA) accelerator designed for real-time image defogging at the edge, achieving high throughput and low power consumption. The design adapts the Dark Channel Prior (DCP) algorithm for hardware implementation using High-Level Synthesis (HLS) and incorporates advanced optimizations such as pipelining, loop unrolling, and dataflow control to enhance processing on resource-constrained devices. Implemented on a PYNQ-Z2 board running at just 100 MHz, the system achieves a remarkable 12.06 Frames per Second (FPS), surpassing a 2 GHz ARM processor by over 43× in speed. Power measurements show a low power consumption of 0.79 W, translating to a 153× improvement in energy efficiency (FPS/W) compared to an ARM-based software implementation. The proposed accelerator introduces a 4.7% pixel-level error, primarily affecting brightness consistency; nonetheless, it significantly outperforms processor-based approaches in both latency and power consumption. By demonstrating how FPGAs can sustain high-clarity image enhancement at the network edge, this work lays the groundwork for deployment in autonomous vehicles, remote surveillance systems, and environmental monitoring platforms where robust, low-power vision processing is critical.

  • Research Article
  • 10.30837/2522-9818.2025.3.189
Optimization of software code for high-level synthesis during hardware implementation of the computationally-loaded algorithms
  • Sep 25, 2025
  • INNOVATIVE TECHNOLOGIES AND SCIENTIFIC SOLUTIONS FOR INDUSTRIES
  • Oleksandr Shkil + 4 more

The subject matter of the work is the impact of code optimization methods for computationally intensive algorithms, used in digital signal processing, on hardware costs and performance when implemented on different platforms. The goal of the work is to conduct a comparative analysis of the impact of three C-code optimization approaches (loop unrolling, switching to fixed-point arithmetic, and their combination) on performance and hardware costs when implementing matrix multiplication, fast Fourier transform, and wavelet transform algorithms using high-level synthesis (HLS) tools on system-on-chip (SoC) platforms, personal computers (PCs), and single-board computers. The following tasks were solved in the article: implementation of computationally intensive algorithms on selected hardware platforms using HLS; comparison of the execution time of algorithms with and without different optimization methods; comparison of hardware costs for the algorithms' implementations with and without different variants of optimization; formulation of conclusions about the impact of different C-code optimization methods on performance and hardware costs on different target platforms. The following methods were used: C/C++ code optimization methods, diagnostic experiments using high-level synthesis tools to implement digital signal processing algorithms on the selected hardware platform, and statistical data collection using Python. The following results were obtained: for algorithms based on arithmetic operations, code optimization provided up to a 30% reduction in execution time on ARM platforms. For algorithms based on the Fourier transform, complex optimization reduced execution time by up to 90% on processor devices. For programmable logic (FPGA), none of the optimization methods provided a significant execution acceleration. However, the transition to fixed-point arithmetic reduced hardware costs by 40–80% regardless of the algorithm type. Conclusions: the choice of a C code optimization strategy significantly impacts the efficiency of algorithm implementation on processor architectures. In contrast, for FPGAs, optimizing the data types used plays a key role.
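
To make the two optimizations concrete, here is a minimal sketch of fixed-point arithmetic combined with a fully unrolled loop. The Q16.16 format, the type name `q16_16`, and all function names are assumptions made for illustration; the abstract does not say which fixed-point format or code the authors used.

```c
#include <stdint.h>

#define Q_FRAC_BITS 16            /* hypothetical Q16.16 format */

typedef int32_t q16_16;

/* Convert between double and Q16.16 (truncating toward zero). */
q16_16 q_from_double(double x)
{
    return (q16_16)(x * (double)(1 << Q_FRAC_BITS));
}

double q_to_double(q16_16 q)
{
    return (double)q / (double)(1 << Q_FRAC_BITS);
}

/* Fixed-point multiply: widen to 64 bits, then drop the extra
 * fractional bits. Right-shifting a negative value is
 * implementation-defined in C (arithmetic on common compilers). */
q16_16 q_mul(q16_16 a, q16_16 b)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    return (q16_16)(wide >> Q_FRAC_BITS);
}

/* 4-element dot product with the loop fully unrolled. Products are
 * accumulated at 64-bit width and rescaled once at the end. */
q16_16 q_dot4(const q16_16 *a, const q16_16 *b)
{
    int64_t acc = 0;
    acc += (int64_t)a[0] * (int64_t)b[0];
    acc += (int64_t)a[1] * (int64_t)b[1];
    acc += (int64_t)a[2] * (int64_t)b[2];
    acc += (int64_t)a[3] * (int64_t)b[3];
    return (q16_16)(acc >> Q_FRAC_BITS);
}
```

Integer multiplies and shifts of this kind map to far less FPGA logic than floating-point units, which is in line with the hardware-cost reduction the abstract reports for fixed-point arithmetic.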

  • Research Article
  • Cite count: 1
  • 10.3390/electronics14173457
Design of Real-Time Gesture Recognition with Convolutional Neural Networks on a Low-End FPGA
  • Aug 29, 2025
  • Electronics
  • Rui Policarpo Duarte + 4 more

Hand gesture recognition is used in human–computer interaction, with multiple applications in assistive technologies, virtual reality, and smart systems. While vision-based methods are commonly employed, they are often computationally intensive, sensitive to environmental conditions, and raise privacy concerns. This work proposes a hardware/software co-optimized system for real-time hand gesture recognition using accelerometer data, designed for a portable, low-cost platform. A Convolutional Neural Network from TinyML is implemented on a Xilinx Zynq-7000 SoC-FPGA, utilizing fixed-point arithmetic to minimize computational complexity while maintaining classification accuracy. Additionally, combined architectural optimizations, including pipelining and loop unrolling, are applied to enhance processing efficiency. The final system achieves a 62× speedup over an unoptimized floating-point implementation while reducing power consumption, making it suitable for embedded and battery-powered applications.

  • Research Article
  • Cite count: 4
  • 10.7717/peerj-cs.3077
Hardware implementation of FPGA-based spiking attention neural network accelerator
  • Aug 5, 2025
  • PeerJ Computer Science
  • Shiyong Geng + 5 more

Spiking neural networks (SNNs) are recognized as third-generation neural networks and have garnered significant attention due to their biological plausibility and energy efficiency. To address the resource constraints associated with using field programmable gate arrays (FPGAs) for numerical recognition in SNNs, we proposed a lightweight spiking efficient attention neural network (SeaSNN) accelerator. We designed a simple, four-layer structured network, achieving a recognition accuracy of 93.73% through software testing on the MNIST dataset. To further enhance the model’s accuracy, we developed a highly spiking efficient channel attention mechanism (SECA), resulting in a significant performance improvement and an increase in test accuracy to 94.28%. For higher recognition speed, we optimized circuit parallelism by applying techniques such as loop unrolling, loop pipelining, and array partitioning. Finally, SeaSNN was implemented and verified on an FPGA board, achieving an inference speed of 0.000401 seconds per frame and a power efficiency of 0.42 TOPS/W at a frequency of 200 MHz. These results demonstrate that the proposed low-power, high-precision, and fast handwritten digit recognition system is well-suited for handwritten digit recognition tasks.

  • Research Article
  • Cite count: 1
  • 10.3390/foods14152612
An Integrated Lightweight Neural Network Design and FPGA-Accelerated Edge Computing for Chili Pepper Variety and Origin Identification via an E-Nose.
  • Jul 25, 2025
  • Foods (Basel, Switzerland)
  • Ziyu Guo + 6 more

A chili pepper variety and origin detection system that integrates a field-programmable gate array (FPGA) with an electronic nose (e-nose) is proposed in this paper to address the issues of variety confusion and origin ambiguity in the chili pepper market. The system uses the AIRSENSE PEN3 e-nose from Germany to collect gas data from thirteen different varieties of chili peppers and two specific varieties of chili peppers originating from seven different regions. Model training is conducted via the proposed lightweight convolutional neural network ChiliPCNN. By combining the strengths of a convolutional neural network (CNN) and a multilayer perceptron (MLP), the ChiliPCNN model achieves an efficient and accurate classification process, requiring only 268 parameters for chili pepper variety identification and 244 parameters for origin tracing, with 364 floating-point operations (FLOPs) and 340 FLOPs, respectively. The experimental results demonstrate that, compared with other advanced deep learning methods, the ChiliPCNN has superior classification performance and good stability. Specifically, ChiliPCNN achieves accuracy rates of 94.62% in chili pepper variety identification and 93.41% in origin tracing tasks involving Jiaoyang No. 6, with accuracy rates reaching as high as 99.07% for Xianjiao No. 301. These results fully validate the effectiveness of the model. To further increase the detection speed of the ChiliPCNN, its acceleration circuit is designed on the Xilinx Zynq7020 FPGA from the United States and optimized via fixed-point arithmetic and loop unrolling strategies. The optimized circuit reduces the latency to 5600 ns and consumes only 1.755 W of power, significantly improving the resource utilization rate and processing speed of the model. 
This system not only achieves rapid and accurate chili pepper variety and origin detection but also provides an efficient and reliable intelligent agricultural management solution, which is highly important for promoting the development of agricultural automation and intelligence.

  • Research Article
  • 10.71097/ijsat.v16.i2.4319
Real-time Compilation and Performance Monitoring for High-Performance Systems
  • May 1, 2025
  • International Journal on Science and Technology
  • Aadithya P Goutham + 2 more

High Performance Computing applications are diverse and operate in dynamic environments. This requires a shift in compilation techniques from static, hardcoded algorithm-driven approaches to dynamic, real-time optimizing strategies. Current compiler methodologies primarily rely on static code analysis to apply optimizations during the compilation phase. Optimization techniques such as loop unrolling, vectorization, and function inlining are only effective for predictable workloads; they lack the ability to adapt to changing runtime conditions and hardware variability. System-specific optimizations can be manually applied by the programmer, but this makes the code less portable and requires rewriting entire programs, thus increasing the cost of maintenance. Our approach includes a real-time performance monitoring system that can trigger a recompilation dynamically to change code execution patterns. Runtime feedback is used to identify bottlenecks such as core overutilization, cache inefficiencies, and memory bottlenecks. A distinctive feature of the system is feedback-driven real-time compiler optimization. Although the performance benefits of dynamically compiled code are offset by the overhead incurred from the monitoring and recompilation system, the overall efficiency of the program throughout its runtime improves incrementally over each iteration of dynamically recompiled code. This efficiency improvement can also lead to energy savings through a reduction in wasted computational resources. The work presented here lays the foundations for adaptable and feedback-driven compiler optimization strategies.

  • Research Article
  • Cite count: 7
  • 10.1109/jssc.2024.3517333
A 5-nm 60-GS/s 7b 64-Way Time Interleaved Partial Loop Unrolled SAR ADC Achieving 35.2dB SNDR up to 32 GHz
  • Apr 1, 2025
  • IEEE Journal of Solid-State Circuits
  • Claudio Nani + 18 more

A 60-GS/s 7b 64-way time interleaved (TI) analog-to-digital converter (ADC) with analog front end (AFE) is described. The presented converter features a non-binary partial loop unrolled (LU) SAR SubADC architecture that leverages multiple comparators, thus enabling a better tradeoff between noise and power compared to conventional SAR. Offset mismatches among the comparators of each SubADC are calibrated in the background by detecting patterns in the SAR output decisions. This removes the need for any analog hardware reconfigurability or additional phase overhead. Fabricated in 5-nm technology, the prototype AFE and ADC deliver 35.5 and 35.2 dB signal to noise and distortion ratio (SNDR) up to 20 and 32 GHz, respectively, and draw 109.3 mW from a 0.9 V supply.

  • Research Article
  • Cite count: 1
  • 10.3390/computers14020071
Energy Implications of Mitigating Side-Channel Attacks on Branch Prediction
  • Feb 16, 2025
  • Computers
  • Fahad Alqurashi + 3 more

Spectre variants 1 and 2 pose grave security threats to dynamic branch predictors in modern CPUs. While extensive research has focused on mitigating these attacks, little attention has been given to their energy and power implications. This study presents an empirical analysis of how compiler-based Spectre mitigation techniques influence energy consumption. We collect fine-grained energy readings from an HPC-class CPU via embedded sensors, allowing us to quantify the trade-offs between security and power efficiency. By utilizing a standard suite of microbenchmarks, we evaluate the impact of Spectre mitigations across three widely used compilers, comparing them to a no-mitigation baseline. The results show that energy consumption varies significantly depending on the compiler and workload characteristics. Loop unrolling influences power consumption by altering branch distribution, while speculative execution, when unrestricted, plays a role in conserving energy. Since Spectre mitigations inherently limit speculative execution, they should be applied selectively to vulnerable code patterns to optimize both security and power efficiency. Unlike previous studies that primarily focus on security effectiveness, this work uniquely evaluates the energy costs associated with Spectre mitigations at the compiler level, offering insights for power-efficient security strategies. Our findings underscore the importance of tailoring mitigation techniques to application needs, balancing performance, energy consumption, and security. The study provides practical recommendations for compiler developers to build more secure and energy-efficient software.

  • Research Article
  • 10.3390/app15042021
Optimizing Lattice Basis Reduction Algorithm on ARM V8 Processors
  • Feb 14, 2025
  • Applied Sciences
  • Ronghui Cao + 6 more

The LLL (Lenstra–Lenstra–Lovász) algorithm is an important method for lattice basis reduction and has broad applications in computer algebra, cryptography, number theory, and combinatorial optimization. However, current LLL algorithms face challenges such as inadequate adaptation to domestic supercomputers and low efficiency. To enhance the efficiency of the LLL algorithm in practical applications, this research focuses on parallel optimization of the LLL_FP (LLL double-precision floating-point type) algorithm from the NTL library on the domestic Tianhe supercomputer using the Phytium ARM V8 processor. The optimization begins with the vectorization of the Gram–Schmidt coefficient calculation and row transformation using the SIMD instruction set of the Phytium chip, which significantly improves computational efficiency. Further assembly-level optimization fully utilizes the low-level instructions of the Phytium processor, which increases execution speed. In terms of memory access, data prefetch techniques were then employed to load necessary data in advance of computation, reducing cache misses and accelerating data processing. To further enhance performance, loop unrolling was applied to the core loop, which allows more operations per loop iteration. Experimental results show that the optimized LLL_FP algorithm achieves up to a 42% performance improvement, with a minimum improvement of 34% and an average improvement of 38% in single-core efficiency compared to the serial LLL_FP algorithm. This study provides a more efficient solution for large-scale lattice basis reduction and demonstrates the potential of the LLL algorithm in ARM V8 high-performance computing environments.
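
The combination of prefetching and loop unrolling described above can be sketched in portable C as follows. This is not the paper's code, which targets Phytium SIMD and assembly; the function name `row_transform_unrolled`, the unroll factor of 4, and the prefetch distance are assumptions. `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16   /* hypothetical prefetch distance, in elements */

/* Row transformation bi[k] -= mu * bj[k], as used when reducing one
 * basis row by a multiple of another. The loop is unrolled by 4 and
 * issues prefetch hints for data a few iterations ahead. */
void row_transform_unrolled(size_t n, double mu, const double *bj,
                            double *bi)
{
    size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        if (k + PREFETCH_AHEAD < n) {
            /* rw = 0: read, rw = 1: write; locality 3 = keep in cache */
            __builtin_prefetch(&bj[k + PREFETCH_AHEAD], 0, 3);
            __builtin_prefetch(&bi[k + PREFETCH_AHEAD], 1, 3);
        }
        bi[k]     -= mu * bj[k];
        bi[k + 1] -= mu * bj[k + 1];
        bi[k + 2] -= mu * bj[k + 2];
        bi[k + 3] -= mu * bj[k + 3];
    }
    for (; k < n; k++)          /* remainder iterations */
        bi[k] -= mu * bj[k];
}
```

Prefetch hints never change the result, only the timing, so a suitable distance has to be found by measurement on the target processor.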

  • Research Article
  • 10.7868/s3034584725060047
LOW-LEVEL CODE SEMANTIC EQUIVALENCE CHECKER
  • Jan 1, 2025
  • Программирование / Programming and Computer Software
  • P.A Putro

The main difficulty in the function equivalence checking procedure is the presence of loops in the functions being checked. If there are no loops, it is possible to simply construct SMT formulas for these functions, link their inputs and outputs according to the ABI, and pass them to an SMT solver for equivalence checking. However, in the presence of loops, it is also necessary to provide sufficient invariants of these loops, which is in general an undecidable task and requires human involvement. When proving equivalence of programs at the machine code level, manual inference of invariants can become difficult due to the low readability of machine code, even when a disassembler is used. This justifies the need for research on approaches to automated program equivalence proofs. This paper presents an approach for automated equivalence checking of machine code level functions with loops, which makes it possible to prove equivalence under loop optimizations such as loop unrolling and loop peeling. The approach uses a new method of function alignment based on the analysis of the reachability of basic blocks of code. Function alignment is a necessary step for the subsequent procedure of invariant inference, which is possible only if loops are equalized by the number of iterations. The approach uses a SyGuS solver for invariant inference, which also distinguishes it from other existing equivalence proof approaches that use their own solutions for this purpose.
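
To show the kind of function pairs such a checker has to relate, here is a small source-level sketch of one loop together with an unrolled and a peeled variant. The function names are hypothetical and the example is in C rather than machine code, so it illustrates the transformations only, not the paper's method.

```c
#include <stddef.h>
#include <stdint.h>

/* Original loop: sum of n elements (unsigned arithmetic wraps, so
 * there is no overflow undefined behavior). */
uint32_t sum_original(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Loop unrolling by 2: two elements per iteration, then at most one
 * leftover element. The loop runs about half as many iterations. */
uint32_t sum_unrolled2(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s += v[i];
        s += v[i + 1];
    }
    if (i < n)
        s += v[i];
    return s;
}

/* Loop peeling: the first iteration is pulled out of the loop. */
uint32_t sum_peeled(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    if (n == 0)
        return s;
    s += v[0];                      /* peeled first iteration */
    for (size_t i = 1; i < n; i++)
        s += v[i];
    return s;
}
```

Running the three variants on sample inputs can only test agreement; a proof of equivalence for all inputs needs loop invariants, and the differing iteration counts are why the loops must first be aligned.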

  • Research Article
  • 10.1587/transinf.2025pap0007
Loop Unrolling and DFG Partitioning for CGRAs: A Case Study of The Lattice Boltzmann Method
  • Jan 1, 2025
  • IEICE Transactions on Information and Systems
  • Toshiyuki Ichiba + 2 more

Driven by the strong demand for enhanced performance in High-Performance Computing (HPC), Coarse-Grained Reconfigurable Architectures (CGRAs) are promising technologies that offer high performance even under power consumption constraints. Performance on CGRAs is significantly influenced by loop unrolling, a technique that increases computational parallelism by utilizing more processing elements in CGRAs. Determining the optimal loop unrolling factor is challenging in applications with multiple loops. This paper presents a case study demonstrating the determination of optimal loop unrolling factors for an application based on the Lattice Boltzmann Method (LBM). Because the application's process exceeds the capacity of a single CGRA, this paper proposes a method for partitioning the process to fit the CGRA's resources using integer linear programming (ILP). Finally, this paper provides a performance estimation of the CGRAs runtime and demonstrates the effectiveness of CGRAs for HPC.

  • Research Article
  • 10.1049/ell2.70242
Loop‐Unrolled SAR ADC With Complementary Voltage‐to‐Time Converters
  • Jan 1, 2025
  • Electronics Letters
  • Da‐Yeon Kim + 4 more

A 6‐bit asynchronous loop‐unrolled (LU) successive approximation register (SAR) analogue‐to‐digital converter (ADC) with a complementary voltage‐to‐time converter (CVTC) and the efficient latch technique needed for this structure are proposed. The proposed structure utilises CVTC to reduce the power consumed by the reset operation and halves the operating frequency of the CVTC. Designed with a 500‐nm CMOS process, the 6‐bit 10‐MS/s LU SAR ADC shows a power saving of 23.66% compared to the VTC‐based LU SAR ADC.

  • Research Article
  • Cite count: 2
  • 10.1016/j.compbiomed.2024.109258
Lightweight skin cancer detection IP hardware implementation using cycle expansion and optimal computation arrays methods
  • Oct 23, 2024
  • Computers in Biology and Medicine
  • Qikang Li + 6 more

  • Research Article
  • 10.18280/ijcmem.120308
Optimizing Program Efficiency by Predicting Loop Unroll Factors Using Ensemble Learning
  • Sep 30, 2024
  • International Journal of Computational Methods and Experimental Measurements
  • Esraa H Alwan + 1 more

  • Research Article
  • 10.1142/s0218126625500355
Machine Learning-Driven GCC Loop Unrolling Optimization: Compiler Performance Enhancement Strategy Based on XGBoost
  • Sep 23, 2024
  • Journal of Circuits, Systems and Computers
  • Zhaoyi Shi + 2 more

In contemporary compilers, the determination of the loop unrolling factor is traditionally based on manually crafted heuristic rules. This approach heavily relies on human intuition, which limits its ability to achieve optimized performance across diverse architectures and can sometimes even lead to performance declines. Additionally, developers face challenges in achieving cross-platform compatibility, often necessitating extensive redesign efforts. In response, this study introduces a method leveraging the XGBoost algorithm to predict the optimal loop unrolling factor for compiler optimization, thereby aiming to replace human thinking with machine learning methods and standardize development processes. Initially, the study gathers data on the loop unrolling factors as determined by profile guided optimization technology, analyzes program-specific loop feature vectors and employs cross-validation, including the Pearson correlation coefficient and feature importance ranking, to construct a dataset. Subsequent use of XGBoost to train this dataset models the decision-making process for selecting the most effective loop unrolling factor. The final step involves integrating XGBoost’s trained decision tree model into GCC to calculate the optimal loop unrolling factor during actual compilation. Empirical results on the RISC-V platform indicate that this new method, when tested against the SPEC CPU 2006 benchmark suite, offers up to 6.18% improvement in performance over the existing heuristic approach. It provides a new method for loop unrolling in compilers, and provides an innovative guide for the application of machine learning in compilers.

  • Research Article
  • Cite count: 2
  • 10.1109/tc.2024.3398424
TensorMap: A Deep RL-Based Tensor Mapping Framework for Spatial Accelerators
  • Aug 1, 2024
  • IEEE Transactions on Computers
  • Fuyu Wang + 3 more

The mapping of tensor computation is a complex and important process for spatial accelerators. Today's mapping works depend on hand-tuned kernel libraries or search-based heuristics from human experts. The former is time-intensive while the latter easily leads to sub-optimal performance. In this paper, we propose TensorMap, a deep reinforcement learning (RL)-based mapping framework for tensor computations on spatial accelerators. We propose a sequential generation mode for mapping optimization and construct a coarse-grained action space to reduce the complexity of the mapping search space. An efficient policy network is devised to optimize mapping primitives in the RL-based search. We then propose a stop signal that is sampled from a Bernoulli distribution to facilitate multi-level loop unrolling for spatial accelerators. Finally, a genetic algorithm is employed to further refine the optimized mappings. In the experiments, we demonstrate TensorMap's ability for different spatial accelerators with various tensor computations. On TPU, TensorMap provides 2.6×, 2.7×, and 2.4× better energy-delay product (EDP) on average compared with FlexTensor, Ansor, and AMOS respectively. On Eyeriss, TensorMap provides 2.1×, 1.8×, and 1.7× better EDP on average compared with FlexTensor, Ansor, and AMOS respectively.

  • Research Article
  • Cite count: 1
  • 10.55041/ijsrem35342
AES 128 Bit Optimization: High-Speed and Area-Efficient through Loop Unrolling
  • Jun 2, 2024
  • INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
  • Sandarbh Yadav

This study introduces a high-throughput FPGA implementation of AES-128, prioritizing efficiency for robust security and fast data processing needs. AES-128 is renowned for its security and widespread use in various applications. Employing techniques like loop unrolling and pipelining, the implementation maximizes throughput and customizes AES for FPGA architectures. A novel optimization approach, "new-affine-transformation," reduces resource demands and latency for the Sub-Bytes function. The AES architecture is strategically modified for efficiency, with rearranged functions and streamlined processing. The implementation, in VHDL and utilizing Xilinx Virtex-5 FPGA, achieves remarkable performance: 37.9 Gbps (encryption) and 38.5 Gbps (decryption) throughput at frequencies of 296.789 MHz (encryption) and 300.806 MHz (decryption). Resource utilization is efficient, with 264 (encryption) and 260 (decryption) slice registers and 1044 (encryption) and 1581 (decryption) total slices. Keywords: AES, FPGA, cryptography, encryption, decryption, throughput, plain text, cipher text
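
The relationship between the reported clock frequencies and throughput figures can be checked with a short calculation. The sketch below assumes that a fully unrolled, pipelined core completes one 128-bit block per clock cycle in steady state; the abstract does not state this explicitly, and the function name is hypothetical.

```c
/* Steady-state throughput in Gbps of a pipelined block cipher core.
 * block_bits:       bits per block (128 for AES)
 * clock_mhz:        clock frequency in MHz
 * cycles_per_block: clock cycles between completed blocks
 *                   (assumed to be 1 for a fully unrolled pipeline) */
double pipelined_throughput_gbps(double block_bits, double clock_mhz,
                                 double cycles_per_block)
{
    double mbps = block_bits * clock_mhz / cycles_per_block;
    return mbps / 1000.0;
}
```

With one block per cycle, 128 bits at 296.789 MHz gives about 37.99 Gbps and 128 bits at 300.806 MHz gives about 38.50 Gbps, close to the reported 37.9 Gbps and 38.5 Gbps.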
