Study and evaluation of automatic offloading for function blocks of applications
Systems using graphical processing units (GPUs) and field-programmable gate arrays (FPGAs) have increased due to their advantages over central processing units (CPUs). However, such systems require the understanding of hardware-specific technical specifications such as Hardware Description Language (HDL) and compute unified device architecture (CUDA), which is a high hurdle. Based on this background, we previously proposed environment-adaptive software that enables automatic conversion, configuration and high-performance operation of existing code according to the hardware to be placed. As an element of this concept, we also proposed a method of automatically offloading loop statements of application source code for CPUs to GPUs and FPGAs. In this paper, we propose a method for offloading a function block, which is a larger unit, instead of individual loop statements in an application to achieve higher speed by automatically offloading to GPUs and FPGAs. We implemented the proposed method and evaluated it using current applications offloading to GPUs and FPGAs.
- Conference Article
24
- 10.1145/3395245.3396200
- Mar 28, 2020
Recently, the use of hardware other than CPUs (Central Processing Units), such as GPUs (Graphics Processing Units) and FPGAs (Field-Programmable Gate Arrays), has been increasing, including in the education field. However, when using heterogeneous hardware other than CPUs, the barriers of technical skills such as CUDA (Compute Unified Device Architecture) and HDL (Hardware Description Language) are high. Based on that, I have proposed environment-adaptive software that enables automatic conversion, configuration, and high-performance operation of once-written code, according to the hardware to be placed. Part of the offloading to GPUs and FPGAs was automated previously. In this paper, I improve the previous automatic GPU offloading method to expand the applicable software and further enhance performance. I evaluate the effectiveness of the proposed method in multiple applications.
- Research Article
2
- 10.14279/depositonce-7180
- Jan 1, 2015
The usage of graphics processing units (GPUs) as computing architectures for inherently data-parallel signal processing applications in this computing era is very popular. In principle, GPUs in comparison with central processing units (CPUs) could achieve significant speed-up over the latter, especially considering data-parallel applications which expect high throughput. The paper investigates the usage of GPUs for running space borne image data compression algorithms, in particular the CCSDS 122.0-B-1 standard as a case study. The paper proposes an architecture to parallelize the Bit-Plane Encoder (BPE) stage of the CCSDS 122.0-B-1 in lossless mode using a GPU to achieve high throughput performance to facilitate real-time compression of satellite image data streams. Experimental results are furnished by comparing the performance in terms of compression time of the GPU implementation versus a state-of-the-art single-threaded CPU implementation and a field-programmable gate array (FPGA) implementation. The GPU implementation on an NVIDIA® GeForce® GTX 670 achieves a peak throughput performance of 162.382 Mbyte/s (932.288 Mbit/s) and an average speed-up of at least compared to the software implementation running on a 3.47 GHz single core Intel® Xeon™ processor. The high throughput CUDA implementation using GPUs could potentially be suitable for air borne and space borne applications in the future, if the GPU technology evolves to become radiation-tolerant and space-qualified.
- Research Article
2
- 10.1080/23311916.2022.2080624
- Jun 8, 2022
- Cogent Engineering
Heterogeneous hardware other than a small-core central processing unit (CPU) such as a graphics processing unit (GPU), field-programmable gate array (FPGA), or multi-core CPU is increasingly being used. However, to use heterogeneous hardware, programmers must have sufficient technical skills to utilize OpenMP, CUDA, and OpenCL. On the basis of this, we have proposed an environment-adaptive software that enables automatic conversion, configuration, and high-performance operation of once written code, in accordance with the hardware to be placed. However, no techniques have been developed to properly and automatically offload applications in the mixed offloading destination environment such as GPU, FPGA, and multi-core CPU. In this paper, for a new element of environment-adaptive software, we study a method for offloading applications properly and automatically in an environment where the offloading destination is a mix of GPU, FPGA, and multi-core CPU. We evaluate the effectiveness of the proposed method in multiple applications.
- Conference Article
6
- 10.1109/mcsoc.2019.00050
- Oct 1, 2019
GPU (Graphics Processing Unit) and CPU (Central Processing Unit) possess a sufficient and appropriate performance to compute massively parallel applications like AI, Big data, and material sciences. However, their real performance is far lower than their theoretical performance. The primary reason for the performance degradation is that they suffer from limited memory bandwidth and an inefficient interconnection topology that is not optimized for these types of applications. Thus, from the viewpoint of real computational performance, called computational efficiency, FPGA (Field Programmable Gate Array) is now becoming an attractive chip for these types of applications with massively parallel computation. FPGA can efficiently provide optimized communication and bridge different computing accelerators as customized hardware. In other words, FPGA-based hardware accelerators offer a convenient solution for both high performance and high memory bandwidth. However, one serious concern is usability. For example, FPGA design using a hardware description language is a meticulous task and requires specialized skill sets as well as a long time to market. An overlay architecture is an appropriate candidate for resolving this issue because it offers a software layer that simplifies FPGA programmability by abstracting the fabric resources. Thus, this article proposes an overlay architecture based on a tightly-connected many-core-based CGRA (Coarse-Grained Reconfigurable Architecture). It will help software engineers in seamlessly implementing their applications. Our final goal is not the current fine-grained FPGAs but new middle-to-coarse-grained programmable chips. If an ASIC (Application-Specific Integrated Circuit) implementation were adopted, the performance would be at least ten times higher than that of the current FPGA implementation because of the working frequency.
In this article, the proposed overlay system provides a programmable interface that virtualizes FPGA resources and lets prospective users focus on high-level software programming.
- Conference Article
2
- 10.7148/2012-0399-0404
- May 29, 2012
General-purpose graphics processing unit (GPGPU) programming is a novel approach for solving parallel variable independent problems. The graphic processor core (GPU) gives the possibility to use multiple blocks, each of which contains hundreds of threads. Each of these threads can be visualized as a core unto itself, and tasks can be simultaneously sent to all the threads for parallel evaluations. This research explores the advantages of applying an evolutionary algorithm (EA) on the GPU in terms of computational speedups. Enhanced Differential Evolution (EDE) is applied to the generic permutative flowshop scheduling (PFSS) problem using both the central processing unit (CPU) and the GPU, and the results in terms of execution time are compared.
INTRODUCTION
During the later part of the past decade, a novel trend emerged where programmers started using the Graphics Processing Unit (GPU) for programming non-graphics applications, which was usually in the purview of the Central Processing Unit (CPU). The reasoning behind such a move was the possibility of achieving order-of-magnitude speedups compared to optimized CPU implementations. GPUs have evolved into fast, highly multi-threaded processors, with hundreds of cores and thousands of concurrent threads. These threads, which can be invoked simultaneously, provide an excellent platform for parallel execution. A GPU is optimal when a problem has to be executed many times, can be isolated as a function and works independently on different data. Among the most challenging and computationally demanding problems in engineering are the NP-hard problems. These problems are computationally intractable, and often require the use of optimization algorithms. This research attempts to solve the challenging flowshop scheduling (FSS) problem using a novel Enhanced Differential Evolution (EDE) algorithm utilizing GPU programming.
One of the most widespread programming architectures is the Compute Unified Device Architecture (CUDA) of Nvidia (NVIDIA, 2012). A number of studies have been conducted on GPU programming involving evolutionary algorithms and these two architectures. Tabu Search has been used for evaluating the FSS problem using CUDA by Czapinski and Barnes (2011). Genetic Algorithms (GA) have been used to solve the traveling salesman problem by Chen et al. (2011), whereas a parallel GA approach has been presented by Pospichal et al. (2010). The particle swarm algorithm has also been modified for use with CUDA (Mussi et al., 2011). More interestingly, Genetic Programming has also found a niche in GPU programming (Robilliard et al., 2009). This research utilizes the Nvidia CUDA framework for GPU computation. The Enhanced Differential Evolution (EDE) (Davendra and Onwubolu, 2009) is adapted to the GPU framework, and the execution times of the GPU and CPU variants are compared. This paper follows the following structure. Section 1 outlines the CUDA framework and syntax. Section 2 describes Differential Evolution (DE) and the EDE algorithms. The problem attempted in this research, flow shop scheduling, is given in Section 3. Section 4 describes the code design on the GPU, whereas the experimentation and analysis (Section 5) compares the obtained results. The paper is concluded in Section 6.
Proceedings 26th European Conference on Modelling and Simulation ©ECMS Klaus G. Troitzsch, Michael Mohring, Ulf Lotzmann (Editors) ISBN: 978-0-9564944-4-3 / ISBN: 978-0-9564944-5-0 (CD)
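As a rough illustration of why the permutative flowshop problem in this entry suits per-thread evaluation, the sketch below computes the makespan of one job permutation. The abstract does not name the objective function, so makespan (the usual PFSS objective) is an assumption here, and the function name `makespan` and the sample processing times are hypothetical. This is plain CPU Python, not the authors' CUDA code.

```python
def makespan(permutation, processing_times):
    """Return the completion time of the last job on the last machine.

    processing_times[job][machine] is the time a job needs on a machine.
    Every job visits the machines in the same order (permutation flowshop).
    """
    n_machines = len(processing_times[0])
    # completion[m] holds the finish time of the most recent job on machine m.
    completion = [0] * n_machines
    for job in permutation:
        for m in range(n_machines):
            # A job starts on machine m once the machine is free and the
            # job has finished on the previous machine.
            ready = completion[m - 1] if m > 0 else 0
            start = max(completion[m], ready)
            completion[m] = start + processing_times[job][m]
    return completion[-1]


# Hypothetical instance: 3 jobs, 2 machines.
times = [[3, 2], [1, 4], [2, 1]]

# Each candidate schedule is scored independently of the others, so one
# evaluation per thread is a natural mapping on a GPU.
population = [[0, 1, 2], [1, 0, 2]]
scores = [makespan(p, times) for p in population]
```

Because each call reads only its own permutation and the shared processing-time table, the evaluations share no mutable state, which matches the abstract's description of work that "can be isolated as a function and works independently on different data".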
- Research Article
15
- 10.1080/17445760.2021.1971666
- Sep 4, 2021
- International Journal of Parallel, Emergent and Distributed Systems
Heterogeneous hardware other than a small-core central processing unit (CPU) is increasingly being used, such as a graphics processing unit (GPU), field-programmable gate array (FPGA) or many-core CPU. However, to use heterogeneous hardware, programmers must have sufficient technical skills to utilise OpenMP, CUDA, and OpenCL. On the basis of this, we previously proposed environment-adaptive software that enables automatic conversion, configuration, and high performance operation of once-written code, in accordance with the hardware to be placed. However, the source language for offloading was mainly C/C++ language applications, and there was no research into common offloading for various language applications. In this paper, for a new challenge, we study a common method for automatically offloading various language applications in not only C language but also Python and Java. We evaluate the effectiveness of the proposed method in multiple applications of various languages.
- Research Article
3
- 10.4233/uuid:f785ddec-e2d2-4209-a6a1-81a3cfdd57b6
- Mar 3, 2015
- Research Repository (Delft University of Technology)
Reduction of computing time for seismic applications based on the Helmholtz equation by Graphics Processing Units
- Conference Article
8
- 10.1109/iceei.2017.8312430
- Nov 1, 2017
Graphics Processing Units (GPUs) have progressed very rapidly over the last decade. Hardware originally used for processing images displayed on the screen has turned into a device for parallel computing (general-purpose GPU). GPUs can also be used to perform radar data processing, either in the signal processing stage or in the data processing stage. This is feasible because radar data is processed in large volumes and the computation can be parallelized. Previously, radar data processing was performed using a specialized digital signal processor (DSP) device and/or field-programmable gate array (FPGA). But both types of devices cost more than a GPU. In addition, both have low scalability. Using a GPU is a compromise compared to using a DSP or FPGA, because the GPU can cover the above weaknesses, although GPU power consumption is not as good as that of DSPs and FPGAs. This paper examines the extent to which GPUs have been used in radar signal processing and radar data processing. Several studies have compared GPU implementations of radar signal and data processing algorithms against implementations on the usual Central Processing Unit (CPU). The comparison results show that GPU performance is much better than CPU performance. Speedup relative to the CPU has reached the double-digit level.
- Research Article
2
- 10.3390/jlpea15030040
- Jul 21, 2025
- Journal of Low Power Electronics and Applications
This study presents a comprehensive performance evaluation of field-programmable gate array (FPGA), graphics processing unit (GPU), and central processing unit (CPU) platforms for implementing finite impulse response (FIR) filters in semiconductor-based digital signal processing (DSP) systems. Utilizing a standardized FIR filter designed with the Kaiser window method, we compare computational efficiency, latency, and energy consumption across the ZYNQ XC7Z020 FPGA, Tesla K80 GPU, and Arm-based CPU, achieving processing times of 0.004 s, 0.008 s, and 0.107 s, respectively, with FPGA power consumption of 1.431 W and comparable energy profiles for GPU and CPU. The FPGA is 27 times faster than the CPU and 2 times faster than the GPU, demonstrating its suitability for low-latency DSP tasks. A detailed analysis of resource utilization and scalability underscores the FPGA’s reconfigurability for optimized DSP implementations. This work provides novel insights into platform-specific optimizations, addressing the demand for energy-efficient solutions in edge computing and IoT applications, with implications for advancing sustainable DSP architectures.
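The speedup figures quoted in this entry follow directly from the reported processing times, as the short check below shows. The energy estimate is an added assumption: it treats the reported 1.431 W as constant over the whole run, which the abstract does not state. The variable and function names are illustrative only.

```python
# Processing times reported in the abstract, in seconds.
fpga_s, gpu_s, cpu_s = 0.004, 0.008, 0.107
# FPGA power reported in the abstract, in watts.
fpga_power_w = 1.431


def speedup(baseline_s, accelerated_s):
    """Ratio of baseline time to accelerated time."""
    return baseline_s / accelerated_s


fpga_vs_cpu = speedup(cpu_s, fpga_s)  # about 26.75, reported as 27 times
fpga_vs_gpu = speedup(gpu_s, fpga_s)  # about 2, reported as 2 times

# Assumption: power draw is constant during the run, so energy = power * time.
fpga_energy_j = fpga_power_w * fpga_s  # about 0.0057 J
```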
- Conference Article
3
- 10.1109/asap.2019.00014
- Jul 1, 2019
Base64 encoding has many applications on the Web. Previous studies investigated the optimizations of Base64 encoding algorithm on central processing units (CPUs). In this paper, we describe the optimizations of the algorithm on heterogeneous computing platforms. More specifically, we explain the algorithm, convert the algorithm to kernels written in CUDA C/C++ and Open Computing Language (OpenCL), optimize the CUDA and OpenCL applications with CUDA and OpenCL streams which can overlap data transfers with kernel computations, and vectorize the CUDA and OpenCL kernels to improve kernel throughput. We evaluate the impact of the number of streams upon the kernel performance on an NVIDIA Pascal P100 graphics processing unit (GPU) and a Nallatech 385A card that features an Intel Arria 10 GX1150 field-programmable gate array (FPGA). We also measure the performance and power of the applications on the CPU, GPU, and FPGA to know the advantage of each platform and the benefit of kernel offloading. The experiments show that using vector data types in the kernels is not for performance, and more work-items is better than large vectors per work-item on the GPU. OpenCL and CUDA streams can achieve almost the same performance on the GPU, but streams should be used with caution when GPU resources are underutilized. On the FPGA, kernel vectorization using 16 vector lanes can achieve the highest performance when the number of streams is one. However, increasing the vector width per work-item and the number of streams can decrease the kernel computation time for each stream, and thereby reduce the number of concurrent operations across the streams. While the raw performance on the GPU is 3.1X higher than that on the FPGA, the FPGA consumes 3.4X less power. A comparison with a state-of-the-art implementation on an Intel CPU server shows an increasing benefit of kernel offloading.
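For readers unfamiliar with the algorithm this entry offloads, here is a minimal sketch of standard Base64 encoding (RFC 4648 alphabet with `=` padding) in plain Python. It is not the authors' CUDA or OpenCL kernel. The function name `base64_encode` is illustrative, and the per-group loop is only meant to show that each 3-byte group is encoded independently of the others.

```python
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def base64_encode(data: bytes) -> str:
    """Encode bytes as Base64 text, three input bytes at a time."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)  # 0, 1 or 2 missing bytes in the last group
        # Pack the (zero-padded) group into one 24-bit integer.
        word = int.from_bytes(chunk + b"\x00" * pad, "big")
        # Split the 24 bits into four 6-bit indices into the alphabet.
        chars = [ALPHABET[(word >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if pad:
            # Replace the characters that came only from padding bytes.
            chars[-pad:] = "=" * pad
        out.extend(chars)
    return "".join(out)


encoded = base64_encode(b"Man")
```

Since no group depends on its neighbours, the loop body can be assigned to separate work-items, which is one plausible reason the algorithm maps well to GPU and FPGA kernels.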
- Supplementary Content
4
- 10.25534/tuprints-00009288
- Nov 30, 2019
- Americanae (AECID Library)
Molecular Docking (MD) is a key tool in computer-aided drug design that aims to predict the binding pose between a small molecule and a macromolecular target. At its core, MD calculates the strength of possible binding poses, and searches for the energetically stronger ones among those generated during simulation. Automatic Docking (AutoDock) is a widely used MD code that employs a physics-based scoring function to quantify the binding strength. AutoDock also uses a Lamarckian Genetic Algorithm (LGA), and in turn, the Solis-Wets method, as a local-search algorithm, in order to find strong interactions of such molecular systems. Due to the highly parallel nature of the LGA tasks involved, AutoDock can benefit from runtime acceleration based on parallelization. This thesis presents an OpenCL-based parallelization of AutoDock, and a corresponding evaluation in terms of execution performance, quality-of-results, and compute-energy efficiency, achieved on different platforms based on multi-core Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). While a data-parallel approach has proven its effectiveness in accelerating AutoDock on CPUs and GPUs, it was observed that for FPGAs, such an approach resulted in slower executions in the range of three orders of magnitude when compared against the original single-threaded AutoDock. To overcome this drawback, a task-parallel implementation for FPGAs is discussed as well. Besides presenting an AutoDock implementation parallelized using OpenCL, this thesis also extends the LGA search with new alternative local-search methods based on gradients (of the scoring function) such as Steepest-Descent, FIRE, and ADADELTA. Among these, it was found that ADADELTA provides significant algorithmic benefits over Solis-Wets, yielding a reduction in calculation effort down to 1/1300 of the legacy Solis-Wets method, while achieving equivalent quality-of-results.
Compared to the original single-threaded AutoDock, the proposed data-parallel design achieves a speedup of up to ∼399x and improves the compute-energy efficiency by up to ∼297x when running on modern V100 GPUs. Furthermore, this thesis describes the adaptations performed on the proposed OpenCL-based implementation for supporting challenging real-world MD scenarios.
- Research Article
11
- 10.1109/tvlsi.2011.2128354
- May 1, 2012
- IEEE Transactions on Very Large Scale Integration (VLSI) Systems
This paper presents a novel parallel architecture for accelerating quadrature methods used for pricing complex multi-dimensional options, such as discrete barrier, Bermudan and American options. We explore different designs of the quadrature evaluation core including optimized pipelined hardware designs in reconfigurable logic and a compute unified device architecture (CUDA)-based graphics processing unit (GPU) design. A parametrizable automated system is presented for generating hardware quadrature evaluation cores with an arbitrary number of dimensions. The performance and energy consumption of field-programmable gate arrays (FPGAs), GPUs, and central processing units (CPUs) are compared across different number of dimensions and precisions. Our evaluation shows that the 100 MHz Virtex-4 xc4vlx160 FPGA design is 4.6 times faster and 25.9 times more energy efficient than a multi-threaded optimized software implementation running on a Xeon W3504 dual-core CPU. It is also 2.6 times faster and 25.4 times more energy efficient than a GPU with comparable silicon process technology.
- Research Article
32
- 10.1016/j.jpdc.2016.05.014
- May 31, 2016
- Journal of Parallel and Distributed Computing
FPGA, GPU, and CPU implementations of Jacobi algorithm for eigenanalysis
- Single Report
3
- 10.2172/1375449
- Jul 28, 2017
Compared to central processing units (CPUs) and graphics processing units (GPUs), field programmable gate arrays (FPGAs) have major advantages in reconfigurability and performance achieved per watt. This development flow has been augmented with high-level synthesis (HLS) flow that can convert programs written in a high-level programming language to Hardware Description Language (HDL). Using high-level programming languages such as C, C++, and OpenCL for FPGA-based development could allow software developers, who have little FPGA knowledge, to take advantage of the FPGA-based application acceleration. This improves developer productivity and makes the FPGA-based acceleration accessible to hardware and software developers. Xilinx Vivado HLS compiler is a high-level synthesis tool that enables C, C++ and System C specification to be directly targeted into Xilinx FPGAs without the need to create RTL manually. The white paper [1] published recently by Xilinx uses a finite impulse response (FIR) example to demonstrate the variable-precision features in the Vivado HLS compiler and the resource and power benefits of converting floating point to fixed point for a design. To get a better understanding of variable-precision features in terms of resource usage and performance, this report presents the experimental results of evaluating the FIR example using Vivado HLS 2017.1 and a Kintex Ultrascale FPGA. In addition, we evaluated the half-precision floating-point data type against the double-precision and single-precision data type and present the detailed results.
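The floating-point to fixed-point conversion mentioned in this entry can be sketched in a few lines. The example below is plain Python, not Vivado HLS code. The function names, the 8-bit fractional width and the sample filter are all assumptions chosen for illustration. It runs the same direct-form FIR filter once in floating point and once on integers scaled by a power of two.

```python
def fir_float(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(coeffs):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


def to_fixed(value, frac_bits):
    """Quantize a real value to an integer with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))


def fir_fixed(samples, coeffs, frac_bits):
    """Same filter computed on integers, converted back to float at the end."""
    q_samples = [to_fixed(x, frac_bits) for x in samples]
    q_coeffs = [to_fixed(h, frac_bits) for h in coeffs]
    # A product of two fixed-point values carries 2 * frac_bits fractional bits.
    scale = 1 << (2 * frac_bits)
    out = []
    for n in range(len(q_samples)):
        acc = 0  # integer accumulator, as a fixed-point datapath would use
        for k, h in enumerate(q_coeffs):
            if n - k >= 0:
                acc += h * q_samples[n - k]
        out.append(acc / scale)
    return out


# Hypothetical 3-tap smoothing filter and input signal.
coeffs = [0.25, 0.5, 0.25]
samples = [1.0, 2.0, 3.0, 4.0]
float_out = fir_float(samples, coeffs)
fixed_out = fir_fixed(samples, coeffs, frac_bits=8)
```

With these particular values every number is exactly representable in 8 fractional bits, so both versions agree. For general coefficients the fixed-point result differs by a quantization error that shrinks as `frac_bits` grows, which is the precision versus resource trade-off the report evaluates.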
- Conference Article
1
- 10.1145/3289602.3293932
- Feb 20, 2019
Base64 encoding has many applications on the Web. Previous studies have focused on improving the efficiency of Base64 encoding on central processing units (CPUs). As field-programmable gate arrays (FPGAs) are becoming promising heterogeneous computing components in high-performance computing (HPC), and high-level synthesis (HLS) is more mature, we are motivated to optimize Base64 encoding on an FPGA using HLS. In this paper, we explain the algorithm, convert the algorithm to a kernel written in Open Computing Language (OpenCL), and optimize the kernel targeting an Intel Arria 10 FPGA. We evaluate the performance and power of the kernel implementations on the CPU, graphics processing units (GPUs), and FPGA computing platforms. The experimental results show that we can significantly improve the performance of Base64 encoding with the FPGA-specific optimizations. Compared to an Intel Xeon Platinum 8167 CPU, an Nvidia Tesla K80 GPU, and an Nvidia Tesla P100 GPU, the performance (the number of cycles per byte) of Base64 encoding on an Arria10-based FPGA platform is 3.98X higher than that on the K80 GPU, 17X higher than that on the CPU, and 1.83X lower than that on the P100 GPU for large input data sizes. The performance per watt on the FPGA is 1.1X lower than that on the P100 GPU, and 8.25X and 13.2X higher than that on the CPU and the K80 GPU, respectively.