Optimal loop unrolling for GPGPU programs
Graphics Processing Units (GPUs) are massively parallel, many-core processors with tremendous computational power and very high memory bandwidth. With the advent of general purpose programming models such as NVIDIA's CUDA and the new standard OpenCL, general purpose programming using GPUs (GPGPU) has become very popular. However, the GPU architecture and programming model have brought with them many new challenges and opportunities for compiler optimizations. One such classical optimization is loop unrolling. Current GPU compilers perform limited loop unrolling. In this paper, we attempt to understand the impact of loop unrolling on GPGPU programs. We develop a semi-automatic, compile-time approach for identifying optimal unroll factors for suitable loops in GPGPU programs. In addition, we propose techniques for reducing the number of unroll factors evaluated, based on the characteristics of the program being compiled and the device being compiled to. We use these techniques to evaluate the effect of loop unrolling on a range of GPGPU programs and show that we correctly identify the optimal unroll factors. The optimized versions run up to 70 percent faster than the unoptimized versions.
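As a minimal illustration of the transformation this abstract studies, the sketch below sums an array with a plain loop and with the same loop unrolled by a factor of 4. It is written in Python for readability rather than in CUDA or OpenCL, it is not taken from the paper, and the function names are hypothetical. The remainder loop is the part that is easy to forget: it handles trip counts that are not a multiple of the unroll factor.

```python
def sum_rolled(values):
    """Reference loop: one element per iteration."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_unrolled_by_4(values):
    """The same loop unrolled by a factor of 4, plus a remainder loop."""
    n = len(values)
    main_end = n - (n % 4)  # largest multiple of 4 not exceeding n
    total = 0
    # Main loop: four elements per iteration, so a quarter of the
    # loop-control work (increment, compare, branch) of the rolled loop.
    for i in range(0, main_end, 4):
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
    # Remainder loop: the last n % 4 elements.
    for i in range(main_end, n):
        total += values[i]
    return total


data = list(range(10))
assert sum_unrolled_by_4(data) == sum_rolled(data)
```

On a GPU, a larger unroll factor generally trades fewer branch instructions for more registers per thread, which is one plausible reason the best factor depends on both the program and the device.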
- Conference Article
2
- 10.7148/2012-0399-0404
- May 29, 2012
General-purpose computing on graphics processing units (GPGPU) is a novel approach for solving parallel, variable-independent problems. The graphics processing unit (GPU) makes it possible to use multiple blocks, each of which contains hundreds of threads. Each of these threads can be visualized as a core unto itself, and tasks can be sent simultaneously to all the threads for parallel evaluation. This research explores the advantages of applying an evolutionary algorithm (EA) on the GPU in terms of computational speedup. Enhanced Differential Evolution (EDE) is applied to the generic permutative flowshop scheduling (PFSS) problem using both the central processing unit (CPU) and the GPU, and the results are compared in terms of execution time.
INTRODUCTION
During the latter part of the past decade, a novel trend emerged in which programmers started using the Graphics Processing Unit (GPU) for non-graphics applications, which had usually been the purview of the Central Processing Unit (CPU). The reasoning behind such a move was the possibility of achieving speedups of orders of magnitude compared to optimized CPU implementations. GPUs have evolved into fast, highly multi-threaded processors, with hundreds of cores and thousands of concurrent threads. These threads, which can be invoked simultaneously, provide an excellent platform for parallel execution. A GPU is optimal when a problem has to be executed many times, can be isolated as a function, and works independently on different data. Among the most challenging and computationally demanding problems in engineering are the NP-hard problems. These problems are computationally intractable and often require the use of optimization algorithms. This research attempts to solve the challenging flowshop scheduling (FSS) problem using a novel Enhanced Differential Evolution (EDE) algorithm utilizing GPU programming.
One of the most widespread programming architectures is the Compute Unified Device Architecture (CUDA) of Nvidia (NVIDIA, 2012). A considerable amount of research has been conducted on GPU programming involving evolutionary algorithms and such architectures. Tabu Search has been used for evaluating the FSS problem using CUDA by Czapinski and Barnes (2011). Genetic Algorithms (GA) have been used to solve the traveling salesman problem by Chen et al. (2011), whereas a parallel GA approach has been taken by Pospichal et al. (2010). The particle swarm algorithm has also been modified for use with CUDA (Mussi et al., 2011). More interestingly, Genetic Programming has also found a niche in GPU programming (Robilliard et al., 2009). This research utilizes the Nvidia CUDA framework for GPU computation. The Enhanced Differential Evolution (EDE) algorithm (Davendra and Onwubolu, 2009) is adapted to the GPU framework, and the execution times of the GPU and CPU variants are compared. The paper is structured as follows. Section 1 outlines the CUDA framework and syntax. Section 2 describes the Differential Evolution (DE) and EDE algorithms. The problem attempted in this research, flow shop scheduling, is given in Section 3. Section 4 describes the code design on the GPU, whereas the experimentation and analysis (Section 5) compares the obtained results. The paper is concluded in Section 6.
Proceedings 26th European Conference on Modelling and Simulation ©ECMS Klaus G. Troitzsch, Michael Mohring, Ulf Lotzmann (Editors) ISBN: 978-0-9564944-4-3 / ISBN: 978-0-9564944-5-0 (CD)
- Research Article
39
- 10.1155/2020/8862123
- Sep 25, 2020
- Scientific Programming
Graphics processing units (GPUs) have a strong floating-point capability and a high memory bandwidth in data parallelism and have been widely used in high-performance computing (HPC). Compute unified device architecture (CUDA) is used as a parallel computing platform and programming model for the GPU to reduce the complexity of programming. The programmable GPUs are becoming popular in computational fluid dynamics (CFD) applications. In this work, we propose a hybrid parallel algorithm of the message passing interface and CUDA for CFD applications on multi-GPU HPC clusters. The AUSM + UP upwind scheme and the three-step Runge–Kutta method are used for spatial discretization and time discretization, respectively. The turbulent solution is solved by the K−ω SST two-equation model. The CPU only manages the execution of the GPU and communication, and the GPU is responsible for data processing. Parallel execution and memory access optimizations are used to optimize the GPU-based CFD codes. We propose a nonblocking communication method to fully overlap GPU computing, CPU_CPU communication, and CPU_GPU data transfer by creating two CUDA streams. Furthermore, the one-dimensional domain decomposition method is used to balance the workload among GPUs. Finally, we evaluate the hybrid parallel algorithm with the compressible turbulent flow over a flat plate. The performance of a single GPU implementation and the scalability of multi-GPU clusters are discussed. Performance measurements show that multi-GPU parallelization can achieve a speedup of more than 36 times with respect to CPU-based parallel computing, and the parallel algorithm has good scalability.
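The one-dimensional domain decomposition mentioned in this abstract can be pictured with a short sketch. The abstract does not describe the actual partitioning rule, so the code below is only an assumption: it splits a number of grid planes along one axis into contiguous, near-equal ranges, one per GPU, which balances the workload if every plane costs about the same. The function name `decompose_1d` is hypothetical.

```python
def decompose_1d(num_planes, num_gpus):
    """Split num_planes into num_gpus contiguous half-open ranges [start, end).

    Range sizes differ by at most one plane (assumes uniform cost per plane).
    """
    base, extra = divmod(num_planes, num_gpus)
    ranges = []
    start = 0
    for rank in range(num_gpus):
        # The first `extra` ranks take one additional plane each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# 10 planes over 4 GPUs: two GPUs get 3 planes, two get 2.
partition = decompose_1d(10, 4)
```

In a multi-GPU solver of this kind, each range would typically also exchange boundary (halo) planes with its neighbours, which is the communication the abstract proposes to overlap with computation.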
- Research Article
74
- 10.1016/j.imu.2017.08.001
- Jan 1, 2017
- Informatics in Medicine Unlocked
Survey of using GPU CUDA programming model in medical image analysis
- Book Chapter
1
- 10.1007/978-981-19-6970-6_39
- Jan 1, 2023
Most Computational Fluid Dynamics (CFD) applications solve the Navier-Stokes equations using various discretization methods such as the Finite Difference Method (FDM), Finite Element Method (FEM), Finite Volume Method (FVM), etc. The most compute-intensive process in CFD algorithms is the pressure Poisson solver. Solving the pressure Poisson equation requires the solution of a set of simultaneous linear equations. As the problem size and complexity increase, so do the effort and time invested in solving the equations. This study presents fast and robust iterative solvers and their acceleration using the Graphics Processing Unit (GPU) architecture. Serial versions of the codes are implemented in the C language, and the parallelization on the GPU is achieved using Compute Unified Device Architecture (CUDA). For this study, a two-dimensional heat conduction problem is considered. FDM is used for discretization. The iterative solvers, Conjugate Gradient and Multigrid methods, have been implemented in serial and in parallel. The performance enhancement of a combined single-GPU and CPU system over a single CPU, in terms of computing time, is reported. It is seen that multigrid algorithms are superior in convergence, and a speed-up of nearly 13 times has been obtained through GPU acceleration using CUDA for a grid size of 2048 × 2048. The tests have been done on a cluster with an Intel(R) Xeon(R) E5-2670 CPU and an NVIDIA K10 GPU.
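To make the Conjugate Gradient solver named in this abstract concrete, here is a small serial sketch in Python. It is not the study's C or CUDA code. It assumes a steady-state heat conduction problem on an n × n interior grid with zero boundary temperature, discretized with the standard five-point finite-difference stencil, and all function names are hypothetical.

```python
def apply_stencil(u, n):
    """Matrix-free five-point stencil: 4*u[i][j] minus the four neighbours.

    Neighbours outside the grid are the zero boundary and contribute nothing.
    """
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 4.0 * u[i][j]
            if i > 0:
                s -= u[i - 1][j]
            if i < n - 1:
                s -= u[i + 1][j]
            if j > 0:
                s -= u[i][j - 1]
            if j < n - 1:
                s -= u[i][j + 1]
            out[i][j] = s
    return out


def dot(a, b):
    """Inner product of two grids."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def axpy(alpha, x, y):
    """Return alpha * x + y, element by element."""
    return [[alpha * xv + yv for xv, yv in zip(rx, ry)] for rx, ry in zip(x, y)]


def conjugate_gradient(b, n, tol=1e-10, max_iter=1000):
    """Solve A x = b, where A is the stencil above (symmetric positive definite)."""
    x = [[0.0] * n for _ in range(n)]
    r = [row[:] for row in b]  # residual b - A x for the initial guess x = 0
    p = [row[:] for row in r]  # first search direction
    rs_old = dot(r, r)
    iterations = 0
    while iterations < max_iter and rs_old ** 0.5 > tol:
        Ap = apply_stencil(p, n)
        alpha = rs_old / dot(p, Ap)       # step length along p
        x = axpy(alpha, p, x)             # x <- x + alpha * p
        r = axpy(-alpha, Ap, r)           # r <- r - alpha * A p
        rs_new = dot(r, r)
        p = axpy(rs_new / rs_old, p, r)   # p <- r + beta * p
        rs_old = rs_new
        iterations += 1
    return x, iterations
```

Every loop body here (the stencil, the inner products, the vector updates) acts on grid points independently or as a reduction, which is what makes the method a natural candidate for one-thread-per-point GPU kernels.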
- Research Article
9
- 10.1109/tcbb.2018.2814570
- Mar 12, 2018
- IEEE/ACM transactions on computational biology and bioinformatics
In computational biology, the hierarchy of biological systems requires the development of flexible and powerful computational tools. Graphics processing unit (GPU) architecture has been a suitable device for parallel computing in simulating multi-cellular systems. However, in modeling complex biological systems, scientists often face two tasks, mathematical formulation and skillful programming. In particular, specific programming skills are needed for GPU programming. Therefore, the development of an easy-to-use computational architecture, which utilizes GPU for parallel computing and provides intuitive interfaces for simple implementation, is needed so that general scientists can perform GPU simulations without knowing much about the GPU architecture. Here, we introduce ParaCells, a cell-centered GPU simulation architecture for NVIDIA compute unified device architecture (CUDA). ParaCells was designed as a versatile architecture that connects the user logic (in C++) with NVIDIA CUDA runtime and is specific to the modeling of multi-cellular systems. An advantage of ParaCells is its object-oriented model declaration, which allows it to be widely applied to many biological systems through the combination of basic biological concepts. We test ParaCells with two applications. Both applications are significantly faster when compared with sequential as well as parallel OpenMP and OpenACC implementations. Moreover, the simulation programs based on ParaCells are cleaner and more readable than other versions.
- Conference Article
11
- 10.2991/isca-13.2013.51
- Jan 1, 2013
This paper defines an Out Of Play model based on a Markov Decision Process. The best path for playing can be found and recommended by using this model, and a value iteration algorithm for Markov Decision Processes is used to implement the model. In this paper, the implementation of this model on the CPU is presented. Then, in order to improve the performance of the value iteration algorithm, a parallel value iteration algorithm on the GPU is designed and shown. For calculations on a large amount of data, the experimental results show that the parallel value iteration algorithm on the GPU performs far better than the serial value iteration algorithm on the CPU.
Introduction
The Graphics Processing Unit (GPU) attracts more and more attention in general-purpose computing with the development of graphics hardware. The GPU is now not only used for graphics; it is also considered a powerful technique for obtaining inexpensive, high-performance parallelism [1, 2]. A general-purpose GPU is a highly parallel, multithreaded, many-core processor with very high computational power and memory bandwidth [3]. The GPU architecture is designed for massively parallel computing. Because of this architectural difference, a GPU is in general more advantageous for large-scale parallel data processing applications than general-purpose CPUs [4], and in the high-performance computing community, leveraging a GPU can yield performance increases of several orders of magnitude [5, 6]. High-performance computing with GPUs is called GPU computing [7]. Using the GPU for parallel computing will therefore become a new focus for speeding up calculation. A Markov decision process is a stochastic dynamic system based on the theory of Markov processes and decision-making processes. In a Markov decision process, the ultimate goal is to find an action for every state so that the performance of the system is the best.
In this paper, in order to reach higher performance, a parallel implementation on the GPU is given, and OpenCL is selected as the programming framework.
Markov Decision Process
Markov Model. A Markov decision process [8][9] can be defined as a four-tuple (S, A, T, R). In the tuple, S is a finite set of states and A is a finite set of actions; T is the transition probability distribution (T(s, a, s') means the probability of transition from state s to state s' by taking action a); R is the reward function, and R(s, a, s') means the reward received when taking action a from state s to state s'. An MDP also has a parameter γ, a discount factor used to reduce the interference from future actions.
Value Iteration Algorithm. Many algorithms have been proposed for solving the Markov model. In this paper, the value iteration algorithm for MDPs is selected for study. The main idea of the value iteration algorithm [8] is iteration: the target is to find the optimal policy via iteration toward the optimal value function. First, we give an initial value function for every state; then, in every iteration, we update the value function of every state to a next value function, until a stopping condition is satisfied. Fig. 1 [8] is the pseudocode of the value iteration algorithm.
International Conference on Information Science and Computer Applications (ISCA 2013) © 2013. The authors. Published by Atlantis Press
Fig. 1: The pseudocode of the value iteration algorithm
Out of Play Model. Based on the Markov model and the value iteration algorithm, this paper gives an MDP model, the Out Of Play model. The model can be described as follows: we go out to play, but we cannot decide where to go, or we do not know which mode of transportation (how to go: bus or walk) to choose to reach the destination. In fact, this can be described as an MDP. In this scenario, choosing where to go and how to go are random and uncertain, and from the uncertain destination and mode of transportation an MDP model can be created.
In the model, the different destinations are the finite set of states, and the different modes of transportation are the finite set of actions. Every state has an initial reward which represents the fun level. Every state can take different actions to get to another state, and these transitions follow a probability distribution. In the model, a satisfactory optimal path must be found. Fig. 2 is the whole system model.
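The value iteration procedure described in this excerpt can be sketched as follows. This is a serial Python illustration under stated assumptions, not the paper's OpenCL implementation and not a reproduction of its Fig. 1: the symbols passed as `T`, `R`, and `gamma`, the stopping threshold `eps`, and the tiny two-destination model at the end are all hypothetical choices made for the example.

```python
def q_value(s, a, V, T, R, gamma):
    """Expected discounted return of taking action a in state s."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2])
               for s2, p in T[(s, a)].items())


def value_iteration(states, actions, T, R, gamma, eps=1e-8):
    """Return (V, policy) for a finite MDP.

    actions[s] lists the actions available in s; T[(s, a)] maps each
    next state to its probability; R[(s, a, s2)] is the reward.
    """
    V = {s: 0.0 for s in states}  # initial value function
    while True:
        # Every state is updated from the previous V only, so the
        # per-state updates are independent of one another.
        new_V = {s: max(q_value(s, a, V, T, R, gamma) for a in actions[s])
                 for s in states}
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < eps:
            break
    policy = {s: max(actions[s], key=lambda a: q_value(s, a, V, T, R, gamma))
              for s in states}
    return V, policy


# Hypothetical toy model: staying in the park is worth 1 per step,
# staying home is worth 0, and walking from home reaches the park.
states = ["home", "park"]
actions = {"home": ["stay", "walk"], "park": ["stay"]}
T = {("home", "stay"): {"home": 1.0},
     ("home", "walk"): {"park": 1.0},
     ("park", "stay"): {"park": 1.0}}
R = {("home", "stay", "home"): 0.0,
     ("home", "walk", "park"): 0.0,
     ("park", "stay", "park"): 1.0}
V, policy = value_iteration(states, actions, T, R, gamma=0.5)
# V["park"] converges to 1 / (1 - 0.5) = 2 and V["home"] to 0.5 * 2 = 1,
# so the recommended action at home is "walk".
```

Because each state's new value depends only on the previous iteration's values, the inner update maps naturally to one GPU work-item per state, which is presumably the kind of parallelism the paper exploits.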
- Supplementary Content
3
- 10.15480/882.1184
- Jan 1, 2014
- tub.dok (Hamburg University of Technology)
The Graphics Processing Unit (GPU) is a highly parallel, many-core streaming architecture that can execute hundreds of threads concurrently. The data-parallel architecture of the GPU is suitable for computation-intensive applications. In recent years, the use of GPUs for general purpose computation has increased, and a large set of problems can be tackled by mapping them onto GPUs. The CUDA programming model makes it possible to write C-like programs with some extensions, which helps programmers use the graphics API efficiently. Alignment is the fundamental operation used to compare biological sequences and thereby identify regions of similarity that may be consequences of structural, functional, or evolutionary relationships. Multiple sequence alignment is an important tool for the simultaneous alignment of three or more sequences. Efficient heuristics exist to cope with this problem. In the thesis, progressive alignment methods and their parallel implementation on GPUs are studied. More specifically, the dynamic programming algorithms of profile-profile and profile-sequence alignment are mapped onto the GPU. Wavefront and matrix-matrix product techniques, which can deal well with the data dependencies, are discussed. The performance of these methods is analyzed. Simulations show that one order of magnitude of speed-up over the serial version can be achieved. ClustalW is the most widely used progressive sequence alignment method; it aligns more closely related sequences first and then gradually adds more divergent sequences. It consists of three stages: distance matrix calculation, guide tree compilation, and progressive alignment. In this work, the efficient mapping of the alignment stage onto the GPU using a combination of wavefront and matrix-matrix product techniques has been studied. In the hidden Markov model, the Viterbi algorithm is used to find the most probable sequence of hidden states that has generated the observation.
In the thesis, the parallelism exhibited by the compute-intensive tasks is studied, and a parallel solution based on the matrix-matrix product method on the GPU is devised. Moreover, the opportunity to use the optimized BLAS library provided by CUDA is explored. Finally, the performance obtained by fixing the number of states and changing the number of observations, and vice versa, is portrayed. At the end, general principles and guidelines for GPU programming of matrix-matrix product algorithms are discussed.
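The Viterbi algorithm mentioned in this abstract can be sketched in a few lines. The version below is a plain serial Python illustration, not the thesis's GPU mapping; it works in log space, and the inner step is written so that its resemblance to a matrix-vector product (with max in place of sum and addition in place of multiplication) is visible. The function name, argument layout, and the two-state example model are assumptions made for the example.

```python
import math


def viterbi(obs, num_states, log_start, log_trans, log_emit):
    """Return (most probable state path, its log-probability).

    log_trans[i][j]: log P(next state j | state i)
    log_emit[j][o]:  log P(observation o | state j)
    """
    states = range(num_states)
    # delta[j]: best log-probability of any path that ends in state j
    delta = [log_start[j] + log_emit[j][obs[0]] for j in states]
    back = []  # back[t][j]: best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = delta
        delta, pointers = [], []
        for j in states:
            # "Row times vector" in the max-plus sense: the best
            # predecessor i maximizes prev[i] + log_trans[i][j].
            best_i = max(states, key=lambda i: prev[i] + log_trans[i][j])
            delta.append(prev[best_i] + log_trans[best_i][j] + log_emit[j][o])
            pointers.append(best_i)
        back.append(pointers)
    # Backtrack from the best final state.
    last = max(states, key=lambda j: delta[j])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical two-state model: state 0 = healthy, state 1 = fever;
# observations 0 = normal, 1 = cold, 2 = dizzy.
def log_table(rows):
    return [[math.log(p) for p in row] for row in rows]

log_start = [math.log(0.6), math.log(0.4)]
log_trans = log_table([[0.7, 0.3], [0.4, 0.6]])
log_emit = log_table([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
path, log_prob = viterbi([0, 1, 2], 2, log_start, log_trans, log_emit)
# path is [0, 0, 1] and exp(log_prob) is 0.01512
```

For a fixed step, the computation for each target state j is independent of the others, which is the parallelism a matrix-product formulation exposes.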
- Research Article
4
- 10.12694/scpe.v11i4.663
- Jan 1, 2010
- Scalable Computing Practice and Experience
CUDA by Example: An Introduction to General-Purpose GPU Programming Jason Sanders and Edward Kandrot ISBN-13: 978-0131387683 Addison-Wesley Professional; 1 edition (July 29, 2010)
Introduction
This book is designed for readers who are interested in learning how to develop general parallel applications on the graphics processing unit (GPU) using CUDA C. CUDA C is a programming language that combines the industry-standard C language with additional features that exploit the CUDA architecture. With a proper introduction to NVIDIA's CUDA architecture and an in-depth explanation of how to set up the development environment, this is an easy-to-read, easy-to-understand, hands-on book. Readers are assumed to have at least a background in the C language. Through this book, readers will not only gain experience in CUDA C development, but will also learn a lot about the underlying hardware, which in turn can help software developers build more efficient and effective applications.
Outline of the Book
This book is very well organized. Each chapter consists of a general introduction, chapter objectives, and a chapter review. Sanders and Kandrot are senior software engineers in the CUDA Platform group and the CUDA Algorithm team at NVIDIA, respectively. The first chapter gives readers background on the history of the GPU and the CUDA architecture. Special features of the CUDA architecture enable the GPU to perform general purpose computation in addition to traditional graphics computation. Readers can easily understand the benefit of the CUDA architecture by reading through three different applications, ranging from the medical field to the environmental field. In Chapter 2, Sanders and Kandrot equip readers with complete lists of the hardware and software needed to run CUDA C applications. All software can be downloaded for free from websites suggested by the authors.
Then, with a familiar Hello World program in Chapter 3, the authors show that CUDA C is fundamentally standard C with additional features that allow the application developer to specify which code runs on the device (the GPU and its memory) and which on the host (the CPU and system memory). With the proper background in place, the use of CUDA C to run parallel programs on the GPU is discussed from Chapter 4 to Chapter 7. In Chapter 8, the authors illustrate how to combine rendering and general purpose computation using CUDA C. Readers without a background in OpenGL or DirectX can skip this chapter and go to the next. However, this chapter is a great addition to the book since it gives readers a complete view of CUDA C. Even though CUDA C makes complicated single-threaded applications easier to handle through parallel processing, there are situations in which special care is needed when a simple single-threaded application is implemented on a massively parallel architecture; Chapter 9 discusses this topic. The parallelism discussed in the preceding chapters refers to parallel execution of a function on different sets of data; in Chapter 10, readers are exposed to a different class of parallelism on the GPU, in which two or more completely independent tasks are performed in parallel. Chapter 11 covers how to develop CUDA C applications on multiple GPUs. For further study, Chapter 12 presents more tools to aid CUDA C development and more resources to take the reader's CUDA C development skills to another level.
Summary
Jason Sanders and Edward Kandrot wrote this book in a way that is very easy to read and follow. Their great sense of humor is also present throughout the book. Reading this book is not only a discovery of CUDA C but also a joyful journey. It is highly recommended for Computer Science students who are interested in learning CUDA C application development.
This book is recommended for adoption as a textbook for undergraduate students studying parallel programming. Jie Cheng, University of Hawaii Hilo
- Research Article
92
- 10.1016/j.cie.2018.12.067
- Dec 29, 2018
- Computers & Industrial Engineering
Accelerating genetic algorithms with GPU computing: A selective overview
- Conference Article
15
- 10.1109/ispass.2015.7095803
- Mar 1, 2015
Graphics processing units (GPUs) continue to grow in popularity for general-purpose, highly parallel, high-throughput systems. This has forced GPU vendors to increase their focus on general purpose workloads, sometimes at the expense of the graphics-specific workloads. Using GPUs for general-purpose computation is a departure from the driving forces behind programmable GPUs that were focused on a narrow subset of graphics rendering operations. Rather than focus on purely graphics-related or general-purpose use, we have designed and modeled an architecture that optimizes for both simultaneously to efficiently handle all GPU workloads. In this paper, we present Nyami, a co-optimized GPU architecture and simulation model with an open-source implementation written in Verilog. This approach allows us to more easily explore the GPU design space in a synthesizable, cycle-precise, modular environment. An instruction-precise functional simulator is provided for co-simulation and verification. Overall, we assume a GPU may be used as a general-purpose GPU (GPGPU) or a graphics engine and account for this in the architecture’s construction and in the options and modules selectable for synthesis and simulation. To demonstrate Nyami’s viability as a GPU research platform, we exploit its flexibility and modularity to explore the impact of a set of architectural decisions. These include sensitivity to cache size and associativity, barrel and switch-on-stall multithreaded instruction scheduling, and software vs. hardware implementations of rasterization. Through these experiments, we gain insight into commonly accepted GPU architecture decisions, adapt the architecture accordingly, and give examples of the intended use as a GPU research tool.
- Supplementary Content
1
- 10.17638/03089482
- Jun 4, 2020
- University of Liverpool
The Graphical Processing Unit is a specialised piece of hardware that contains many low powered cores, available on both the consumer and industrial market. The original Graphical Processing Units were designed for processing high quality graphical images, for presentation to the screen, and were therefore marketed to the computer games market segment. More recently, frameworks such as CUDA and OpenCL allowed the specialised highly parallel architecture of the Graphical Processing Unit to be used for not just graphical operations, but for general computation. This is known as General Purpose Programming on Graphical Processing Units, and it has attracted interest from the scientific community, looking for ways to exploit this highly parallel environment, which was cheaper and more accessible than the traditional High Performance Computing platforms, such as the supercomputer. This interest in developing algorithms that exploit the parallel architecture of the Graphical Processing Unit has highlighted the need for scientists to be able to analyse proposed algorithms, just as happens for proposed sequential algorithms. In this thesis, we study the abstract modelling of computation on the Graphical Processing Unit, and the application of Graphical Processing Unit-based algorithms in the field of bioinformatics, the field of using computational algorithms to solve biological problems. We show that existing abstract models for analysing parallel algorithms on the Graphical Processing Unit are not able to sufficiently and accurately model all that is required. We propose a new abstract model, called the Abstract Transferring Graphical Processing Unit Model, which is able to provide analysis of Graphical Processing Unit-based algorithms that is more accurate than existing abstract models. It does this by capturing the data transfer between the Central Processing Unit and the Graphical Processing Unit. 
We demonstrate the accuracy and applicability of our model with several computational problems, showing that our model provides greater accuracy than the existing models and verifying these claims with experiments. We also contribute novel Graphics Processing Unit-based solutions to two bioinformatics problems, DNA sequence alignment and protein spectral identification, demonstrating promising levels of improvement over the sequential Central Processing Unit experiments.
- Research Article
29
- 10.1016/j.compstruc.2015.03.005
- Apr 11, 2015
- Computers & Structures
An explicit dynamics GPU structural solver for thin shell finite elements
- Conference Article
2
- 10.1117/12.872514
- Jan 23, 2011
- Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE
Graphics Processing Unit (GPU) architectures are widely used for resource-intensive computation. Initially dedicated to imaging, vision, and graphics, these architectures nowadays serve a wide range of multi-purpose applications. The GPU structure, however, does not suit all applications, which can lead to performance shortfalls. The aim of this work is to analyze GPU structures for image analysis applications in multispectral to ultraspectral imaging. The algorithms used for the experiments are multispectral and hyperspectral imaging algorithms dedicated to art authentication. Such algorithms process a large amount of spatial and spectral data, and require both a high number of memory accesses and a high storage capacity. Timing performance is compared with that of a CPU architecture, and a global analysis is made according to the algorithms and the GPU architecture. This paper shows that GPU architectures are suitable for complex image analysis algorithms in multispectral imaging.
- Book Chapter
2
- 10.1049/pbpc022e_ch4
- Jun 3, 2019
This chapter introduces a new resource virtualization framework, Zorua, that decouples the graphics processing unit (GPU) programming model from the management of key on-chip resources in hardware to enhance programming ease, portability, and performance. The application resource specification (a static specification of several parameters such as the number of threads and the scratchpad memory usage per thread block) forms a critical component of the existing GPU programming models. This specification determines the parallelism, and, hence, performance of the application during execution because the corresponding on-chip hardware resources are allocated and managed purely based on this specification. This tight coupling between the software-provided resource specification and resource management in hardware leads to significant challenges in programming ease, portability, and performance, as we demonstrate in this chapter using real data obtained on state-of-the-art GPU systems. Our goal in this work is to reduce the dependence of performance on the software-provided static resource specification to simultaneously alleviate the above challenges. To this end, we introduce Zorua, a new resource virtualization framework, that decouples the programmer-specified resource usage of a GPU application from the actual allocation in the on-chip hardware resources. Zorua enables this decoupling by virtualizing each resource transparently to the programmer. The virtualization provided by Zorua builds on two key concepts: dynamic allocation of the on-chip resources and their oversubscription using a swap space in memory.
Zorua provides a holistic GPU resource virtualization strategy designed to (i) adaptively control the extent of oversubscription and (ii) coordinate the dynamic management of multiple on-chip resources to maximize the effectiveness of virtualization. We demonstrate that by providing the illusion of more resources than physically available via controlled and coordinated virtualization, Zorua offers several important benefits: (i) Programming ease. It eases the burden on the programmer to provide code that is tuned to efficiently utilize the physically available on-chip resources. (ii) Portability. It alleviates the necessity of retuning an application's resource usage when porting the application across GPU generations. (iii) Performance. By dynamically allocating resources and carefully oversubscribing them when necessary, Zorua improves or retains the performance of applications that are already highly tuned to best utilize the resources. The holistic virtualization provided by Zorua has many other potential uses, e.g., fine-grained resource sharing among multiple kernels, low latency preemption of GPU programs, and support for dynamic parallelism, which we describe in this chapter.
- Conference Article
40
- 10.1109/apccas.2012.6419068
- Dec 1, 2012
The GPU (Graphics Processing Unit) has had a great impact on the computing field. To enhance the performance of computing systems, researchers and developers use the parallel computing architecture of the GPU. In addition, to reduce the development time of new products, two programming models are available for the GPU: OpenCL (Open Computing Language) and CUDA (Compute Unified Device Architecture). The benefit of these two programming models is that researchers and developers do not have to understand OpenGL, DirectX, or other graphics programming approaches, but can use the GPU through a simple programming language. OpenCL is an open standard API, which has the advantage of being cross-platform. CUDA is a parallel computing architecture developed by NVIDIA, which includes a Runtime API and a Driver API. Compared with OpenCL, CUDA offers better performance. In this paper, we used many similar kernels to compare the computing performance of C and of the two kinds of API, OpenCL and CUDA, on an NVIDIA Quadro 4000 GPU. The experimental results showed that the CUDA Driver API executed 94.9%∼99.0% faster than C and 3.8%∼5.4% faster than OpenCL. Accordingly, the cross-platform characteristic of OpenCL did not affect the performance of the GPU.