CUDA Documentation and Libraries
This appendix lists the most important of the many CUDA support documents, together with the libraries available from the NVIDIA website and elsewhere.
- Research Article
6
- 10.1016/j.cpc.2018.09.009
- Sep 24, 2018
- Computer Physics Communications
VOLSCAT2.0: The new version of the package for electron and positron scattering off molecular targets
- Conference Article
33
- 10.1109/hotchips.2008.7476520
- Aug 1, 2008
Presents a collection of slides covering the following: NVIDIA CUDA; CUDA toolkit; CUDA libraries; closely coupled CPU-GPU; CUDA many-core and multi-core support; nvcc CUDA compiler; CUBLAS; and CUFFT.
- Conference Article
17
- 10.1109/mue.2008.94
- Jan 1, 2008
The power of the graphics processing unit (GPU) has been increasing more rapidly than that of the CPU. It is not surprising that many software libraries have been developed that let us harness the GPU for general computation, especially parallel data processing. In this paper, we propose implementations of ARIA, the standard block cipher of Korea, on the GPU using the OpenGL and CUDA libraries. Since ARIA was announced only four years ago, there is as yet no hardware solution providing high-speed ARIA encryption. We use the GPU as a parallel processor with several grid structures and optimize the encryption speed and the occupancy of shared memory. As a result, when ARIA runs on a GeForce 8800GTS using the CUDA library, the encryption speed reaches 4.8 Gbps, the fastest publicly known implementation of ARIA.
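The speedup reported in the ARIA paper above comes from the block-level data parallelism of ECB-mode encryption: every 16-byte block is independent, so each block can be assigned to one GPU thread. The sketch below illustrates only that pattern in plain Python; the toy XOR-and-rotate round function is a stand-in for the real ARIA rounds (which use substitution and diffusion layers), and all names here are illustrative.

```python
BLOCK_BYTES = 16

def toy_round(block, round_key):
    # Stand-in for one ARIA round: XOR with the round key, then rotate left.
    mixed = bytes(b ^ k for b, k in zip(block, round_key))
    return mixed[1:] + mixed[:1]

def toy_round_inv(block, round_key):
    # Inverse round: rotate right, then XOR with the round key.
    unrotated = block[-1:] + block[:-1]
    return bytes(b ^ k for b, k in zip(unrotated, round_key))

def encrypt_block(block, round_keys):
    for rk in round_keys:
        block = toy_round(block, rk)
    return block

def decrypt_block(block, round_keys):
    for rk in reversed(round_keys):
        block = toy_round_inv(block, rk)
    return block

def encrypt_ecb(plaintext, round_keys):
    # In ECB mode each block is independent: on a GPU this loop becomes
    # one thread per block, with round keys in shared/constant memory.
    assert len(plaintext) % BLOCK_BYTES == 0
    return b"".join(encrypt_block(plaintext[i:i + BLOCK_BYTES], round_keys)
                    for i in range(0, len(plaintext), BLOCK_BYTES))

def decrypt_ecb(ciphertext, round_keys):
    return b"".join(decrypt_block(ciphertext[i:i + BLOCK_BYTES], round_keys)
                    for i in range(0, len(ciphertext), BLOCK_BYTES))
```

Because the blocks never interact, the mapping onto a CUDA grid is trivial; the paper's remaining effort goes into shared-memory occupancy, which has no counterpart in this CPU sketch.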
- Research Article
51
- 10.1016/j.cpc.2009.07.009
- Jul 25, 2009
- Computer Physics Communications
SCELib3.0: The new revision of SCELib, the parallel computational library of molecular properties in the Single Center Approach
- Conference Article
- 10.1109/sccc.2018.8705226
- Nov 1, 2018
This paper describes the implementation of an FEM acoustic application on a GPU using C/C++ and the CUDA libraries. The acoustic model is a rigid-walled cavity with enclosed fluid and rectangular faces. Natural frequencies were computed from the inertia and stiffness matrices in a generalized eigenvalue problem. These matrices are symmetric and dense, and they grow cubically with the number of divisions in the grid. The model was implemented using the cuSOLVER libraries to solve the eigenvalue problem, and a MATLAB implementation was run on the CPU to provide a baseline for comparison. The GPU-based Jacobi method in single precision gives the best results: it is five times faster than the MATLAB implementation. The divide-and-conquer method in double precision on the GPU is the most accurate when compared with the exact solution of the model. Lastly, the sound-pressure distribution in the cavity was graphed using the eigenvectors.
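The "GPU-based Jacobi method" this paper relies on is cuSOLVER's Jacobi symmetric eigensolver (the syevj family). As a minimal CPU sketch of the underlying classical Jacobi iteration, not the paper's code, each step applies a plane rotation that zeroes the largest off-diagonal entry of a symmetric matrix; the GPU variant applies many independent rotations in parallel.

```python
import math

def jacobi_eigenvalues(a, max_rotations=100, tol=1e-12):
    """Classical Jacobi eigenvalue iteration for a symmetric matrix
    (a list of lists). Each rotation zeroes the largest off-diagonal
    entry; cuSOLVER's syevj performs many such rotations in parallel."""
    n = len(a)
    a = [row[:] for row in a]  # work on a copy
    for _ in range(max_rotations):
        # Locate the largest off-diagonal element.
        val, p, q = max((abs(a[i][j]), i, j)
                        for i in range(n) for j in range(i + 1, n))
        if val < tol:
            break
        # Rotation angle that zeroes a[p][q] after the similarity transform.
        theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(n):  # rotate rows p and q
            apk, aqk = a[p][k], a[q][k]
            a[p][k] = c * apk - s * aqk
            a[q][k] = s * apk + c * aqk
        for k in range(n):  # rotate columns p and q
            akp, akq = a[k][p], a[k][q]
            a[k][p] = c * akp - s * akq
            a[k][q] = s * akp + c * akq
    return sorted(a[i][i] for i in range(n))
```

For the dense symmetric matrices the paper describes, the O(n^2) search for the pivot and the O(n) rotations are exactly the parts a GPU parallelizes well.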
- Conference Article
- 10.1109/sbac-padw.2014.22
- Oct 1, 2014
Atomic Force Microscopy (AFM) is a scanning-probe technique widely used to produce nanometre-scale images of virtually any kind of non-conductive or biological surface. Depending on the scanning dimensions, the expected AFM image structure can be corrupted by large amounts of external noise, electrical or mechanical (low signal-to-noise ratios), and/or blurred by the geometry of the measuring probe. Image restoration techniques can be employed to minimize such effects. The one described here is based on minimizing Tikhonov's regularization functional, taking into account the characteristics of the measuring probe and the S/N ratio. This work proposes optimizations of both the serial and the parallel restoration algorithms, using the CUDA library on a general-purpose graphics processing unit (GPGPU), in terms of time performance and restoration quality in the high-speed imaging regime of one frame per second or more. The results obtained so far are very promising, reaching speedups of up to 43x over previous implementations.
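In its simplest discrete form, Tikhonov restoration of the kind used in the AFM paper reduces to a ridge-regularized least-squares problem: minimize ||Ax - b||^2 + lam*||x||^2, where the probe geometry enters through the blur matrix A and the S/N ratio guides the choice of lam. The sketch below is illustrative only (real AFM restoration works on full images, typically with FFT-based operators, not tiny dense matrices):

```python
def tikhonov_restore(A, b, lam):
    """Solve min ||A x - b||^2 + lam * ||x||^2 via the normal equations
    (A^T A + lam I) x = A^T b, using naive Gaussian elimination.
    A: m x n list of lists (blur matrix), b: length-m list (blurred data)."""
    m, n = len(A), len(A[0])
    # Build M = A^T A + lam I and rhs = A^T b.
    M = [[sum(A[k][i] * A[k][j] for k in range(m)) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    rhs = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n):
                M[r][c] -= f * M[col][c]
            rhs[r] -= f * rhs[col]
    # Back-substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (rhs[i] - sum(M[i][j] * x[j]
                             for j in range(i + 1, n))) / M[i][i]
    return x
```

Larger lam damps noise amplification at the cost of shrinking the solution toward zero, which is the trade-off the paper tunes against the measured S/N ratio.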
- Research Article
13
- 10.4258/hir.2019.25.4.344
- Oct 1, 2019
- Healthcare Informatics Research
Objectives: Human motion analysis can be applied to the diagnosis of musculoskeletal diseases, rehabilitation therapies, fall detection, and estimation of energy expenditure. To analyze human motion from micro-Doppler signatures measured by radar, deep learning is one of the most effective approaches. Because deep learning requires a large data set, the high cost of measuring large amounts of human data is an intrinsic problem. The objective of this study is to augment human motion micro-Doppler data using generative adversarial networks (GANs) to improve the accuracy of human motion classification. Methods: To test the data augmentation provided by GANs, authentic data for 7 human activities were collected using micro-Doppler radar. Each motion yielded 144 data samples. Software including the GPU driver, CUDA library, cuDNN library, and Anaconda was installed to train the GANs. Keras-GPU, SciPy, Pillow, OpenCV, Matplotlib, and Git were used to create an Anaconda environment. The data produced by the GANs were saved every 300 epochs, and training was stopped at 3,000 epochs. The images generated at each epoch were evaluated, and the best images were selected. Results: Each data set of micro-Doppler signatures, consisting of 144 data samples, was augmented to produce 1,472 synthesized 64 × 64 spectrograms. Using the augmented spectrograms, a deep neural network was trained, increasing the accuracy of human motion classification. Conclusions: Data augmentation to increase the amount of training data was successfully conducted through the use of GANs. Thus, augmented micro-Doppler data can contribute to improving the accuracy of human motion recognition.
- Book Chapter
4
- 10.1007/978-3-319-41956-5_29
- Jul 10, 2016
Atmospheric radionuclide dispersion systems (ARDS) are important tools for predicting the impact of radioactive releases from nuclear power plants and guiding the evacuation of people from affected areas. To predict the dispersion of radioactive material and its environmental consequences, an ARDS processes information about the source term (the nuclear material released), weather conditions, and geographical features. An ARDS basically comprises four modules: Source Term, Wind Field, Plume Dispersion, and Dose Calculations. The Wind Field and Plume Dispersion modules are the most computationally expensive, requiring high-performance computing to achieve adequate precision in acceptable time. This work focuses on the development of a GPU-based parallel Wind Field module. The program, based on the Wind Extrapolated from Stability and Terrain (WEST) model, is under development in C++ with the CUDA libraries. In a comparative case study between parallel and sequential calculations, a speedup of 40 times was observed.
- Research Article
5
- 10.1007/978-1-0716-0947-7_2
- Jan 1, 2021
- Methods in molecular biology (Clifton, N.J.)
We present SNPInt-GPU, a software package providing several methods for statistical epistasis testing. SNPInt-GPU supports GPU acceleration using the Nvidia CUDA framework but can also be used without GPU hardware. The software implements logistic regression (as in PLINK epistasis testing), BOOST, log-linear regression, mutual information (MI), and information gain (IG) for pairwise testing, as well as mutual information and information gain for third-order tests. Optionally, r2 scores for linkage disequilibrium (LD) testing can be calculated on the fly. SNPInt-GPU is publicly available on GitHub. The software requires a Linux-based operating system and the CUDA libraries. This chapter gives detailed installation and usage instructions as well as examples of basic preliminary quality control and analysis of results.
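As an illustration of one of the statistics SNPInt-GPU computes (not its API or its CUDA kernels), empirical mutual information between two discrete sequences, such as genotype codes 0/1/2 for a SNP versus case/control status, can be written directly from the observed joint distribution:

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information (in bits) between two equal-length
    discrete sequences, e.g. SNP genotype codes vs. phenotype labels."""
    n = len(x)
    px, py = Counter(x), Counter(y)          # marginal counts
    pxy = Counter(zip(x, y))                 # joint counts
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) written with raw counts: c*n / (px[a]*py[b])
        mi += p_ab * math.log2(c * n / (px[a] * py[b]))
    return mi
```

The GPU version's advantage is throughput: pairwise testing over hundreds of thousands of SNPs means billions of such small, independent computations, an ideal data-parallel workload.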
- Research Article
- 10.30970/eli.16.1
- Jan 1, 2021
- Electronics and Information Technologies
This work investigates the use of convolutional neural networks to detect synthesized speech. The software application was built with the Python programming language, the TensorFlow library in combination with the high-level Keras API, and the ASVspoof 2019 audio database in FLAC format. Voice signals of synthesized and natural speech were converted into mel-frequency spectrograms, and a convolutional neural network structure with high recognition accuracy is proposed. The training speed of the neural networks on GPU and CPU is compared using the CUDA library, and the influence of the batch-size parameter on network accuracy is investigated. The TensorBoard tool was used to monitor and profile the training process. Keywords: audio deepfake, mel-frequency sound spectrograms, convolutional neural networks, learning speed of neural networks.
- Abstract
23
- 10.1186/1471-2202-14-s1-p38
- Jul 1, 2013
- BMC Neuroscience
Brian 2 is a fundamental rewrite of the Brian [1,2] simulator for spiking neural networks. Brian is written in the Python programming language and focuses on simplicity and extensibility: neuronal models can be described using mathematical formulae (differential equations) and with the use of physical units. Depending on the model equations, several integration methods are available, ranging from exact integration for linear differential equations to numerical integration for arbitrarily complex equations. The same formalism can also be used to specify synaptic models, allowing the user to easily define complex synapse models. Brian 2 keeps most of the syntax and functionality consistent with previous versions of Brian, but achieves more consistency and modularity as well as adding new features such as a simpler and more general new formulation of refractoriness. A consistent interface centered around human-readable descriptions using mathematical notation allows the specification of neuronal models (including complex reset, threshold and refractory conditions), synaptic models (including complex plasticity rules) and synaptic connections. Every aspect of Brian 2 has been designed with extensibility and adaptability in mind, which, for example, makes it straightforward to implement new numerical integration methods. Even though Brian 2 benefits from the ease of use and the flexibility of the Python programming language, its performance is not limited by the speed of Python: At the core of the simulation machinery Brian 2 makes use of fully automated runtime code generation [3], allowing the same model to be run in the Python interpreter, in compiled C++ code or on a GPU using CUDA libraries [4].
The code generation system is designed to be extensible to new target languages and its output can also be used on its own: for situations where high performance is necessary and/or where a Python interpreter is not available (for example for robotics applications), Brian 2 offers tools to assist in assembling the generated code into a stand-alone version that runs independently of Brian or a Python interpreter. To ensure the correctness and maintainability of the software, Brian 2 includes an extensive, full coverage test suite. Debugging of simulation scripts is supported by a configurable logging system, allowing simple monitoring of the internal details of the simulation process. Brian is made available under a free software license and all development takes place in public code repositories [5].
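The "exact integration for linear differential equations" mentioned in the Brian 2 abstract can be shown with the leaky membrane equation dv/dt = -(v - v_rest)/tau, whose one-step update has a closed form. The function names below are illustrative, not Brian's generated code:

```python
import math

def exact_step(v, dt, tau, v_rest):
    # Exact solution of dv/dt = -(v - v_rest)/tau over one time step:
    # the kind of update a code generator can emit for linear equations.
    return v_rest + (v - v_rest) * math.exp(-dt / tau)

def euler_step(v, dt, tau, v_rest):
    # Forward-Euler alternative, needed when no closed form exists.
    return v + dt * (-(v - v_rest) / tau)
```

The exact update is correct for any step size (two half-steps compose to one full step), whereas Euler's error only vanishes as dt shrinks; choosing between such methods per equation is exactly the dispatch Brian 2 automates before generating Python, C++, or CUDA code.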
- Book Chapter
9
- 10.1007/978-3-030-24209-1_12
- Jan 1, 2019
GPU-based parallelization of agent-based modeling (ABM) has been highlighted for the last decade to address its computational needs for scalable and long-running simulations in practical use. From the software-productivity viewpoint, model designers would prefer general ABM frameworks for GPU parallelization. However, having transitioned from single-node or cluster-computing platforms to GPUs, most general ABM frameworks maintain their APIs at the script level, delegate only a limited number of agent functions to GPUs, and copy agent data between host and device memory for each function call, which neither eases agent description nor maximizes GPU parallelism. To address these problems, we have developed the MASS (Multi-Agent Spatial Simulation) CUDA library, which allows users to describe all simulation models in CUDA C++, automates entire-model parallelization on the GPU, and minimizes host-to-device memory transfer. However, our straightforward implementation did not improve parallel performance. Focusing on data-parallel computation with the GPU, we examined MASS overheads in GPU memory usage and developed optimization techniques that reduce kernel context switches, optimize kernel configuration, use constant memory, and reduce the overheads incurred by agent population, migration, and termination. These techniques made Heat2D and SugarScape run 3.9 times and 5.8 times faster, respectively, than the corresponding sequential C++ programs. This paper details our GPU parallelization techniques for multi-agent simulation and demonstrates the MASS CUDA library's performance improvements.
- Research Article
11
- 10.1016/j.compbiomed.2021.104507
- May 21, 2021
- Computers in Biology and Medicine
Simulation of 3D centimeter-scale continuum tumor growth at sub-millimeter resolution via distributed computing
- Research Article
9
- 10.1016/j.cpc.2019.106970
- Sep 30, 2019
- Computer Physics Communications
SCELib4.0: The new program version for computing molecular properties in the Single Center Approach
- Research Article
7
- 10.1016/j.cpc.2017.03.006
- Mar 23, 2017
- Computer Physics Communications
GPU implementation of the Rosenbluth generation method for static Monte Carlo simulations