Use Of Atomic Operations Research Articles

Spatiotemporal feature extraction algorithms are widely used in many image processing and computer vision applications. They are favored because of their robust generated features. However, they have high computational complexity. Parallelizing these algorithms, in order to speed their execution up, is of great importance. In this paper, we propose new parallel implementations, using GPU computing, for the two most widely used spatiotemporal feature extraction algorithms: scale-invariant feature transform and speeded up robust features. In our implementations, we solve problems with previous parallel implementations, such as load imbalance, thread synchronization, and the use of atomic operations. Our implementations speed up the execution by simultaneously processing all the work of each stage of the two algorithms, without dividing that stage into smaller sequential ones. The allocation of the threads in our implementations further allows them to increase the occupancy of the GPU streaming multiprocessors (SMs). We compare our presented implementations to previous CPU and GPU parallel implementations of the two algorithms. Results show that the proposed implementations could do all the processing in real time with high accuracy. They further achieve higher speedup, frame rate, and SM occupancy than the previous best-known parallel implementations of the two algorithms.
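The abstract does not say how its implementations avoid atomic operations. As a hedged illustration only, one common general technique is privatization: each worker accumulates into its own private histogram (for example, a gradient-orientation histogram), and the private copies are merged afterwards in a fixed order. The C++ sketch below emulates the workers with plain loops; the function name `privatized_histogram` and its parameters are hypothetical and are not taken from the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the paper): builds a histogram from
// precomputed bin indices without any shared, atomically updated counters.
// Each "worker" owns one private histogram; on a GPU a worker would be a
// thread or thread block, here it is emulated by an ordinary loop.
std::vector<unsigned> privatized_histogram(const std::vector<std::size_t>& bins,
                                           std::size_t num_bins,
                                           std::size_t workers) {
    assert(num_bins > 0 && workers > 0);

    // One private histogram per worker, so no two workers write the same cell.
    std::vector<std::vector<unsigned>> local(
        workers, std::vector<unsigned>(num_bins, 0));

    // Split the input into contiguous chunks, one per worker.
    const std::size_t chunk = (bins.size() + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(w * chunk, bins.size());
        const std::size_t end = std::min(begin + chunk, bins.size());
        for (std::size_t k = begin; k < end; ++k) {
            assert(bins[k] < num_bins);
            ++local[w][bins[k]];
        }
    }

    // Merge the private histograms in a fixed worker order.
    std::vector<unsigned> total(num_bins, 0);
    for (std::size_t w = 0; w < workers; ++w)
        for (std::size_t b = 0; b < num_bins; ++b)
            total[b] += local[w][b];
    return total;
}
```

The trade-off in this sketch is extra memory (one histogram per worker) and a merge pass in exchange for contention-free accumulation.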

We present a scalable dissipative particle dynamics simulation code, fully implemented on Graphics Processing Units (GPUs) using a hybrid CUDA/MPI programming model, which achieves a 10–30 times speedup on a single GPU over 16 CPU cores and almost linear weak scaling across a thousand nodes. A unified framework is developed within which the efficient generation of the neighbor list and the maintenance of particle data locality are addressed. Our algorithm generates strictly ordered neighbor lists in parallel, while the construction is deterministic and makes no use of atomic operations or sorting. Such a neighbor list leads to optimal data loading efficiency when combined with a two-level particle reordering scheme. A faster in situ generation scheme for Gaussian random numbers is proposed using precomputed binary signatures. We designed custom transcendental functions that are fast and accurate for evaluating the pairwise interaction. The correctness and accuracy of the code are verified through a set of test cases simulating Poiseuille flow and spontaneous vesicle formation. Computer benchmarks demonstrate the speedup of our implementation over the CPU implementation as well as strong and weak scalability. A large-scale simulation of spontaneous vesicle formation consisting of 128 million particles was conducted to further illustrate the practicality of our code in real-world applications.

Program summary
Program title: GPU-accelerated DPD Package for LAMMPS
Catalogue identifier: AETN_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AETN_v1_0.html
Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland
Licensing provisions: GNU General Public License, version 3
No. of lines in distributed program, including test data, etc.: 1602716
No. of bytes in distributed program, including test data, etc.: 26489166
Distribution format: tar.gz
Programming language: C/C++, CUDA C/C++, MPI.
Computer: Any computers having nVidia GPGPUs with compute capability 3.0.
Operating system: Linux.
Has the code been vectorized or parallelized?: Yes. Number of processors used: 1024 16-core CPUs and 1024 GPUs
RAM: 500 Mbytes host memory, 2 Gbytes device memory
Supplementary material: The data for the examples discussed in the manuscript is available for download.
Classification: 6.5, 12, 16.1, 16.11.
Nature of problem: Particle-based simulation of mesoscale systems involving nano/micro-fluids, polymers and spontaneous self-assembly process.
Solution method: The system is approximated by a number of coarse-grained particles interacting through pairwise potentials and bonded potentials. Classical mechanics is assumed following Newton’s laws. The evolution of the system is integrated using a time-stepping scheme such as Velocity-Verlet.
Restrictions: The code runs only on CUDA GPGPUs with compute capability 3.0.
Unusual features: Fully implemented on GPGPUs with significant speedup.
Running time: 78 h using 1024 GPGPUs for simulating a 128-million-particle system for 18.4 million time steps.
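To make the idea of a deterministic, ordered neighbor list built without atomic operations or sorting more concrete, here is a minimal C++ sketch of one generic way to do it: a count pass, an exclusive prefix sum, and a fill pass in which each particle writes only into its own slice of the output. This is an assumption-laden illustration, not the algorithm from the paper; it uses a brute-force O(N²) distance test instead of a cell list, and the names `Vec3`, `NeighborList`, and `build_neighbor_list` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Compressed neighbor list: the neighbors of particle i are stored in
// neighbors[offsets[i]] .. neighbors[offsets[i + 1] - 1].
struct NeighborList {
    std::vector<std::size_t> offsets;
    std::vector<std::size_t> neighbors;
};

static bool within_cutoff(const Vec3& a, const Vec3& b, double rc) {
    const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz < rc * rc;
}

// Illustrative sketch only (brute force, no periodic boundaries).
// Each iteration of the outer loops is independent and could be run by
// its own thread, because it writes only to locations owned by particle i.
NeighborList build_neighbor_list(const std::vector<Vec3>& pos, double rc) {
    assert(rc > 0.0);
    const std::size_t n = pos.size();
    NeighborList nl;
    nl.offsets.assign(n + 1, 0);

    // Pass 1: each particle counts its own neighbors.
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t count = 0;
        for (std::size_t j = 0; j < n; ++j)
            if (j != i && within_cutoff(pos[i], pos[j], rc)) ++count;
        nl.offsets[i + 1] = count;
    }

    // Prefix sum turns the counts into start offsets for each particle.
    for (std::size_t i = 0; i < n; ++i) nl.offsets[i + 1] += nl.offsets[i];

    // Pass 2: each particle fills its own slice. Scanning j in ascending
    // order yields neighbors sorted by index without a separate sort step.
    nl.neighbors.resize(nl.offsets[n]);
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t out = nl.offsets[i];
        for (std::size_t j = 0; j < n; ++j)
            if (j != i && within_cutoff(pos[i], pos[j], rc))
                nl.neighbors[out++] = j;
    }
    return nl;
}
```

Because every write location is fixed by the prefix sum rather than claimed through a shared atomic counter, repeated runs on the same input produce the same list in the same order.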

Articles published on Use Of Atomic Operations

Multi-threaded parallel tetrahedral mesh improvement by combining atomic operation and graph coloring

Speeding up spatiotemporal feature extraction using GPU

Checking Concurrent Data Structures Under the C/C++11 Memory Model

Efficient Particle-mesh Spreading on GPUs

Accelerating dissipative particle dynamics simulations on GPUs: Algorithms, numerics and applications
