CUDA Research Articles

In recent years, the rapid development of unmanned aerial vehicle (UAV) technologies has made data acquisition increasingly convenient, and three-dimensional (3D) reconstruction has emerged as a popular subject of research in this context. These 3D models have many advantages, such as the ability to represent realistic scenes and a large amount of information. However, traditional 3D reconstruction methods are expensive, and require long and complex processing. As a result, they cannot rapidly respond when used in time-sensitive applications, e.g., those for such natural disasters as earthquakes, debris flow, etc. Computer vision-based simultaneous localization and mapping (SLAM) along with hardware development based on embedded systems, can provide a solution to this problem. Based on an analysis of the principle and implementation of the visual SLAM algorithm, this study proposes a fast method to quickly reconstruct a dense 3D point cloud model on a UAV platform combined with an embedded graphics processing unit (GPU). The main contributions are as follows: (1) to resolve the contradiction between the resource limitations and the computational complexity of visual SLAM on UAV platforms, the technologies needed to compute resource allocation, communication between nodes, and data transmission and visualization in an embedded environment were investigated to achieve real-time data acquisition and processing. Visual monitoring to this end is also designed and implemented. (2) To solve the problem of time-consuming algorithmic processing, a corresponding parallel algorithm was designed and implemented based on the parallel programming framework of the compute unified device architecture (CUDA). (3) The visual odometer and methods of 3D “map” reconstruction were designed using under a monocular vision sensor to implement the prototype of the fast 3D reconstruction system. Based on preliminary results of the 3D modeling, the following was noted: (1) the proposed method was feasible. By combining UAV, SLAM, and parallel computing, a simple and efficient 3D reconstruction model of an unknown area was obtained for specific applications. (2) The parallel SLAM algorithm used in this method improved the efficiency of the SLAM algorithm. On the one hand, the SLAM algorithm required 1/6 of the time taken by the structure-from-motion algorithm. On the other hand, the speedup obtained using the parallel SLAM algorithm based on the embedded GPU on our test platform was 7.55 × that of the serial algorithm. (3) The depth map results show that the effective pixel with an error less than 15cm is close to 60%.

Read full abstract

Nonequispaced discrete Fourier transformation (NDFT) is widely applied in all aspects of computational science and engineering. The computational efficiency and accuracy of NDFT has always been a critical issue in hindering its comprehensive applications both in intensive and in extensive aspects of scientific computing. In our previous work Yang et al. (2018), a CUNFFT method was proposed and it shown outstanding performance in handling NDFT at intermediate scale based on CUDA (Compute Unified Device Architecture) technology. In the current work, we further improved the computational efficiency of the CUNTTF method using an efficient MPI-CUDA hybrid parallelization (HP) scheme of NFFT to achieve a cutting-edge treatment of NDFT at super extended scale. Within this HP-NFFT method, the spatial domain of NDFT is decomposed into several parts according to the accumulative feature of NDFT and the detailed number of CPU and GPU nodes. These decomposed NDFT subcells are independently calculated on different CPU nodes using a MPI process-level parallelization mode, and on different GPU nodes using a CUDA thread-level parallelization mode and CUNFFT algorithm. A massive benchmarking of the HP-NFFT method indicates that this method exhibits a dramatic improvement in computational efficiency for handling NDFT at super extended scale without loss of computational precision. Furthermore, the HP-NFFT method is validated via the calculation of Madelung constant of fluorite crystal structure, and thereafter verified that this method is robust for the calculation of electrostatic interactions between charged ions in molecular dynamics simulation systems. Program summaryProgram title: HP-NFFTCPC Library link to program files:http://dx.doi.org/10.17632/ys2y92jkwy.1Licensing provisions: GNU General Public License 3Programming language: MPI, C, and CUDA CSupplementary material: The program is designed for effective computation of large-scale nonequispaced discrete Fourier transformation (NDFT), which runs on particular computers equipped with NVIDIA GPUs. It has been tested on (a) one single computer node with Intel(R) Core(TM) i7-3770 @ 3.40 GHz (CPU) and GTX 980 Ti (GPU), and (b) MPI parallel computer nodes with the same configurations.Nature of problem: For NDFT, the computation is extremely time-consuming in many domains of computational physics due to the failure to utilize FFT directly, which often affects the computational efficiency of whole system seriously. Although the parallel method CUNFFT based on GPU has achieved a qualitative leap compared with previous methods in NDFT computation, the computation capability is limited to the throughput capacity of GPU when the size of NDFT system is big enough.Solution method: We constructed a hybrid parallel architecture, in which CPU and GPU are combined to accelerate the NDFT computation effectively. Firstly, the NDFT system is divided into several subcells via domain-decomposition method. Then MPI (Message Passing Interface) is used to implement the CPU-parallel computation with each computer node corresponding to a particular subcell, and furthermore each subcell in one computer node will be executed in GPU in parallel efficiently. In this hybrid parallel method, the most critical technical problem is how to parallelize a CUNFFT in the parallel strategy, which is conquered effectively by deep-seated research of basic principles and some algorithm skills.Restrictions: The HP-NFFT is mainly oriented to, and has shown significant computational efficiency to deal with large-scale NDFT computations. However, for a small NDFT system containing less than 10^6 particles, the mode of multiple computer nodes has no apparent efficiency advantage or even lower efficiency compared with the mode of a single computer node because of the serious network delay.

Read full abstract

CUDA Research Articles

Related Topics

Articles published on CUDA

GPU Programming Productivity in Different Abstraction Paradigms

Fast Reconstruction of 3D Point Cloud Model Using Visual SLAM on Embedded UAV Development Platform

A radix sorting parallel algorithm suitable for graphic processing unit computing

Implicit discrete ordinates discontinuous Galerkin method for radiation problems on shared-memory multicore CPU/many-core GPU computation architecture

Gradient clustering algorithm based on deep learning aerial image detection

NormiRazor: tool applying GPU-accelerated computing for determination of internal references in microRNA transcription studies

Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters

A Graphics Process Unit-Based Multiple-Relaxation-Time Lattice Boltzmann Simulation of Non-Newtonian Fluid Flows in a Backward Facing Step

End to end simulators: a flexible and scalable cloud-based architecture

Visualization and Analysis of the Volume Dataset for Time‐Varing Electromagnetic Simulation

Optimizing non-coalesced memory access for irregular applications with GPU computing

A versatile smoothed particle hydrodynamics code for graphic cards

An Overview of Hardware Implementation of Membrane Computing Models

Parallel implementation of the non-overlapping template matching test using CUDA

A hybrid MPI-CUDA approach for nonequispaced discrete Fourier transformation

A parallel hybrid implementation of the 2D acoustic wave equation

Optimization of GPU parallel scheme for simulating ultrafast magnetization dynamics model

Stabilized variational formulation of an oldroyd-B fluid flow equations on a Graphic Processing Unit (GPU) architecture

Performance gains with Compute Unified Device Architecture-enabled eddy current correction for diffusion MRI.

Direct numerical simulations of turbulent channel flows with mesh-refinement lattice Boltzmann methods on GPU cluster

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

CUDA Research Articles

Related Topics

Articles published on CUDA

GPU Programming Productivity in Different Abstraction Paradigms

Fast Reconstruction of 3D Point Cloud Model Using Visual SLAM on Embedded UAV Development Platform

A radix sorting parallel algorithm suitable for graphic processing unit computing

Implicit discrete ordinates discontinuous Galerkin method for radiation problems on shared-memory multicore CPU/many-core GPU computation architecture

Gradient clustering algorithm based on deep learning aerial image detection

NormiRazor: tool applying GPU-accelerated computing for determination of internal references in microRNA transcription studies

Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters

A Graphics Process Unit-Based Multiple-Relaxation-Time Lattice Boltzmann Simulation of Non-Newtonian Fluid Flows in a Backward Facing Step

End to end simulators: a flexible and scalable cloud-based architecture

Visualization and Analysis of the Volume Dataset for Time‐Varing Electromagnetic Simulation

Optimizing non-coalesced memory access for irregular applications with GPU computing

A versatile smoothed particle hydrodynamics code for graphic cards

An Overview of Hardware Implementation of Membrane Computing Models

Parallel implementation of the non-overlapping template matching test using CUDA

A hybrid MPI-CUDA approach for nonequispaced discrete Fourier transformation

A parallel hybrid implementation of the 2D acoustic wave equation

Optimization of GPU parallel scheme for simulating ultrafast magnetization dynamics model

Stabilized variational formulation of an oldroyd-B fluid flow equations on a Graphic Processing Unit (GPU) architecture

Performance gains with Compute Unified Device Architecture-enabled eddy current correction for diffusion MRI.

Direct numerical simulations of turbulent channel flows with mesh-refinement lattice Boltzmann methods on GPU cluster