Data completion is the problem of filling in the missing or unobserved elements of partially observed datasets. Data completion algorithms have received wide attention and achieved notable success in diverse domains, including data mining, signal processing, and computer vision. We observe a ubiquitous tubal-sampling pattern in big data and Internet of Things (IoT) applications, which arises for many reasons, such as high data acquisition costs, downsampling for data compression, sensor node failures, and packet losses in low-power wireless transmissions. To meet the time and accuracy requirements of applications, data completion methods are expected to be both accurate and fast. However, existing methods for data completion with the tubal-sampling pattern are either accurate or fast, but not both. In this article, we propose a high-performance graphics processing unit (GPU) tensor completion method for data completion with the tubal-sampling pattern. First, by exploiting the convolution theorem, we split the tensor least-squares minimization problem into multiple least-squares sub-problems in the frequency domain. In this way, massive parallelism is exposed for many-core GPU architectures while high recovery accuracy is preserved. Second, we propose computing slice-level and tube-level tasks in batches to improve GPU utilization. Third, we reduce the data transfer cost by eliminating accesses to CPU memory inside the algorithms' loop structures. The experimental results show that the proposed tensor completion is both fast and accurate. On synthetic data of varying sizes, the proposed GPU tensor completion achieves maximum speedups of 248.18×, 7,403.27×, and 33.27× over the CPU MATLAB implementation, the GPU element-sampling tensor completion in the cuTensor-tubal library, and the GPU high-performance matrix completion, respectively. With a 50 percent sampling rate, the proposed GPU tensor completion achieves a recovery error of 1.40e-5, which is comparable to that of the GPU element-sampling tensor completion and three orders of magnitude better than that of the GPU high-performance matrix completion. To utilize multiple GPUs in servers, we design a multi-GPU scheme for tubal-sampling tensor completion, which achieves a maximum speedup of 1.89× on two GPUs over a single GPU for medium and large tensors. We further evaluate the proposed GPU tensor completion in three real applications, namely, video transmission in wireless camera networks, RF fingerprint-based indoor localization, and seismic data completion, where it achieves maximum speedups of 448.68×, 24.63×, and 311.54×, respectively. We integrate this high-performance GPU tensor completion implementation into the cuTensor-tubal library to support various applications.
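As a minimal illustration of the frequency-domain splitting (our notation, not the paper's): under the t-product model commonly used for tubal tensors, a least-squares step of the form $\min_{\mathcal{X}} \|\mathcal{A} * \mathcal{X} - \mathcal{B}\|_F$, with $\mathcal{A} \in \mathbb{R}^{n_1 \times r \times n_3}$, $\mathcal{X} \in \mathbb{R}^{r \times n_2 \times n_3}$, and $\mathcal{B} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, decouples after a DFT along the third (tube) dimension, because the circular convolution inside the t-product becomes slice-wise matrix multiplication in the frequency domain:
\[
\min_{\mathcal{X}} \big\| \mathcal{A} * \mathcal{X} - \mathcal{B} \big\|_F^2
= \frac{1}{n_3} \sum_{k=1}^{n_3} \min_{\hat{X}^{(k)}} \big\| \hat{A}^{(k)} \hat{X}^{(k)} - \hat{B}^{(k)} \big\|_F^2,
\qquad \hat{\mathcal{A}} = \mathrm{fft}(\mathcal{A}, [\,], 3),\ \ \hat{\mathcal{B}} = \mathrm{fft}(\mathcal{B}, [\,], 3),
\]
where $\hat{A}^{(k)}$ denotes the $k$-th frontal slice of $\hat{\mathcal{A}}$. The $n_3$ slice problems $\hat{A}^{(k)} \hat{X}^{(k)} \approx \hat{B}^{(k)}$ are ordinary matrix least-squares problems that are mutually independent, so they can be solved in parallel and in batches on the GPU, after which $\mathcal{X} = \mathrm{ifft}(\hat{\mathcal{X}}, [\,], 3)$ returns the result to the time domain. This sketch shows only the general identity; the completion algorithm itself additionally accounts for the tubal-sampling mask, which is omitted here.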