Backpack computers require powerful, intelligent computing capabilities for field wearables while taking energy consumption into careful consideration. A recommended solution for this demand is the CPU + NPU-based SoC. In many wearable intelligence applications, the Fourier Transform is an essential, computationally intensive preprocessing task. However, due to the unique structure of the NPU, the conventional Fourier Transform algorithms cannot be applied directly to it. This paper proposes two NPU-accelerated Fourier Transform algorithms that leverage the unique hardware structure of the NPU and provides three implementations of those algorithms, namely MM-2DFT, MV-2FFTm, and MV-2FFTv. Then, we benchmarked the speed and energy efficiency of our algorithms for the gray image edge filtering task on the Huawei Atlas200I-DK-A2 development kits against the Cooley-Tukey algorithm running on CPU and GPU platforms. The experiment results reveal MM-2DFT outperforms OpenCL-based FFT on NVIDIA Tegra X2 GPU for small input sizes, with a 4- to 8-time speedup. As the input image resolution exceeds 2048, MV-2FFTv approaches GPU computation speed. Additionally, two scenarios were tested and analyzed for energy efficiency, revealing that cube units of the NPU are more energy efficient. The vector and CPU units are better suited for sparse matrix multiplication and small-scale inputs, respectively.
Read full abstract