Compressed sensing (CS) has attracted considerable attention in recent years for advantages such as low power consumption, low memory requirements, and low sampling frequency. However, high-dimensional nonlinear signals inevitably make signal recovery complex and inefficient. Generalized turbo signal recovery (G-Turbo-SR) is a state-of-the-art method that reduces this complexity by using a partial discrete Fourier transform (DFT) sensing matrix. In practical applications, however, G-Turbo-SR still suffers from the high complexity of its probability computations and matrix multiplications. This article optimizes the scheduling of the G-Turbo-SR algorithm to halve the number of matrix multiplications. A high-precision numerical approximation method is proposed to replace the complex integral calculation, which reduces the hardware cost with acceptable performance degradation. Based on data-flow graph (DFG) analysis, a detailed hardware architecture is proposed together with its module designs. A suitable quantization scheme is selected according to the mean square error (MSE) performance. Pipelining, folding, and a variable precision quantization (VPQ) scheme are employed for higher hardware efficiency. An FPGA implementation on a Xilinx 7k325tffg900-2 device shows higher throughput and hardware efficiency than existing recovery methods for CS.
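To illustrate why a partial DFT sensing matrix helps with complexity, the following minimal sketch compares an explicit matrix-vector product against a fast Fourier transform followed by row selection. It is not the G-Turbo-SR algorithm itself. The names `fft`, `sense_explicit`, and `sense_fft`, the unnormalized DFT convention, and the sizes `N`, `M`, `K` are assumptions made only for this illustration.

```python
import cmath
import random

def fft(x):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def partial_dft_matrix(rows, n):
    """Explicit M x N sensing matrix: the selected rows of the N-point DFT matrix."""
    return [[cmath.exp(-2j * cmath.pi * r * c / n) for c in range(n)]
            for r in rows]

def sense_explicit(rows, x):
    """y = A x with A stored explicitly: O(M * N) multiplications."""
    a = partial_dft_matrix(rows, len(x))
    return [sum(a_rc * x_c for a_rc, x_c in zip(row, x)) for row in a]

def sense_fft(rows, x):
    """Same y, computed as an O(N log N) FFT followed by row selection."""
    full = fft(x)
    return [full[r] for r in rows]

# Hypothetical problem sizes: length-N signal, M measurements, K nonzeros.
random.seed(0)
N, M, K = 16, 6, 3

# Build a K-sparse test signal.
x = [0.0] * N
for idx in random.sample(range(N), K):
    x[idx] = random.gauss(0.0, 1.0)

# Choose which DFT rows form the sensing matrix.
rows = sorted(random.sample(range(N), M))

y_explicit = sense_explicit(rows, x)
y_fft = sense_fft(rows, x)

# Both routes should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(y_explicit, y_fft))
print("max difference between the two routes:", max_diff)
```

Because the sensing matrix never has to be stored or multiplied explicitly, each application costs on the order of N log N operations rather than M N, which is the kind of saving a DFT-structured sensing matrix makes available to a recovery algorithm.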