Modular multiplication is the core operation in public-key cryptographic algorithms such as RSA and the Diffie-Hellman algorithm. The efficiency of the modular multiplier plays a crucial role in the performance of these cryptographic methods. In this paper, improvements to FFT-based Montgomery Modular Multiplication (FFTM $^3$ ) using carry-save arithmetic and pre-computation techniques are presented. Moreover, pseudo-Fermat number transform is used to enrich the supported operand sizes for the FFTM $^3$ . The asymptotic complexity of our method is $O(l\; \log\, l\; \log\; \log l)$ , which is the same as the Schonhage-Strassen multiplication algorithm (SSA). A systematic procedure to select suitable parameter set for the FFTM $^3$ is provided. Prototypes of the improved FFTM $^3$ multiplier with appropriate parameter sets are implemented on Xilinx Virtex-6 FPGA. Our method can perform 3,100-bit and 4,124-bit modular multiplications in 6.74 and 7.78 $\mu$ s, respectively. It offers better computation latency and area-latency product compared to the state-of-the-art methods for operand size of 3,072-bit and above.