For traditional public key cryptography and post-quantum cryptography, such as elliptic curve cryptography and supersingular isogeny key encapsulation, modular multiplication is the most performance-critical operation among basic arithmetic of these cryptographic schemes. For this reason, the execution timing of such cryptographic schemes, which may highly determine that the service availability for low-end microprocessors (e.g., 8-bit AVR, 16-bit MSP430X, and 32-bit ARM Cortex-M), mainly relies on the efficiency of modular multiplication on target embedded processors. In this article, we present new optimal modular multiplication techniques based on the interleaved Montgomery multiplication on 16-bit MSP430X microprocessors, where the multiplication part is performed in a hardware multiplier and the reduction part is performed in a basic arithmetic logic unit (ALU) with the optimal modular multiplication routine, respectively. This two-step approach is effective for the special modulus of NIST curves, SM2 curves, and supersingular isogeny key encapsulation. We further optimized the Montgomery reduction by using techniques for “Montgomery-friendly” prime. This technique significantly reduces the number of partial products. To demonstrate the superiority of the proposed implementation of Montgomery multiplication, we applied the proposed method to the NIST P-256 curve, of which the implementation improves the previous modular multiplication operation by 23.6% on 16-bit MSP430X microprocessors and to the SM2 curve as well (first implementation on 16-bit MSP430X microcontrollers). Moreover, secure countermeasures against timing attack and simple power analysis are also applied to the scalar multiplication of NIST P-256 and SM2 curves, which achieve the 8,582,338 clock cycles (0.53 seconds@16 MHz) and 10,027,086 clock cycles (0.62 seconds@16 MHz), respectively. The proposed Montgomery multiplication is a generic method that can be applied to other cryptographic schemes and microprocessors with minor modifications.