Modular multiplication (MM) is a key operation in cryptographic algorithms such as RSA and elliptic-curve cryptography. A multicore processor is a suitable platform for implementing MM because of its flexibility, high performance, and energy efficiency. In this paper, we propose a block-level parallel algorithm for MM with quotient pipelining and optimally map it onto a network-on-chip-based multicore platform equipped with a broadcasting mechanism. To achieve the highest performance, we also develop a theoretical speedup model for parallel MM and use it to explore the parameters that optimize task partitioning. Experimental results on a multicore prototype show that, compared with sequential MM on a single core, the proposed parallel implementation maximizes the speedup ratio for a given intercore communication latency.
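For orientation, the sequential single-core baseline can be sketched as follows. This is a minimal illustration and not the algorithm proposed in the paper. It assumes that MM refers to Montgomery-style modular multiplication, which the mention of quotient pipelining suggests but the abstract does not state. The function name `montgomery_multiply` and the parameter `word_bits` are hypothetical, and neither quotient pipelining nor block-level partitioning is modeled.

```python
def montgomery_multiply(a, b, n, word_bits=32):
    """Word-serial Montgomery product: a * b * R^-1 mod n.

    Assumptions (not taken from the paper): n is odd, 0 <= a, b < n,
    and R = 2**(word_bits * s), where s is the word count of n.
    Requires Python 3.8+ for pow(n, -1, base).
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    base = 1 << word_bits
    mask = base - 1
    s = (n.bit_length() + word_bits - 1) // word_bits  # words in n
    n_prime = (-pow(n, -1, base)) % base  # -n^-1 mod 2^word_bits

    t = 0
    for i in range(s):
        a_i = (a >> (i * word_bits)) & mask  # i-th word of a
        t += a_i * b                         # partial product
        q = ((t & mask) * n_prime) & mask    # quotient digit
        t = (t + q * n) >> word_bits         # exact division by 2^word_bits
    if t >= n:                               # single final correction
        t -= n
    return t
```

Each iteration depends on the quotient digit `q` computed from the running sum, and this dependency is what a parallel mapping has to work around. In the sketch, the result can be checked against `a * b * pow(R, -1, n) % n` with `R = 2**(word_bits * s)`.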