This paper presents a novel forward-backward four-way merger min-max algorithm and a high-throughput decoder architecture for nonbinary low-density parity-check (NB-LDPC) codes, which together significantly reduce decoding latency. An efficient partial-parallel block-layered decoder architecture suited to the proposed algorithm is presented to speed up decoder convergence. Moreover, a parallel switch network architecture and a parallel-serial check node unit are proposed to facilitate the implementation of the decoder. The proposed algorithm halves the number of check node processing steps. Consequently, a decoder using the proposed algorithm achieves considerably higher throughput than previous works. Decoders for two quasi-cyclic NB-LDPC (QC-NB-LDPC) codes over GF(32), with parameters (837, 726) and (744, 653), are synthesized in a 90-nm CMOS technology. The implementation results demonstrate that the proposed decoder architecture can operate at a clock frequency of 370 MHz, and the throughputs for these two codes are 92.6 Mbps and 118.86 Mbps, respectively.
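For orientation, the conventional forward-backward min-max check node update that the proposed algorithm builds on can be sketched as follows. This is a minimal illustration of the baseline recursion only, not the four-way merger itself, whose details are not given here. It assumes that messages are reliability vectors indexed by the 32 symbols of GF(32) (smaller is more reliable), that field addition is bitwise XOR on those indices, and that the edge coefficients of the parity-check matrix have already been absorbed into the message ordering. The names `elementary_step` and `check_node_forward_backward` are hypothetical.

```python
from typing import List

Q = 32  # GF(32): symbols indexed 0..31, field addition assumed to be bitwise XOR


def elementary_step(u: List[float], v: List[float]) -> List[float]:
    """Min-max combine: out[a] = min over a1 ^ a2 == a of max(u[a1], v[a2])."""
    out = [float("inf")] * Q
    for a1 in range(Q):
        for a2 in range(Q):
            a = a1 ^ a2
            m = max(u[a1], v[a2])
            if m < out[a]:
                out[a] = m
    return out


def check_node_forward_backward(msgs: List[List[float]]) -> List[List[float]]:
    """Return one outgoing message per edge of a check node of degree >= 2."""
    d = len(msgs)
    if d < 2:
        raise ValueError("check node degree must be at least 2")

    # forward[i] combines msgs[0..i]; built left to right.
    forward = [msgs[0]]
    for i in range(1, d):
        forward.append(elementary_step(forward[-1], msgs[i]))

    # backward[i] combines msgs[i..d-1]; built right to left.
    backward: List[List[float]] = [[] for _ in range(d)]
    backward[d - 1] = msgs[d - 1]
    for i in range(d - 2, -1, -1):
        backward[i] = elementary_step(msgs[i], backward[i + 1])

    # The output on edge i excludes msgs[i]: merge the prefix before i
    # with the suffix after i.
    out = []
    for i in range(d):
        if i == 0:
            out.append(backward[1])
        elif i == d - 1:
            out.append(forward[d - 2])
        else:
            out.append(elementary_step(forward[i - 1], backward[i + 1]))
    return out
```

In this baseline the forward and backward recursions each form a serial chain of elementary steps whose length grows with the check node degree, which is the per-check-node latency that the abstract's halving claim refers to.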