Owing to its superior algorithm-level performance, the non-binary low-density parity-check (NB-LDPC) code is expected to be one of the next-generation error-correction codes. In practice, however, a high-throughput NB-LDPC decoder is hard to implement because of its high processing complexity and excessively long decoding time. In this work, we build on the earlier extended min-sum (EMS) approach and introduce the parallel EMS (pEMS) decoding algorithm, which reduces the processing latency of each iteration by handling multiple message entries at a time. The earlier two-phase node-level operation is modified to support the proposed parallel processing without performance degradation, and the delay overhead is minimized by optimizing the internal sorters based on the attributes of their inputs. In addition, the data access sequence is adjusted to reduce the number of waiting cycles, further increasing the overall processing efficiency. Implemented in a 22-nm FinFET technology, the prototype two-parallel decoder for (160, 80) NB-LDPC codes operates at 950 MHz and achieves a decoding throughput of more than 7 Gb/s, which is 3.2 times faster than the state-of-the-art design.
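For orientation, the node-level operation that EMS-style decoders repeat can be sketched in software. The following is a minimal functional model of a generic EMS elementary check-node step, not the pEMS architecture itself: it assumes symbols are integers in GF(2^p) (so field addition is bitwise XOR), that a smaller reliability value means a more likely symbol, and that messages are truncated to `n_m` entries. The function name, data layout, and brute-force enumeration are illustrative assumptions; a hardware decoder would instead walk the sorted inputs with dedicated sorters.

```python
from itertools import product


def elementary_check_node(u, v, n_m):
    """Combine two truncated messages into one truncated output message.

    u, v : lists of (reliability, symbol) pairs; smaller reliability
           means a more likely symbol (assumed convention).
    n_m  : maximum number of entries kept in the output (hypothetical name).
    Symbols are integers in GF(2^p), so field addition is bitwise XOR.
    """
    # Min-sum rule: the reliability of a combined symbol is the sum of
    # the two input reliabilities; enumerate all pairs and sort them.
    candidates = sorted(
        (ru + rv, su ^ sv) for (ru, su), (rv, sv) in product(u, v)
    )

    output, seen = [], set()
    for reliability, symbol in candidates:
        # Keep only the most reliable occurrence of each symbol.
        if symbol in seen:
            continue
        seen.add(symbol)
        output.append((reliability, symbol))
        if len(output) == n_m:
            break
    return output


if __name__ == "__main__":
    u = [(0.0, 3), (1.5, 1), (2.0, 6)]
    v = [(0.0, 5), (0.5, 4), (3.0, 2)]
    print(elementary_check_node(u, v, n_m=3))
```

In this model the output entries are produced one at a time in order of reliability; the parallel processing described above corresponds, loosely, to producing several such entries per step.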