Non-binary low-density parity-check (NB-LDPC) codes outperform their binary counterparts in many cases. However, an NB-LDPC decoder usually requires substantial hardware resources and memory. The trellis-based min-max decoding algorithm (TMMA), a well-known algorithm proposed in recent years, achieves a good tradeoff between decoding performance and hardware complexity. Within the decoder, the check node processing unit (CNU) accounts for most of the hardware cost. Based on the TMMA, many simplifications of the CNU have been developed at the cost of a slight performance loss. Among them, the TMMA with L truncations (L-TMMA) is a promising candidate, offering higher hardware efficiency than the others. In this brief, based on the L-TMMA, we propose a new CNU design that combines algorithmic transformation and architectural optimization to further reduce the hardware complexity, and thereby shorten the critical path, without any performance degradation. Synthesis results show that, compared with state-of-the-art designs, the proposed design achieves the lowest hardware consumption and the highest clock frequency with a small latency. Specifically, it saves more than one third of the hardware resources compared with the original design on which it is based.