As a multimodal form of hate speech on social media, hateful memes pose aggressive yet cryptic threats to people's real lives. Automatic detection of hateful memes is therefore crucial, but the image and text in most memes are only weakly aligned or even unrelated. Although existing works have achieved the initial goal of detecting hateful memes with pre-trained models, they are limited to monolithic inference and ignore the semantic differences between multimodal representations. To strengthen comprehension and reasoning about the hidden meaning behind memes by incorporating real-world knowledge, we propose an enhanced multimodal fusion framework with a congruent reinforced perceptron for hateful meme detection. Inspired by the human cognitive mechanism, we first divide the extracted multisource representations into main semantics and auxiliary contexts according to their strength and relevance, and then pre-encode each group into lightly correlated embeddings with a unified spatial dimension via a novel prefix uniform layer. To jointly learn the intrinsic correlation between primary and secondary semantics, a congruent reinforced perceptron with brain-like perceptual integration is designed to seamlessly fuse multimodal representations in a shared latent space while maintaining feature integrity in the sub-fusion spaces, thereby implicitly reasoning about the subtle metaphors behind memes. Extensive experiments on four benchmark datasets demonstrate the effectiveness and superiority of our architecture over previous state-of-the-art methods.
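To make the described pipeline concrete, the following is a minimal sketch of one plausible instantiation, assuming PyTorch. The module names (PrefixUniformLayer, CongruentReinforcedPerceptron), the gating-based fusion, and all dimensions are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Hypothetical sketch: split multisource features into "main" and "auxiliary"
# groups, project each to a unified dimension, then fuse in a shared latent
# space while keeping per-source sub-fusion branches.
import torch
import torch.nn as nn


class PrefixUniformLayer(nn.Module):
    """Projects a source-specific representation into a shared embedding size."""

    def __init__(self, in_dim: int, uni_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, uni_dim), nn.LayerNorm(uni_dim), nn.GELU()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class CongruentReinforcedPerceptron(nn.Module):
    """Fuses main and auxiliary embeddings via congruence-style gating in a
    shared latent space, while sub-fusion branches preserve each source."""

    def __init__(self, uni_dim: int, hidden_dim: int, num_classes: int = 2):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * uni_dim, uni_dim), nn.Sigmoid())
        self.shared = nn.Sequential(nn.Linear(uni_dim, hidden_dim), nn.GELU())
        self.sub_main = nn.Linear(uni_dim, hidden_dim)
        self.sub_aux = nn.Linear(uni_dim, hidden_dim)
        self.classifier = nn.Linear(3 * hidden_dim, num_classes)

    def forward(self, main: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # Gate weighs how much each source contributes to the shared fusion.
        g = self.gate(torch.cat([main, aux], dim=-1))
        fused = self.shared(g * main + (1.0 - g) * aux)
        # Keep per-source features intact alongside the fused representation.
        joint = torch.cat([fused, self.sub_main(main), self.sub_aux(aux)], dim=-1)
        return self.classifier(joint)


if __name__ == "__main__":
    # Toy dimensions: a 768-d "main semantics" vector and a 512-d "auxiliary
    # context" vector per meme, batch of 4.
    main_enc = PrefixUniformLayer(in_dim=768, uni_dim=256)
    aux_enc = PrefixUniformLayer(in_dim=512, uni_dim=256)
    fusion = CongruentReinforcedPerceptron(uni_dim=256, hidden_dim=128)
    logits = fusion(main_enc(torch.randn(4, 768)), aux_enc(torch.randn(4, 512)))
    print(logits.shape)  # torch.Size([4, 2])
```

The gating choice here is only one way to realize "seamless fusion in a shared latent space with sub-fusion integrity"; cross-attention or bilinear pooling would be equally valid substitutes under the same interface.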