Multimodal content can convey deception more convincingly than unimodal information, causing significant social and economic harm. Current techniques often focus on a single modality and neglect cross-modal knowledge fusion. While most studies have concentrated on English fake news detection, this study explores multimodal detection for low-resource languages such as Hindi. This work introduces the MMHFND model, based on M-CLIP, which uses late fusion for coarse-grained (Fake vs Real) and fine-grained (World vs India vs Politics vs News vs Fact-Check) classification configurations. We extract deep representations from images and text using a ResNet-50 image encoder, an L3cube-HindRoberta text transformer handling headlines, body content, OCR-extracted text, and image captions, paired M-CLIP transformers, and an ELA (Error Level Analysis) image forensic method incorporating EfficientNet B0, to analyse multimodal news written in Hindi using the Devanagari script. M-CLIP integrates cross-modal similarity mapping between images and texts with the retrieved multimodal features. The extracted features undergo redundancy reduction before being channelled into the final classifier. We introduce a Modality Attention Mechanism (MAM) that generates attention weights for each modality individually. The MMHFND model computes a modality divergence score to identify dissonance between modalities and applies a modified contrastive loss to this score. We thoroughly analyse the HinFakeNews dataset in a multimodal context, achieving strong accuracy in both coarse-grained and fine-grained configurations. We also undertake an ablation study to evaluate outcomes and explore alternative fusion processes across three different setups. The results show that the MMHFND model effectively detects fake news in Hindi with an accuracy of 0.986, outperforming existing multimodal approaches.
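As a rough illustration of the modality attention and late-fusion idea outlined above, the following PyTorch sketch assigns a learned attention weight to each modality's feature vector (e.g., text, image, and ELA forensic branches) and fuses them before a shared classifier. The feature dimension, the use of exactly three modalities, and all module names are assumptions made for illustration only, not the authors' implementation.

```python
# Minimal sketch (illustrative, not the paper's code): attention-weighted late fusion
# over per-modality feature vectors, followed by a shared classification head.
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    def __init__(self, dim=512, num_modalities=3, num_classes=2):
        super().__init__()
        # One scalar attention score per modality, computed from its own feature vector.
        self.score = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_modalities)])
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats):
        # feats: list of (batch, dim) tensors, one per modality.
        scores = torch.cat([s(f) for s, f in zip(self.score, feats)], dim=1)  # (batch, M)
        weights = torch.softmax(scores, dim=1)                                # modality weights
        stacked = torch.stack(feats, dim=1)                                   # (batch, M, dim)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)                  # weighted late fusion
        return self.classifier(fused), weights

# Example usage with hypothetical 512-d features from text, image, and ELA branches.
text_f, img_f, ela_f = (torch.randn(8, 512) for _ in range(3))
logits, modality_weights = ModalityAttentionFusion()([text_f, img_f, ela_f])
```

In this sketch the softmax over per-modality scores plays the role of the MAM weights; the modality divergence score and modified contrastive loss described in the abstract would operate on top of such per-modality representations and are not shown here.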