Accurate estimation of fractional vegetation coverage (FVC) is essential for assessing the ecological environment and acquiring ecological information. However, under natural lighting conditions, shadows in vegetation scenes can easily cause shadowed vegetation to be confused with shadowed soil, resulting in misclassification and omission errors. This issue limits the precision of both vegetation extraction and FVC estimation. To address this challenge, this study introduces a novel deep learning model, the Mixture of Modality Transformer (MoMFormer), which is specifically designed to mitigate shadow interference in vegetation extraction. Our model uses Swin Transformer V2 as a feature extractor, effectively capturing vegetation features from a dual-modality dataset of regular-exposure RGB and high dynamic range (HDR) images. A dynamic aggregation module (DAM) is integrated to adaptively blend the most relevant vegetation features from the two modalities. We conducted extensive experiments on a self-annotated dataset featuring diverse vegetation–soil scenes and compared our model with several state-of-the-art (SOTA) methods. The results show that MoMFormer achieves an accuracy of 89.43% on the HDR–RGB dual-modality dataset, with an FVC accuracy of 87.57%, outperforming the other algorithms and demonstrating high vegetation extraction accuracy and adaptability under natural lighting conditions. This research offers new insights into accurate vegetation information extraction in naturally lit environments with shadows, providing robust technical support for high-precision validation of vegetation coverage products and algorithms based on multimodal data. The code and datasets used in this study are publicly available at https://github.com/hhhxiaohe/MoMFormer.
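To make the dual-modality idea concrete, the sketch below shows how an RGB branch and an HDR branch could each feed a feature extractor whose outputs are blended by a gated aggregation step before a segmentation head. This is a minimal illustrative sketch only, not the authors' implementation: the class names, the per-pixel softmax gating, and the small CNN standing in for the Swin Transformer V2 backbone are all assumptions made for brevity.

```python
# Minimal sketch (assumed, not the paper's code): dual-modality encoder +
# gated "dynamic aggregation" fusion + segmentation head.
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Stand-in feature extractor; in the paper this role is played by Swin Transformer V2."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)  # (B, dim, H/4, W/4)


class DynamicAggregation(nn.Module):
    """Hypothetical gated fusion: predicts per-pixel blending weights for the two modalities."""
    def __init__(self, dim=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * dim, dim, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, 2, 1),  # one logit per modality at each spatial location
        )

    def forward(self, f_rgb, f_hdr):
        w = torch.softmax(self.gate(torch.cat([f_rgb, f_hdr], dim=1)), dim=1)
        return w[:, :1] * f_rgb + w[:, 1:] * f_hdr  # convex per-pixel blend


class DualModalitySegmenter(nn.Module):
    """RGB + HDR inputs -> fused features -> vegetation/background logits."""
    def __init__(self, dim=64, num_classes=2):
        super().__init__()
        self.enc_rgb = TinyEncoder(3, dim)
        self.enc_hdr = TinyEncoder(3, dim)
        self.fuse = DynamicAggregation(dim)
        self.head = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, num_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, rgb, hdr):
        return self.head(self.fuse(self.enc_rgb(rgb), self.enc_hdr(hdr)))


if __name__ == "__main__":
    model = DualModalitySegmenter()
    rgb = torch.randn(1, 3, 256, 256)  # regular-exposure image
    hdr = torch.randn(1, 3, 256, 256)  # tone-mapped HDR image (assumed 3-channel)
    logits = model(rgb, hdr)
    # Toy FVC estimate: fraction of pixels predicted as vegetation (class 1).
    fvc = (logits.argmax(dim=1) == 1).float().mean()
    print(logits.shape, fvc.item())
```

Under these assumptions, FVC is simply the ratio of pixels classified as vegetation to the total pixel count; the gating step is where shadow-robust cues from the HDR modality would be weighted more heavily where the RGB features are ambiguous.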