Vision Transformers (ViTs) show great potential for recognition tasks owing to their self-attention mechanisms. However, their high computational complexity makes achieving strong performance in resource-constrained environments challenging. This paper proposes a lightweight visual model, called LightViM, specifically devised for recognition under resource constraints. To reduce the model's computational complexity and resource consumption, we propose lightweight local–global feature fusion modules based on Mamba (LGF-Mamba), which integrate spatially detailed local information with globally contextualized features while maintaining linear time complexity. In LGF-Mamba, Mamba sub-blocks first extract global information; then, for rapid extraction of local features, a LocalE module is designed and integrated into LGF-Mamba, yielding a more comprehensive feature representation. This mechanism efficiently directs the network to integrate features from different spatial scales, thereby improving recognition performance in resource-constrained environments. Experimental results indicate that, compared with other leading-edge lightweight methods, the proposed method achieves both superior recognition accuracy and minimal resource consumption.
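The local–global fusion idea can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the causal running mean stands in for the Mamba state-space scan (both are linear-time global mixers over the token axis), the per-channel moving average stands in for the LocalE module, and additive fusion is an assumption, since the abstract does not specify the fusion rule.

```python
import numpy as np

def global_branch(x):
    """Linear-time global mixing over tokens: a causal running mean,
    used here as a crude stand-in for the Mamba sub-block's scan."""
    # x: (tokens, channels)
    cumsum = np.cumsum(x, axis=0)
    counts = np.arange(1, x.shape[0] + 1)[:, None]
    return cumsum / counts

def local_branch(x, k=3):
    """Local feature extraction: per-channel moving average over a small
    window, a stand-in for the LocalE module (assumed conv-like)."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        out[i] = xp[i:i + k].mean(axis=0)
    return out

def lgf_fuse(x):
    """Combine spatially detailed local features with globally
    contextualized features (additive fusion is an assumption)."""
    return local_branch(x) + global_branch(x)

tokens = np.random.default_rng(0).normal(size=(8, 4))
fused = lgf_fuse(tokens)
print(fused.shape)  # (8, 4)
```

Both branches make a single pass over the token sequence, so the combined module keeps the linear time complexity that motivates replacing quadratic self-attention in resource-constrained settings.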