The rapid development of remote sensing technology provides wealthy data for earth observation. Land-cover mapping indirectly achieves biodiversity estimation at a coarse scale. Therefore, accurate land-cover mapping is the precondition of biodiversity estimation. However, the environment of the wetlands is complex, and the vegetation is mixed and patchy, so the land-cover recognition based on remote sensing is full of challenges. This paper constructs a systematic framework for multisource remote sensing image processing. Firstly, the hyperspectral image (HSI) and multispectral image (MSI) are fused by the CNN-based method to obtain the fused image with high spatial-spectral resolution. Secondly, considering the sequentiality of spatial distribution and spectral response, the spatial-spectral vision transformer (SSViT) is designed to extract sequential relationships from the fused images. After that, an external attention module is utilized for feature integration, and then the pixel-wise prediction is achieved for land-cover mapping. Finally, land-cover mapping and benthos data at the sites are analyzed consistently to reveal the distribution rule of benthos. Experiments on ZiYuan1-02D data of the Yellow River estuary wetland are conducted to demonstrate the effectiveness of the proposed framework compared with several related methods.