Despite the successful application of multimodal deep learning (MDL) methods for land use/land cover (LULC) classification tasks, their fusion capacity has not yet been substantially examined for hyperspectral and synthetic aperture radar (SAR) data. Hyperspectral and SAR data have recently been widely used in land cover classification. However, the speckle noise of SAR and the heterogeneity with the imaging mechanism of hyperspectral data have hindered the application of MDL methods for integrating hyperspectral and SAR data. Accordingly, we proposed a deep feature fusion method called Refine-EndNet that combines a dynamic filter network (DFN), an attention mechanism (AM), and an encoder–decoder framework (EndNet). The proposed method is specifically designed for hyperspectral and SAR data and adopts an intra-group and inter-group feature fusion strategy. In intra-group feature fusion, the spectral information of hyperspectral data is integrated by fully connected neural networks in the feature dimension. The fusion filter generation network (FFGN) suppresses the presence of speckle noise and the influence of heterogeneity between multimodal data. In inter-group feature fusion, the fusion weight generation network (FWGN) further optimizes complementary information and improves fusion capacity. Experimental results from ZY1-02D satellite hyperspectral data and Sentinel-1A dual-polarimetric SAR data illustrate that the proposed method outperforms the conventional feature-level image fusion (FLIF) and MDL methods, such as S2ENet, FusAtNet, and EndNets, both visually and numerically. We first attempt to investigate the potentials of ZY1-02D satellite hyperspectral data affected by thick clouds, combined with SAR data for complex ground object classification in the land cover ecosystem.