Automatic segmentation of brain tumors from multi-modal images is important for preoperative diagnosis and prognostic assessment. The rich complementary information contained in multi-modal images allows models trained on multi-modal data to achieve better brain tumor segmentation performance. However, accurately segmenting small lesion regions from medical images remains challenging due to the irregular shapes and low boundary contrast of brain gliomas. To address these challenges, we propose a coarse-to-fine feature fusion network (CFNet) that effectively incorporates multi-modal image features through modal interaction, semantic perception, and feature fusion. Specifically, a modality cross-attention fusion module is proposed to introduce complementary features and map them into a unified semantic space, learning complementary representations across modalities. Furthermore, a multi-scale context perception module enables the model to focus on fine-grained lesion information at multiple scales while capturing deep semantics. Moreover, the further fusion of fine-grained features with coarse-grained features enhances edge information and feature complementarity. We evaluate the proposed CFNet on two public brain tumor segmentation datasets, BraTS2019 and BraTS2020. Experimental results show that the proposed framework outperforms state-of-the-art methods in Dice score, HD95, and Sensitivity, demonstrating the effectiveness of CFNet for brain tumor segmentation. Code will be available at https://github.com/YaruC/CFNet.
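To make the modal-interaction idea concrete, the following is a minimal sketch (not the authors' released implementation) of cross-attention fusion between two modality feature maps, assuming PyTorch and 3D volumetric features of shape (B, C, D, H, W). The module name `ModalityCrossAttention` and the variable names `x_t1ce` and `x_flair` are illustrative assumptions, not identifiers from the CFNet code.

```python
# A hedged sketch of cross-attention between two MRI modality feature maps.
# Queries come from one modality and attend to keys/values from the other,
# mapping both into a shared semantic space (illustrative, not the CFNet code).
import torch
import torch.nn as nn


class ModalityCrossAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # Flatten spatial dims into token sequences: (B, C, D, H, W) -> (B, N, C)
        b, c, d, h, w = x_a.shape
        q = x_a.flatten(2).transpose(1, 2)
        kv = x_b.flatten(2).transpose(1, 2)
        fused, _ = self.attn(q, kv, kv)   # modality A attends to modality B
        fused = self.norm(fused + q)      # residual connection + normalization
        return fused.transpose(1, 2).reshape(b, c, d, h, w)


if __name__ == "__main__":
    x_t1ce = torch.randn(1, 32, 8, 16, 16)   # features from one MRI modality (assumed)
    x_flair = torch.randn(1, 32, 8, 16, 16)  # features from another modality (assumed)
    fusion = ModalityCrossAttention(channels=32)
    print(fusion(x_t1ce, x_flair).shape)     # torch.Size([1, 32, 8, 16, 16])
```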