In recent years, multi-modal fusion methods have shown excellent performance in the field of 3D object detection, which select the voxel centers and globally fuse with image features across the scene. However, these approaches exist two issues. First, The distribution of voxel density is highly heterogeneous due to the discrete volumes. Additionally, there are significant differences in the features between images and point clouds. Global fusion does not take into account the correspondence between these two modalities, which leads to the insufficient fusion. In this paper, we propose a new multi-modal fusion method named Voxel-Image Dynamic Fusion (VIDF). Specifically, VIDF-Net is composed of the Voxel Centroid Mapping module (VCM) and the Deformable Attention Fusion module (DAF). The Voxel Centroid Mapping module is used to calculate the centroid of voxel features and map them onto the image plane, which can locate the position of voxel features more effectively. We then use the Deformable Attention Fusion module to dynamically calculates the offset of each voxel centroid from the image position and combine these two modalities. Furthermore, we propose Region Proposal Network with Channel-Spatial Aggregate to combine channel and spatial attention maps for improved multi-scale feature interaction. We conduct extensive experiments on the KITTI dataset to demonstrate the outstanding performance of proposed VIDF network. In particular, significant improvements have been observed in the Hard categories of Cars and Pedestrians, which shows the significant effectiveness of our approach in dealing with complex scenarios.