Most current 3-D object detection algorithms rely on point clouds alone. Although LiDAR provides target location and contour information, its point clouds are sparse, especially for distant objects. Camera sensors, by contrast, provide richer color and texture information about the target. However, using point cloud and image data together for object detection tends to produce a large model capacity and overfitting. The different modalities also produce different gradients for different subnetworks, which makes the whole network difficult to optimize. To address these problems and further improve detection performance, this article proposes a 3-D object detection method, the attention mechanism and voxel feature pyramid multimodal VoxelNet (AVFP-MVX), which uses both point cloud and image data. Building on MVX-Net, the method uses an attention mechanism and a voxel feature pyramid to improve the detection accuracy of 3-D objects. The visualization results show that AVFP-MVX performs well overall: it accurately selects target objects and regresses well-fitting bounding boxes. Comparative tests show that the proposed method achieves detection accuracies of 91.24%, 80.45%, and 76.91% on the easy, moderate, and hard car targets, respectively, and average accuracies of 62.44% and 67.64% for pedestrians and cyclists, respectively, outperforming the compared methods. Ablation experiments show that adding the attention mechanism and the voxel feature pyramid network (Voxel-FPN) increases the detection accuracy for cars, pedestrians, and cyclists by 1.87%, 1.85%, and 1.88%, respectively.