Semantic segmentation of 3D point clouds is of paramount importance to visual perception in autonomous driving, underpinning tasks such as obstacle avoidance, decision control, path planning, and map construction. Multi-sensor fusion is a natural and pivotal technique for LiDAR semantic segmentation; however, effectively fusing and exploiting multi-source data remains challenging. In this work, we introduce a novel visual perception-assisted point cloud segmentation network, termed VPA-Net. The network adopts a dual-branch design that combines spatial information from point clouds with visual cues from RGB images, thereby improving 3D LiDAR semantic segmentation. Specifically, the two branches process the point cloud and image modalities separately, and their intermediate features are merged via the proposed attention-based feature fusion module. Furthermore, to address the challenge of precise boundary prediction in large-scale point cloud scene segmentation, we introduce a refinement module based on 3D sparse convolution that enhances the spatial information of the LiDAR point cloud. We validate our method on SemanticKITTI and a more challenging 3D semantic segmentation dataset. Experimental results show a significant improvement on SemanticKITTI, where our approach surpasses the state-of-the-art SalsaNext by 7.1% mIoU.
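The abstract does not detail the fusion mechanism, so the following is only a minimal illustrative sketch of one common attention-based fusion scheme (squeeze-and-excitation style channel gating) applied to range-view LiDAR features and image features projected into the same view. All class names, shapes, and design choices here are assumptions for illustration, not the authors' actual module.

```python
# Hypothetical sketch of attention-based fusion of LiDAR and image
# features; names and shapes are illustrative assumptions only.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse point-cloud and image feature maps with channel attention."""

    def __init__(self, channels: int):
        super().__init__()
        # Squeeze-and-excitation style gate over the concatenated features.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global context: (B, 2C, 1, 1)
            nn.Conv2d(2 * channels, channels // 4, 1),    # bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 2 * channels, 1),    # expand back
            nn.Sigmoid(),                                 # per-channel weights in [0, 1]
        )
        self.proj = nn.Conv2d(2 * channels, channels, 1)  # project back to C channels

    def forward(self, lidar_feat: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        # Both inputs: (B, C, H, W), e.g. range-view LiDAR features and
        # image features warped into the same view via calibration.
        x = torch.cat([lidar_feat, img_feat], dim=1)      # (B, 2C, H, W)
        x = x * self.gate(x)                              # reweight each channel
        return self.proj(x)                               # fused (B, C, H, W)

# Usage: fuse = AttentionFusion(64); out = fuse(lidar_feat, img_feat)
```

Channel gating of this kind lets the network suppress whichever modality is less informative at a given location (e.g., image features at night, LiDAR features on reflective surfaces); the actual VPA-Net fusion module may differ substantially.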