Since single sensor and high-density point cloud data processing have certain direct processing limitations in urban traffic scenarios, this paper proposes a 3D instance segmentation and object detection framework for urban transportation scenes based on the fusion of Lidar remote sensing technology and optical image sensing technology. Firstly, multi-source and multi-mode data pre-fusion and alignment of Lidar and camera sensor data are effectively carried out, and then a unique and innovative network of stereo regional proposal selective search-driven DAGNN is constructed. Finally, using the multi-dimensional information interaction, three-dimensional point clouds with multi-features and unique concave-convex geometric characteristics are instance over-segmented and clustered by the hypervoxel storage in the remarkable octree and growing voxels. Finally, the positioning and semantic information of significant 3D object detection in this paper are visualized by multi-dimensional mapping of the boundary box. The experimental results validate the effectiveness of the proposed framework with excellent feedback for small objects, object stacking, and object occlusion. It can be a remediable or alternative plan to a single sensor and provide an essential theoretical and application basis for remote sensing, autonomous driving, environment modeling, autonomous navigation, and path planning under the V2X intelligent network space– ground integration in the future.