Some materials in construction and demolition waste (CDW) are difficult to separate by density and magnetic characteristics alone. To separate these materials more finely, a multi-sensor fusion method for the visual detection of CDW is proposed. The method combines a red–green–blue–depth (RGBD) data collection method that has high precision in the height direction with a multimodal data processing algorithm. To obtain comprehensive features accurately and efficiently, two independent MobileNet branches extract colour visual features and geometric features separately, and an encoder controls the frequency of the two models in real time so that the two data types are better matched. To improve robustness, the method uses the height data to filter out both the conveyor belt background and the invalid areas generated by the region proposal networks. Experiments show that the accuracy of the proposed visual detection method exceeds 90%. Computational efficiency reaches 1.67 frames per second (fps) on a central processing unit (CPU) and 30 fps on a graphics processing unit. When running on the CPU, the method can detect up to 25 000 CDW items per hour.
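
The height-based filtering step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the box format, the millimetre thresholds and the foreground-fraction criterion are all assumptions chosen for illustration. The sketch marks pixels that rise above the belt surface as foreground and discards region proposals that cover mostly belt.

```python
from typing import List, Tuple

# Hypothetical box format: (row0, col0, row1, col1), end-exclusive.
Box = Tuple[int, int, int, int]


def foreground_mask(height_mm: List[List[float]],
                    belt_height_mm: float = 0.0,
                    min_object_height_mm: float = 5.0) -> List[List[bool]]:
    """Mark pixels lying more than a threshold above the belt surface.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    threshold = belt_height_mm + min_object_height_mm
    return [[h > threshold for h in row] for row in height_mm]


def foreground_fraction(mask: List[List[bool]], box: Box) -> float:
    """Return the share of pixels inside `box` that are foreground."""
    r0, c0, r1, c1 = box
    area = (r1 - r0) * (c1 - c0)
    if area <= 0:
        return 0.0
    hits = sum(mask[r][c] for r in range(r0, r1) for c in range(c0, c1))
    return hits / area


def filter_proposals(mask: List[List[bool]], boxes: List[Box],
                     min_fraction: float = 0.5) -> List[Box]:
    """Keep only proposals that mostly cover raised (non-belt) pixels."""
    return [b for b in boxes if foreground_fraction(mask, b) >= min_fraction]


# Toy 4 x 6 height map in millimetres: one raised object plus sensor noise.
height_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 40, 42, 0, 0, 0],
    [0, 41, 43, 0, 0, 0],
    [0, 0, 0, 0, 0, 2],
]
proposals = [(1, 1, 3, 3), (0, 3, 2, 6), (2, 4, 4, 6)]
kept = filter_proposals(foreground_mask(height_map), proposals)
print(kept)  # [(1, 1, 3, 3)]
```

In this toy example only the first proposal survives, because the other two cover belt pixels or noise below the assumed 5 mm threshold. A real system would apply the same idea to full-resolution depth arrays, and the fraction criterion could be replaced by any other validity test on the height data.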