To improve the visual recognition and localization accuracy of robotic arms for similar targets in complex scenes, a hybrid target recognition and localization method based on an industrial camera and a depth camera is proposed. First, YOLOv5s is adopted as the base algorithm model, given the speed and accuracy requirements of target recognition and localization. Then, to improve the accuracy of target recognition and coarse localization with the industrial camera (eye-to-hand), the AFPN feature-fusion module, the simple, parameter-free attention module (SimAM), and soft non-maximum suppression (Soft-NMS) are introduced; to improve the accuracy of target recognition and fine localization with the depth camera (eye-in-hand), the SENetV2 backbone network, a dynamic head module, a deformable attention mechanism, and a chain-of-thought prompted adaptive enhancer network are introduced. Next, a dual-camera platform for hybrid target recognition and localization is constructed, and hand–eye calibration is performed along with the collection and annotation of the image datasets required for model training. Finally, hybrid recognition and localization experiments are carried out for docking with an oil filling port. The results show that, in recognition and coarse localization with the industrial camera, the designed model achieves 99% recognition accuracy, with average localization errors of 2.22 mm and 3.66 mm in the horizontal and vertical directions, respectively; in recognition and fine localization with the depth camera, it achieves 98% recognition accuracy, with average errors of 0.12 mm, 0.28 mm, and 0.16 mm in the depth, horizontal, and vertical directions, respectively.
These results not only verify the effectiveness of the dual-camera hybrid recognition and localization approach but also demonstrate that it meets the high-precision recognition and localization requirements of complex scenes.
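Of the components named above, Soft-NMS is the most self-contained: instead of discarding every detection that overlaps a higher-scoring box (as hard NMS does), it decays the overlapping scores, which helps retain closely spaced similar targets. A minimal Gaussian Soft-NMS sketch is shown below; it is an illustration of the general technique, not the paper's implementation, and all function names and thresholds here are assumptions.

```python
import numpy as np

def iou(box, boxes):
    # Intersection-over-union of one box against many; boxes are [x1, y1, x2, y2].
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay overlapping scores instead of suppressing boxes.

    sigma and score_thresh are illustrative defaults, not values from the paper.
    Returns indices of kept detections, ordered by (decayed) score.
    """
    scores = scores.astype(float).copy()
    idxs = np.arange(len(scores))
    keep = []
    while len(idxs) > 0:
        m = np.argmax(scores[idxs])        # current highest-scoring detection
        best = idxs[m]
        keep.append(best)
        idxs = np.delete(idxs, m)
        if len(idxs) == 0:
            break
        ov = iou(boxes[best].astype(float), boxes[idxs].astype(float))
        scores[idxs] *= np.exp(-(ov ** 2) / sigma)   # Gaussian penalty on overlaps
        idxs = idxs[scores[idxs] > score_thresh]     # drop only near-zero scores
    return keep
```

With two heavily overlapping boxes of similar confidence, hard NMS would keep only one, whereas Soft-NMS keeps both with the second box's score reduced, which is the behavior that matters when similar targets sit close together.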