Pose estimation is a common problem in computer vision, and improving its accuracy is important. To this end, we propose a high-accuracy pose estimation algorithm based on the You Only Look Once (YOLO) network and a residual network, with a monocular camera as the visual sensor. The algorithm uses ArUco markers as a reference for object localization and takes a red, green, and blue (RGB) image as input. The input image is downsampled by factors of 16 and 32 to extract feature maps. The feature map obtained at 16x downsampling is passed through the pass-through layer and then combined with the feature map obtained at 32x downsampling to expand the feature dimension. A convolutional layer then processes the combined feature map, and the EPnP algorithm is used to solve for the camera pose. The output is the pose of the target object in the RGB image. On the LINEMOD dataset, a comparison using three evaluation metrics (the 2D projection metric, the ADD metric, and the 5 cm 5° metric) shows that the proposed algorithm is more accurate than traditional pose estimation algorithms. The algorithm also performs well when the target closely resembles background objects.
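To make two of the evaluation metrics named above concrete, the following is a minimal sketch of how the ADD metric and the 5 cm 5° metric are commonly computed. It is not the authors' evaluation code. It assumes poses are given as a 3x3 rotation matrix and a translation vector, that the ADD acceptance threshold is 10% of the model diameter (a common convention for LINEMOD, assumed here rather than stated in the abstract), and that translations are expressed in centimeters unless `units_per_cm` says otherwise. All function and parameter names, including the `rot_z` helper, are hypothetical.

```python
import math


def rot_z(deg):
    """Hypothetical helper: rotation matrix for a rotation of `deg` degrees about z."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


def transform(R, t, p):
    """Apply rotation R (3x3 nested list) and translation t (length 3) to point p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def add_error(R_gt, t_gt, R_pred, t_pred, model_points):
    """ADD: mean distance between model points under the true and predicted poses."""
    total = 0.0
    for p in model_points:
        total += math.dist(transform(R_gt, t_gt, p), transform(R_pred, t_pred, p))
    return total / len(model_points)


def is_correct_add(R_gt, t_gt, R_pred, t_pred, model_points, diameter, ratio=0.1):
    """Accept the pose if ADD is below `ratio` times the model diameter.

    The default ratio of 0.1 is an assumed convention, not taken from the abstract.
    """
    return add_error(R_gt, t_gt, R_pred, t_pred, model_points) < ratio * diameter


def rotation_error_deg(R_gt, R_pred):
    """Angle (degrees) of the relative rotation R_gt^T R_pred."""
    # trace(R_gt^T R_pred) equals the element-wise product summed over all entries.
    trace = sum(R_gt[k][i] * R_pred[k][i] for k in range(3) for i in range(3))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def is_correct_5cm5deg(R_gt, t_gt, R_pred, t_pred, units_per_cm=1.0):
    """Accept the pose if translation error < 5 cm and rotation error < 5 degrees."""
    translation_error = math.dist(t_gt, t_pred)
    rotation_error = rotation_error_deg(R_gt, R_pred)
    return translation_error < 5.0 * units_per_cm and rotation_error < 5.0


if __name__ == "__main__":
    # Toy example: identity ground truth, prediction off by 2 degrees and 1 cm.
    identity = rot_z(0.0)
    points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(add_error(identity, [0, 0, 0], rot_z(2.0), [1.0, 0, 0], points))
    print(is_correct_5cm5deg(identity, [0, 0, 0], rot_z(2.0), [1.0, 0, 0]))
```

The 2D projection metric follows the same pattern but compares the model points after projecting them into the image with the camera intrinsics, which are omitted here to keep the sketch short.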