Advanced scientific and technological measurement methods are the basis of sports research and of progress in its research methods. They drive the continuous improvement of sports technology to support athletes' pursuit of excellent performance. Sports research must therefore shift from the traditional experience-based training mode to programmed training, and from qualitative analysis of the training effect to detailed analysis of the training process. On this basis, this paper applies deep learning to sports motion capture. First, two-dimensional human joints are extracted with Mask R-CNN. Then a 3D human motion skeleton is constructed with a binocular vision system, and the Mask R-CNN human pose estimation algorithm is optimized. Finally, a sports motion capture system is designed and its accuracy is verified. In experiments at a distance of about 2 m, the error of the depth information obtained by the system is less than 3%, which indicates strong practicability.
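The step from 2D joints to a 3D skeleton with a binocular system can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes an ideal rectified stereo pair, so that depth follows from horizontal disparity as Z = f·B/d, and the names `StereoRig`, `triangulate_joint`, and `relative_depth_error`, as well as all calibration values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StereoRig:
    """Hypothetical calibration of an ideal rectified stereo pair."""
    focal_px: float    # focal length in pixels (assumed equal for both cameras)
    baseline_m: float  # distance between the two camera centres, in metres
    cx: float          # principal point x, in pixels
    cy: float          # principal point y, in pixels


def triangulate_joint(
    rig: StereoRig,
    left_xy: Tuple[float, float],
    right_xy: Tuple[float, float],
) -> Tuple[float, float, float]:
    """Recover one joint's 3D position (metres, left-camera frame).

    left_xy and right_xy are the pixel coordinates of the same joint in the
    left and right images, e.g. as produced by a 2D keypoint detector.
    """
    disparity = left_xy[0] - right_xy[0]
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    z = rig.focal_px * rig.baseline_m / disparity  # depth from disparity
    x = (left_xy[0] - rig.cx) * z / rig.focal_px   # back-project pixel to metres
    y = (left_xy[1] - rig.cy) * z / rig.focal_px
    return (x, y, z)


def relative_depth_error(estimated_z: float, true_z: float) -> float:
    """Relative depth error, the quantity compared against a percentage bound."""
    return abs(estimated_z - true_z) / true_z
```

Applying `triangulate_joint` to every matched joint pair yields the 3D skeleton for one frame. Under this model a fixed pixel error in disparity produces a depth error that grows roughly with the square of the distance, which is why an accuracy figure is tied to a stated working distance.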