AbstractAccurate tracking and reconstruction of surgical scenes is a critical enabling technology toward autonomous robotic surgery. In endoscopic examinations, computer vision has provided assistance in many aspects, such as aiding in diagnosis or scene reconstruction. Estimation of camera motion and scene reconstruction from intra-abdominal images are challenging due to irregular illumination and weak texture of endoscopic images. Current surgical 3D perception algorithms for camera and object pose estimation rely on geometric information (e.g., points, lines, and surfaces) obtained from optical images. Unfortunately, standard hand-crafted local features for pose estimation usually do not perform well in laparoscopic environments. In this paper, a novel self-supervised Surgical Perception Stereo Visual Odometer (SPSVO) framework is proposed to accurately estimate endoscopic pose and better assist surgeons in locating and diagnosing lesions. The proposed SPSVO system combines a self-learning feature extraction method and a self-supervised matching procedure to overcome the adverse effects of irregular illumination in endoscopic images. The framework of the proposed SPSVO includes image pre-processing, feature extraction, stereo matching, feature tracking, keyframe selection, and pose graph optimization. The SPSVO can simultaneously associate the appearance of extracted feature points and textural information for fast and accurate feature tracking. A nonlinear pose graph optimization method is adopted to facilitate the backend process. The effectiveness of the proposed SPSVO framework is demonstrated on a public endoscopic dataset, with the obtained root mean square error of trajectory tracking reaching 0.278 to 0.690 mm. The computation speed of the proposed SPSVO system can reach 71ms per frame.