ABSTRACT

Depth estimation is essential for obtaining depth information in robotic surgery and augmented-reality applications within current laparoscopic surgical robot systems. Because ground-truth depth values and laparoscope motions are unavailable during operation, supervised depth estimation networks struggle to predict depth maps from laparoscopic images. Generating accurate depth maps across the varied environments seen in abdominal images is a further challenge. To tackle these problems, we propose a novel monocular self-supervised depth estimation network with a sparse nest architecture. We design a non-local block to capture broader and deeper context features, which further enhances the network's generalisation across scene variations between datasets. Moreover, we introduce an improved multi-mask term in the loss function that exploits the temporal information of stereo videos to tackle the classical occlusion problem. We also model heteroscedastic aleatoric uncertainty to reduce the effect of noisy data on depth estimation. We compared the proposed method with existing methods on different scenes across datasets. The experimental results show that the proposed model outperforms state-of-the-art models both qualitatively and quantitatively.
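As background on the uncertainty term, heteroscedastic aleatoric uncertainty is commonly incorporated by having the network predict a per-pixel log-variance alongside depth and attenuating each residual accordingly (in the spirit of Kendall and Gal's formulation). The abstract does not give the paper's exact loss, so the sketch below is only a minimal NumPy illustration; the function and variable names are ours.

```python
import numpy as np

def uncertainty_weighted_loss(residual, log_var):
    """Heteroscedastic aleatoric loss sketch (Kendall & Gal style).

    Each per-pixel squared residual r^2 is attenuated by the predicted
    log-variance s via exp(-s), while the additive s term penalises the
    network for predicting trivially large variance everywhere:
        L = mean( exp(-s) * r^2 + s )
    """
    return np.mean(np.exp(-log_var) * residual**2 + log_var)

# Toy example: three pixels, the last one noisy (large photometric error).
residual = np.array([0.1, 0.1, 2.0])

# Uniform confidence (s = 0) vs. large predicted variance on the noisy pixel.
uniform = np.zeros(3)
adaptive = np.array([0.0, 0.0, 2.0])

# Assigning high variance to the noisy pixel lowers the total loss,
# so the optimiser is not dominated by unreliable measurements.
assert uncertainty_weighted_loss(residual, adaptive) < \
       uncertainty_weighted_loss(residual, uniform)
```

The design intuition is that the network learns to "explain away" noisy pixels (specular highlights, smoke, fluid) by predicting high variance for them, instead of fitting the noise in the depth output.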