Compression-distorted multi-view video plus depth (MVD) should be enhanced at the receiver side without access to the original signals. This is especially true for the depth maps, because they describe positions in 3D space and are essential for subsequent virtual view synthesis. However, two challenges arise: how to exploit multi-modality priors from neighboring viewpoints, and how to handle vanishing gradients when textureless depth maps are involved. In this paper, we propose a multi-modality residual network to enhance the quality of compressed multi-view depth video. Taking advantage of the high correlation among viewpoints, depth maps from adjacent views are used as guidance for enhancing the depth video in the target view. Color frames in the target view are also used to provide information about object contours, which yields multi-modality guidance. The proposed network is organized as a deep residual network to remove distortion effectively and restore details. Because these guidance sources differ in their correlation with the target depth video, and not all of their information contributes to the enhancement, an adaptive skip structure is designed to weight the contribution of each prior appropriately. Experimental results show that our scheme outperforms other benchmarks, achieving average gains of 1.935 dB in PSNR and 0.0227 in SSIM over all test sequences. Objective, subjective, and 3D reconstruction results all suggest that our method provides superior performance in practical applications.
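
To put the reported PSNR gain in perspective, the following minimal sketch computes PSNR from its standard definition and converts a gain in dB into the corresponding reduction in mean squared error (MSE). It is an illustration only, not the evaluation code used in this work: the function names `psnr` and `mse_ratio_for_gain`, the toy 2x2 patches, and the peak value of 255 (which assumes 8-bit depth maps) are all assumptions made for the example.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """PSNR in dB between two equally sized frames given as nested lists.

    peak=255 assumes 8-bit samples (an assumption for this sketch).
    """
    squared_errors = [
        (r - d) ** 2
        for ref_row, dis_row in zip(reference, distorted)
        for r, d in zip(ref_row, dis_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)


# Hypothetical 2x2 depth patches, for illustration only.
original = [[100, 120], [140, 160]]
compressed = [[104, 116], [144, 156]]  # every sample off by 4, MSE = 16
enhanced = [[102, 118], [142, 158]]    # every sample off by 2, MSE = 4

gain = psnr(original, enhanced) - psnr(original, compressed)
print(f"toy PSNR gain: {gain:.3f} dB")
print(f"MSE ratio for a 1.935 dB gain: {mse_ratio_for_gain(1.935):.4f}")
```

Under this definition, an average gain of 1.935 dB corresponds to the MSE falling to roughly 0.64 of its pre-enhancement value, that is, a reduction of about 36%.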