Most deep-learning-based multi-view stereo studies focus on improving depth prediction accuracy for noise-free images. In practice, however, clean images are difficult to obtain off the shelf, and 3D convolutional neural networks require substantial computing resources. To make full use of this computing power, different types of information can be processed simultaneously within one network. To address these two issues, this paper proposes a novel multi-stage network architecture that performs depth inference and denoising simultaneously. Specifically, 2D feature maps are first converted into 3D cost volumes containing pixel information and depth information through differentiable homography and Gaussian probability mapping. Then, in each network stage, the cost volume is fed into the regularisation module to obtain the predicted probability volumes. Furthermore, because simple static loss weights lead to training failure, the loss function is adjusted dynamically by gradient normalisation. The proposed method processes pixel information and depth information simultaneously and achieves strong results on both tasks. Extensive experimental results show that the authors’ work surpasses state-of-the-art denoising methods on the DTU dataset (with added Gaussian–Poisson noise) and is more robust to noisy images in depth inference.
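As an illustration of the homography step mentioned above, the following is a minimal sketch of the standard plane-induced homography used in plane-sweep stereo, not the authors' implementation. It assumes a pinhole camera model, the convention X_src = R X_ref + t for the relative pose, and fronto-parallel depth planes in the reference camera frame; the function names `plane_sweep_homography` and `warp_pixel` and all numerical values are hypothetical. In a network, the same mapping would be applied to feature maps with differentiable bilinear sampling, which is omitted here.

```python
import numpy as np


def plane_sweep_homography(K_ref, K_src, R, t, depth):
    """Homography taking reference pixels to source pixels for the
    fronto-parallel plane z = depth in the reference camera frame.

    Assumed convention (not taken from the paper): X_src = R @ X_ref + t.
    """
    n = np.array([0.0, 0.0, 1.0])  # plane normal along the reference optical axis
    return K_src @ (R + np.outer(t, n) / depth) @ np.linalg.inv(K_ref)


def warp_pixel(H, uv):
    """Apply a 3x3 homography to one pixel (u, v) and dehomogenise."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]


# Illustrative values only: shared intrinsics, a small rotation about the
# y-axis, and a short baseline between the two views.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = 0.05
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.1, 0.0, 0.02])

# One homography per depth hypothesis; warping source features with each of
# them and stacking the results along the depth axis yields a cost volume.
depth_hypotheses = [1.5, 2.0, 2.5]
homographies = [plane_sweep_homography(K, K, R, t, d) for d in depth_hypotheses]
```

Sweeping the depth hypothesis changes only the `t nᵀ / depth` term, so the homographies for all depth planes can be computed in a single batched operation.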