Perceiving the three-dimensional structure of the surrounding environment and analyzing it for autonomous movement is an indispensable element for robots to operate in scenes. Recovering depth information and the three-dimensional spatial structure from monocular images is a basic mission of computer vision. For the objects in the image, there are many scenes that may produce it. This paper proposes to use a supervised end-to-end network to perform depth estimation without relying on any subsequent processing operations, such as probabilistic graphic models and other extra fine steps. This paper uses an encoder-decoder structure with feature pyramid to complete the prediction of dense depth maps. The encoder adopts ResNeXt-50 network to achieve main features from the original image. The feature pyramid structure can merge high and low level information with each other, and the feature information is not lost. The decoder utilizes the transposed convolutional and the convolutional layer to connect as an up-sampling structure to expand the resolution of the output. The structure adopted in this paper is applied to the indoor dataset NYU Depth v2 to obtain better prediction results than other methods. The experimental results show that on the NYU Depth v2 dataset, our method achieves the best results on 5 indicators and the sub-optimal results on 1 indicator.