In RGB-D (red–green–blue and depth) scene semantic segmentation, depth maps complement RGB images with rich spatial information, enabling high segmentation performance. However, properly aggregating depth information and reducing noise and information loss during feature encoding after fusion remain challenging. To address these problems, we propose a pyramid gradual-guidance network for RGB-D indoor scene semantic segmentation. First, a modality-enhancement fusion module improves the quality of the depth information by fusing it with the RGB image. Then, multiscale operations enrich the representation of semantic information. The resulting features at two adjacent scales are passed to a feature refinement module with an attention mechanism to extract semantic information. The features from adjacent modules are successively combined to form an encoding pyramid, which substantially reduces information loss and thereby preserves information integrity. Finally, during decoding, we gradually integrate same-scale features from the encoding pyramid to obtain high-quality semantic segmentation. Experimental results on two commonly used indoor scene datasets demonstrate that the proposed pyramid gradual-guidance network outperforms existing methods.
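
To make the two fusion steps described above concrete, the following is a minimal PyTorch-style sketch of (i) an RGB-guided enhancement and fusion of depth features and (ii) an attention-based refinement of features at two adjacent scales. The module names (`ModalityEnhancementFusion`, `FeatureRefinement`) and their internal designs are illustrative assumptions based on the abstract's description, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEnhancementFusion(nn.Module):
    """Enhance depth features with RGB guidance, then fuse the two modalities (assumed design)."""

    def __init__(self, channels: int):
        super().__init__()
        # Channel attention computed from the RGB stream (hypothetical choice).
        self.rgb_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Re-weight (denoise) depth channels using RGB-derived attention.
        depth_enhanced = depth * self.rgb_gate(rgb)
        # Fuse the enhanced depth features with the RGB features.
        return self.fuse(torch.cat([rgb, depth_enhanced], dim=1))


class FeatureRefinement(nn.Module):
    """Refine a fused feature with its coarser (adjacent-scale) neighbour via spatial attention."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # Upsample the coarser feature to the finer resolution.
        coarse_up = F.interpolate(coarse, size=fine.shape[-2:],
                                  mode="bilinear", align_corners=False)
        # Spatial attention derived from the coarse feature guides the fine feature.
        return self.refine(fine * self.attn(coarse_up) + coarse_up)


if __name__ == "__main__":
    rgb_feat = torch.randn(1, 64, 60, 80)     # RGB features at one encoder scale
    depth_feat = torch.randn(1, 64, 60, 80)   # depth features at the same scale
    coarse_feat = torch.randn(1, 64, 30, 40)  # fused feature from the next (coarser) scale

    fused = ModalityEnhancementFusion(64)(rgb_feat, depth_feat)
    refined = FeatureRefinement(64)(fused, coarse_feat)
    print(refined.shape)  # torch.Size([1, 64, 60, 80])
```

Stacking such refinement steps across successive scales would yield the pyramid of encoded features referred to in the abstract, whose same-scale outputs are then merged step by step during decoding.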