Fusing RGB and depth images in a complementary manner while suppressing task-irrelevant noise is crucial for accurate indoor RGB-D semantic segmentation. In this paper, we propose a novel deep model that leverages dual-modal non-local context to guide the aggregation of complementary features and the suppression of noise at multiple stages. Specifically, we introduce a dual-modal non-local context encoding (DNCE) module that learns a global representation for each modality at each stage; these representations are then used to facilitate cross-modal complementary clue aggregation (CCA), after which the enhanced features of the two modalities are merged. In addition, we propose a semantic guided feature rectification (SGFR) module that exploits the rich semantic clues in the top-level merged features to suppress noise in the lower-stage merged features. Both the DNCE-CCA and SGFR modules provide the dual-modal global views that are essential for effective RGB-D fusion. Experimental results on two public indoor datasets, NYU Depth V2 and SUN-RGBD, demonstrate that the proposed method outperforms state-of-the-art models of similar complexity.
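
To make the stage-wise fusion flow concrete, the following PyTorch-style sketch shows one possible instantiation of a single-stage DNCE-CCA block and an SGFR-style gate. The class names mirror the modules above, but the specific gating design, layer choices, and interfaces are illustrative assumptions for exposition rather than the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DNCE_CCA(nn.Module):
    """Hypothetical single-stage fusion block.

    DNCE part: summarize each modality into a global context vector.
    CCA part: re-weight each modality's features using the other modality's
    global context, then merge the enhanced features by element-wise addition.
    """

    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global (non-local) summary per modality
        self.rgb_gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.depth_gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, rgb_feat, depth_feat):
        b, c, _, _ = rgb_feat.shape
        rgb_ctx = self.pool(rgb_feat).view(b, c)      # global RGB representation
        depth_ctx = self.pool(depth_feat).view(b, c)  # global depth representation
        # Cross-modal aggregation: each stream is gated by the other stream's context.
        rgb_enh = rgb_feat * self.depth_gate(depth_ctx).view(b, c, 1, 1)
        depth_enh = depth_feat * self.rgb_gate(rgb_ctx).view(b, c, 1, 1)
        return rgb_enh + depth_enh                    # merged dual-modal feature


class SGFR(nn.Module):
    """Hypothetical semantic guided feature rectification.

    Top-level merged features provide a semantic gate that suppresses noisy
    responses in a lower-stage merged feature map.
    """

    def __init__(self, low_channels, top_channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(top_channels, low_channels, 1), nn.Sigmoid())

    def forward(self, low_feat, top_feat):
        # Upsample top-level semantics to the lower stage's spatial resolution.
        top_up = F.interpolate(top_feat, size=low_feat.shape[2:],
                               mode='bilinear', align_corners=False)
        return low_feat * self.gate(top_up)           # rectified lower-stage feature


if __name__ == "__main__":
    fuse = DNCE_CCA(channels=64)
    merged_low = fuse(torch.randn(2, 64, 60, 80), torch.randn(2, 64, 60, 80))
    rectify = SGFR(low_channels=64, top_channels=256)
    rectified = rectify(merged_low, torch.randn(2, 256, 15, 20))
    print(rectified.shape)  # torch.Size([2, 64, 60, 80])
```

In this sketch, both the cross-modal gating and the semantic gate operate on globally pooled or top-level information, reflecting the dual-modal global views emphasized above; the actual modules may differ in how the non-local context is computed and injected.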