Multi-modal remote sensing (RS) image segmentation aims to comprehensively utilize multiple RS modalities to assign pixel-level semantics to the studied scenes, which can provide a new perspective for global city understanding. Multi-modal segmentation inevitably encounters the challenge of modeling intra- and inter-modal relationships, i.e., object diversity and modal gaps. However, the previous methods are usually designed for a single RS modality, limited by the noisy collection environment and poor discrimination information. Neuropsychology and neuroanatomy confirm that the human brain performs the guiding perception and integrative cognition of multi-modal semantics through intuitive reasoning. Therefore, establishing a semantic understanding framework inspired by intuition to realize multi-modal RS segmentation becomes the main motivation of this work. Drived by the superiority of hypergraphs in modeling high-order relationships, we propose an intuition-inspired hypergraph network (I2HN) for multi-modal RS segmentation. Specifically, we present a hypergraph parser to imitate guiding perception to learn intra-modal object-wise relationships. It parses the input modality into irregular hypergraphs to mine semantic clues and generate robust mono-modal representations. In addition, we also design a hypergraph matcher to dynamically update the hypergraph structure from the explicit correspondence of visual concepts, similar to integrative cognition, to improve cross-modal compatibility when fusing multi-modal features. Extensive experiments on two multi-modal RS datasets show that the proposed I2HN outperforms the state-of-the-art models, achieving F1/mIoU accuracy 91.4%/82.9% on the ISPRS Vaihingen dataset, and 92.1%/84.2% on the MSAW dataset. The complete algorithm and benchmark results will be available online.