ABSTRACT The Transformer has become pivotal for the integrated analysis of multi-source remote-sensing (RS) data in Earth observation, particularly in applications such as the fusion classification of hyperspectral images (HSI) and Light Detection and Ranging (LiDAR) data. However, existing methods often employ Transformers as generic feature extractors, applying similar processing blocks to the different modalities from multi-source sensors and thereby overlooking differences in their imaging principles and data characteristics. Moreover, during feature extraction from the different sensor data, the necessary cross-modal information interaction is lacking, so the complementary information between sensors is insufficiently exploited and fusion results are suboptimal. In this paper, we propose an interactive Transformer and CNN network for the fusion classification of HSI and LiDAR data. Specifically, a heterogeneous three-branch network architecture is designed for HSI and LiDAR data, in which Transformers encapsulate global contextual spatial and spectral information for HSI, while CNNs capture geometric elevation patterns for LiDAR data. Elevation-Spatial Interaction (ESI) and Spectral-Spatial Interaction (SSI) modules are then introduced for multi-stage feature interaction. ESI enables the CNN-Transformer branches to focus on essential local elevation details while simultaneously modelling global contextual spatial information. SSI enables the Transformer-Transformer branches to cyclically intertwine spectral and spatial information for long-range spectral-spatial feature fusion. Finally, the interacted elevation, spatial, and spectral features are adaptively and hierarchically fused by a Gated Fusion module, yielding a unified elevation-spatial-spectral representation. Experiments conducted on three benchmark HSI-LiDAR datasets demonstrate the effectiveness of the proposed approach.
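The abstract does not specify the internal form of the Gated Fusion module, but the general idea of adaptively gating between two feature streams can be illustrated with a minimal sketch. The sketch below is an assumption for illustration only: it uses a hypothetical element-wise sigmoid gate (with toy scalar weights `w_a`, `w_b` and bias `bias`) to blend, say, a spatial feature vector with an elevation feature vector; the paper's actual module may be structured differently.

```python
import math

def gated_fusion(a, b, w_a=1.0, w_b=1.0, bias=0.0):
    """Blend two feature vectors with an element-wise sigmoid gate.

    For each element pair (x, y):
        gate  = sigmoid(w_a * x + w_b * y + bias)
        fused = gate * x + (1 - gate) * y
    Since gate lies in (0, 1), each fused element is a convex
    combination of the corresponding inputs.
    """
    fused = []
    for x, y in zip(a, b):
        gate = 1.0 / (1.0 + math.exp(-(w_a * x + w_b * y + bias)))
        fused.append(gate * x + (1.0 - gate) * y)
    return fused

# Toy example: fuse a "spatial" and an "elevation" feature vector.
spatial = [0.2, -0.5, 1.3, 0.0]
elevation = [1.0, 0.4, -0.7, 0.1]
fused = gated_fusion(spatial, elevation)
```

Because the gate is input-dependent, the blend weight varies per element, which is what lets a gated module emphasize whichever modality is more informative at each feature position.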