The joint classification of hyperspectral image (HSI) and Light Detection and Ranging (LiDAR) data has become a prominent topic in remote sensing, since the two sources provide complementary information for each other. Nevertheless, common CNN-based fusion techniques still suffer from the following drawbacks. (1) Most of these models overlook the correlation and complementarity between the different data sources and often fail to model the long-distance dependencies of spectral information. (2) Simply concatenating the multi-source feature embeddings ignores the deep semantic relationships among them. To tackle these issues, we propose a novel graph-attention-based multimodal fusion network (GAMF). Specifically, it comprises three major components: an HSI–LiDAR feature extractor, a graph-attention-based fusion module and a classification module. In the feature extraction module, we exploit the correlation and complementarity between the multi-sensor data through parameter sharing and additionally employ Gaussian tokenization for feature transformation. To address the problem of long-distance dependencies, the deep fusion module uses modality-specific tokens to construct an undirected weighted graph, which is essentially a heterogeneous graph, and exploits the deep semantic relationships among the tokens with a graph-attention-based fusion framework. Finally, two fully connected layers classify the fused embeddings. Experimental evaluations on several benchmark HSI–LiDAR datasets (Trento, University of Houston 2013 and MUUFL) show that GAMF achieves more accurate predictions than several state-of-the-art baselines. The code is available at https://github.com/tyust-dayu/GAMF.
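To make the fusion step concrete, the sketch below illustrates the general idea of graph-attention fusion over modality-specific tokens: HSI and LiDAR tokens become nodes of a fully connected undirected graph, GAT-style attention weights the edges, and two fully connected layers classify the pooled embedding. This is a minimal illustration in PyTorch under assumed token sizes, a single attention head, and mean pooling; it is not the authors' GAMF implementation (see the repository above for that).

```python
# Minimal sketch of graph-attention fusion over modality-specific tokens.
# Assumptions (not from the paper): token dimension 64, single-head GAT-style
# attention, mean pooling over nodes before classification.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionFusion(nn.Module):
    """Fuse HSI and LiDAR tokens on a complete undirected graph with
    GAT-style attention, then pool the nodes and classify."""

    def __init__(self, dim: int = 64, num_classes: int = 7):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)    # shared node projection
        self.attn = nn.Linear(2 * dim, 1, bias=False)  # edge score a^T [h_i || h_j]
        self.classifier = nn.Sequential(               # two fully connected layers
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, hsi_tokens: torch.Tensor, lidar_tokens: torch.Tensor):
        # hsi_tokens: (B, Nh, dim), lidar_tokens: (B, Nl, dim)
        nodes = torch.cat([hsi_tokens, lidar_tokens], dim=1)  # heterogeneous node set
        h = self.proj(nodes)                                  # (B, N, dim)
        n = h.size(1)
        # Pairwise concatenation [h_i || h_j] for every edge of the complete graph.
        hi = h.unsqueeze(2).expand(-1, -1, n, -1)             # (B, N, N, dim)
        hj = h.unsqueeze(1).expand(-1, n, -1, -1)             # (B, N, N, dim)
        scores = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1))
        alpha = torch.softmax(scores, dim=-1)                 # attention weights per node
        fused = torch.bmm(alpha, h)                           # attended node embeddings
        return self.classifier(fused.mean(dim=1))             # pool nodes, then classify


# Usage with random tokens: batch of 4 samples, 9 HSI tokens and 4 LiDAR tokens.
logits = GraphAttentionFusion()(torch.randn(4, 9, 64), torch.randn(4, 4, 64))
print(logits.shape)  # torch.Size([4, 7])
```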