Owing to their rich spectral and spatial information, hyperspectral images (HSIs) can be used to finely classify different land covers. With the emergence of deep learning techniques, convolutional neural networks (CNNs), fully convolutional networks (FCNs), and recurrent neural networks (RNNs) have been widely applied to HSI classification. Recently, transformer-based approaches, represented by the Vision Transformer (ViT), have yielded promising performance on numerous tasks and have been introduced to classify HSIs. However, existing methods built on these architectures still face three crucial issues that limit classification performance: 1) geometric constraints imposed by the input data, 2) the fuzzy contribution of detail-rich central pixels, and 3) the interaction gap between local areas and farther environments. To tackle these problems, an interactive learning framework inspired by ViT is proposed from a center-to-surrounding perspective, namely the center-to-surrounding interactive learning (CSIL) framework. Different from existing works, the CSIL framework achieves multi-scale, detail-aware, and space-interactive classification through a well-designed hierarchical region sampling strategy, a center transformer, and a surrounding transformer. Specifically, the hierarchical region sampling strategy is first proposed to flexibly generate the center, neighbor, and surrounding regions; the resulting multi-scale input breaks the geometric constraints. Second, the center transformer is presented to capture core characteristics in detail from the center region; in this way, central pixels are strongly highlighted and their details are easily perceived. Third, the surrounding transformer, which includes interactive self-attention learning, is formulated to exchange information between the locally fine-grained distributions in the neighbor region and the farther coarse-grained environments in the surrounding region. With this structure, short- and long-term dependencies can be modeled, emphasized, and exchanged to bridge the interaction gap. Finally, the features from the center and surrounding transformers are integrated and fed into a multi-layer perceptron to optimize the semantic representation. Extensive experiments on six HSI datasets covering small-, medium-, and large-scale scenes demonstrate superiority over state-of-the-art CNN-, FCN-, RNN-, and transformer-based approaches, even with very few training samples (e.g., 0.19% in the complex HanChuan city scene). The source code will be available soon at https://github.com/jqyang22/CSIL.
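To make the described pipeline concrete, the following is a minimal PyTorch sketch of the overall idea, not the authors' implementation: nested region sampling around a target pixel, a small transformer over the center region, and cross-attention between neighbor and surrounding tokens standing in for the paper's interactive self-attention learning. All names (`sample_regions`, `CSILSketch`), region sizes, layer counts, and the fusion scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def sample_regions(cube, i, j, sizes=(3, 9, 27)):
    """Hierarchical region sampling (hypothetical): crop nested square
    center/neighbor/surrounding regions around pixel (i, j) of an HSI cube."""
    pad = sizes[-1] // 2
    # Reflect-pad so regions around border pixels remain valid.
    padded = F.pad(cube.permute(2, 0, 1).unsqueeze(0),
                   (pad, pad, pad, pad), mode="reflect")
    padded = padded.squeeze(0).permute(1, 2, 0)            # (H+2p, W+2p, bands)
    ci, cj = i + pad, j + pad
    regions = []
    for s in sizes:
        h = s // 2
        patch = padded[ci - h:ci + h + 1, cj - h:cj + h + 1]
        regions.append(patch.reshape(-1, cube.shape[-1]))  # (s*s, bands) tokens
    return regions


class CSILSketch(nn.Module):
    """Toy center/surrounding transformers; cross-attention approximates the
    interactive self-attention between neighbor and surrounding regions."""

    def __init__(self, bands, dim=64, heads=4, classes=16):
        super().__init__()
        self.embed = nn.Linear(bands, dim)   # per-pixel spectral embedding
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.center_tf = nn.TransformerEncoder(layer, num_layers=2)
        self.neighbor_tf = nn.TransformerEncoder(layer, num_layers=1)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                 nn.Linear(dim, classes))

    def forward(self, center_px, neighbor_px, surround_px):
        c = self.center_tf(self.embed(center_px))   # detail-aware core features
        n = self.embed(neighbor_px)                 # fine-grained local tokens
        s = self.embed(surround_px)                 # coarse-grained context tokens
        # Neighbor tokens query the surrounding context (local <-> far exchange).
        ctx, _ = self.cross_attn(n, s, s)
        n = self.neighbor_tf(n + ctx)
        # Integrate center and surrounding features; classify with an MLP.
        return self.mlp(torch.cat([c.mean(1), n.mean(1)], dim=-1))


# Usage on a toy cube: classify the pixel at (30, 30).
cube = torch.randn(64, 64, 103)                            # (H, W, bands)
regions = [r.unsqueeze(0) for r in sample_regions(cube, 30, 30)]
logits = CSILSketch(bands=103)(*regions)                   # (1, classes)
```

In this sketch the three region sizes play the roles of the center, neighbor, and surrounding regions, so the multi-scale input is produced by sampling alone and no fixed patch geometry is baked into the network; how the actual CSIL framework fuses or weights the two feature streams is specified in the paper, not here.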