Hyperspectral image (HSI) classification plays a crucial role in remote sensing (RS) applications, enabling the precise identification of materials and land cover from spectral information and thereby supporting tasks such as agricultural management and urban planning. While sequential neural models such as Recurrent Neural Networks (RNNs) and Transformers have been adapted to this task, both have limitations: RNNs struggle to aggregate features and are sensitive to noise from interfering pixels, whereas Transformers demand extensive computational resources and tend to underperform when HSI datasets offer limited or imbalanced training samples. To address these challenges, Mamba architectures have emerged, striking a balance between RNNs and Transformers through lightweight, parallelizable scanning. Although models such as Vision Mamba (ViM) and Visual Mamba (VMamba) have shown gains on visual tasks, their application to HSI classification remains underexplored, particularly for handling land-cover semantic tokens and for multi-scale feature aggregation in patch-wise classifiers. In response, this study introduces the Mamba-in-Mamba (MiM) architecture for HSI classification, a pioneering effort in this domain. The MiM model features: (1) a novel centralized Mamba-Cross-Scan (MCS) mechanism for efficient image-to-sequence data transformation; (2) a Tokenized Mamba (T-Mamba) encoder that incorporates a Gaussian Decay Mask (GDM), a Semantic Token Learner (STL), and a Semantic Token Fuser (STF) for enhanced feature generation; and (3) a Weighted MCS Fusion (WMF) module with a Multi-Scale Loss Design for improved training efficiency. Experiments on four public HSI datasets (Indian Pines, Pavia University, Houston2013, and WHU-Hi-Honghu) show that our method improves overall accuracy by up to 3.3%, 2.7%, 1.5%, and 2.3%, respectively, over state-of-the-art approaches (e.g., SSFTT and MAEST) under both fixed and disjoint training-testing settings.
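
To make the first two components concrete, the following is a minimal sketch of what a centralized cross-scan combined with a Gaussian decay mask might look like for a single HSI patch. The exact scan orderings, the mask formula, and all function names here are assumptions for illustration, not the authors' released implementation: we assume the GDM down-weights pixels by their spatial distance from the patch center (the pixel being classified) and that the MCS unrolls the masked patch into directional token sequences ordered center-outward.

```python
import numpy as np

def gaussian_decay_mask(patch_size: int, sigma: float = 2.0) -> np.ndarray:
    """Hypothetical Gaussian decay mask: weights each pixel of a square
    patch by its squared distance to the center pixel, so tokens far from
    the pixel being classified contribute less to the sequence."""
    c = patch_size // 2
    ys, xs = np.mgrid[0:patch_size, 0:patch_size]
    d2 = (ys - c) ** 2 + (xs - c) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))  # shape: (patch_size, patch_size)

def centralized_cross_scan(patch: np.ndarray) -> list[np.ndarray]:
    """Hypothetical centralized cross-scan: flattens an HSI patch of shape
    (H, W, C) into 1-D token sequences ordered by distance from the center
    pixel, one sequence per scan direction."""
    h, w, _ = patch.shape
    c_y, c_x = h // 2, w // 2
    coords = [(y, x) for y in range(h) for x in range(w)]
    # Sort pixels center-outward so the center pixel is always the first token.
    center_out = sorted(coords, key=lambda p: (p[0] - c_y) ** 2 + (p[1] - c_x) ** 2)
    forward = np.stack([patch[y, x] for y, x in center_out])  # center -> border
    backward = forward[::-1]                                  # border -> center
    return [forward, backward]  # each of shape (H * W, C)

# Usage: a 7x7 patch with 200 spectral bands.
patch = np.random.rand(7, 7, 200)
mask = gaussian_decay_mask(7)[..., None]           # broadcast over the band axis
sequences = centralized_cross_scan(patch * mask)   # two directional token sequences
```

Each resulting sequence would then be fed to a Mamba encoder; scanning in multiple directions and fusing the results is what compensates for the 1-D nature of state-space scans on 2-D imagery.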
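For the third component, one plausible reading of the multi-scale loss design is a weighted sum of per-scale classification losses, so that the classifier head attached to each scan scale stays discriminative on its own. The sketch below shows that pattern under this assumption; `multi_scale_loss`, the uniform default weights, and the use of cross-entropy are illustrative choices, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def multi_scale_loss(logits_per_scale, target, scale_weights=None):
    """Hypothetical multi-scale loss: a weighted sum of cross-entropy
    losses over the logits produced at each scan scale."""
    if scale_weights is None:  # default to uniform weighting across scales
        scale_weights = [1.0 / len(logits_per_scale)] * len(logits_per_scale)
    return sum(w * F.cross_entropy(logits, target)
               for w, logits in zip(scale_weights, logits_per_scale))

# Usage: three scales, a batch of 8 patches, 16 land-cover classes.
target = torch.randint(0, 16, (8,))
logits_per_scale = [torch.randn(8, 16) for _ in range(3)]
loss = multi_scale_loss(logits_per_scale, target)
```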