Hyperspectral images (HSIs), acquired as a 3D data set, contain spectral and spatial information that is important for ground–object recognition. A 3D convolutional neural network (3DCNN) could therefore be more suitable than a 2D one for extracting multiscale neighborhood information in the spectral and spatial domains simultaneously, if it is not restrained by mass parameters and computation cost. In this paper, we propose a novel lightweight multilevel feature fusion network (LMFN) that can achieve satisfactory HSI classification with fewer parameters and a lower computational burden. The LMFN decouples spectral–spatial feature extraction into two modules: point-wise 3D convolution to learn correlations between adjacent bands with no spatial perception, and depth-wise convolution to obtain local texture features while the spectral receptive field remains unchanged. Then, a target-guided fusion mechanism (TFM) is introduced to achieve multilevel spectral–spatial feature fusion between the two modules. More specifically, multiscale spectral features are endowed with spatial long-range dependency, which is quantified by central target pixel-guided similarity measurement. Subsequently, the results obtained from shallow to deep layers are added, respectively, to the spatial modules, in an orderly manner. The TFM block can enhance adjacent spectral correction and focus on pixels that actively boost the target classification accuracy, while performing multiscale feature fusion. Experimental results across three benchmark HSI data sets indicate that our proposed LMFN has competitive advantages, in terms of both classification accuracy and lightweight deep network architecture engineering. More importantly, compared to state-of-the-art methods, the LMFN presents better robustness and generalization.