The dark web, an integral part of the multi-layered structure of the internet, provides anonymity and a high degree of concealment that are absent on the surface web. Users can engage in various online activities without leaving any trace, which has made the dark web a hub for illicit activities such as drug and weapon trafficking and poses a significant threat to social order and network security. Because darknet traffic is so well concealed, traditional detection methods suffer from insufficient extraction of characteristic information from the traffic and inadequate consideration of correlations between features. As a result, their accuracy in detecting darknet traffic in real network environments is suboptimal. To address these challenges, this paper proposes a darknet traffic detection framework called DarkMor. It integrates local and spatial features, automates feature mining and fusion, and models the spatial relationships between features to fully exploit their potential. DarkMor has two core components: the feature fusion module and the traffic perception module. The feature fusion module uses an improved feature tokenizer transformer architecture to enhance the extraction of local features within high-dimensional feature clusters, effectively combining local feature information with global context. The traffic perception module uses a temporal model that incorporates self-attention mechanisms to learn the spatiotemporal characteristics of the fused features, further improving the model’s detection performance. Experimental results demonstrate that DarkMor achieves an accuracy of 97.78% on real network datasets, surpassing the latest cross-modal darknet traffic detection models. Furthermore, DarkMor maintains an accuracy of 97.57% even when the number of training samples is reduced, confirming the feasibility and robustness of the proposed detection framework.
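To make the two building blocks named above more concrete, the following is a minimal illustrative sketch, not DarkMor's implementation. It assumes the common feature-tokenizer formulation in which each scalar flow feature x_j is embedded as the token x_j * W_j + b_j, followed by a single head of scaled dot-product self-attention over those tokens. All function names, sizes, and the randomly generated weights and inputs are hypothetical and chosen only for illustration.

```python
import math
import random


def tokenize_features(x, weight, bias):
    """Embed each scalar feature x[j] as the token x[j] * weight[j] + bias[j]."""
    return [[xj * w + b for w, b in zip(wj, bj)]
            for xj, wj, bj in zip(x, weight, bias)]


def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over feature tokens."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]      # transpose of k
    scores = matmul(q, k_t)                   # token-to-token similarities
    attn = [softmax([s / scale for s in row]) for row in scores]
    return matmul(attn, v), attn


def rand_matrix(rng, rows, cols):
    """Hypothetical stand-in for learned parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_features, d_token = 6, 8                    # assumed sizes, illustration only
x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]  # synthetic flow features

weight = rand_matrix(rng, n_features, d_token)
bias = rand_matrix(rng, n_features, d_token)
w_q, w_k, w_v = (rand_matrix(rng, d_token, d_token) for _ in range(3))

tokens = tokenize_features(x, weight, bias)   # one token per feature
fused, attn = self_attention(tokens, w_q, w_k, w_v)
print(len(tokens), len(tokens[0]), len(fused), len(fused[0]))
```

Each row of `attn` describes how strongly one feature token attends to every other feature, which is one simple way to represent correlations between features. A trained model would learn the weights rather than sample them, and the improved tokenizer and temporal model described in the abstract are not reproduced here.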