The Dongba manuscripts represent a unique primitive pictographic writing system that originated among the Naxi people of Lijiang, China, and has a history of more than a thousand years. Their uniqueness stems from their pronounced pictorial and ideographic characteristics. However, the digital preservation and inheritance of Dongba manuscripts face multiple challenges, including extracting their rich semantic information, recognizing individual characters, retrieving manuscripts, and automatically interpreting their meanings. Developing efficient Dongba character detection technology has therefore become a key research focus, and a standardized Dongba detection dataset is crucial for training and evaluating such techniques. In this study, we create a comprehensive Dongba manuscript detection dataset covering a variety of commonly used Dongba characters and vocabulary. We also propose a detection model named STEF. First, a Swin Transformer extracts features that capture the complex structures and diverse shapes of Dongba manuscripts. Then, a Feature Pyramid Enhancement Module cascades features of different sizes to preserve multi-scale information. Subsequently, all features are fused in a FUSION module, yielding features that represent the various styles of Dongba manuscripts. A differentiable binarisation operation dynamically adjusts the binarisation threshold of each pixel, accurately separating foreground Dongba manuscripts from the background. Lastly, deformable convolution is introduced, allowing the model to dynamically adjust the size and shape of the convolution kernel according to the size of the Dongba manuscripts, thereby better capturing the details of Dongba characters of different sizes. Experimental results show that STEF achieves a recall of 88.88%, a precision of 88.65%, and an F-measure of 88.76%, outperforming other text detection algorithms. 
Visualization experiments demonstrate that STEF performs well in detecting Dongba manuscripts of various sizes, shapes, and styles, especially those with blurred handwriting or complex backgrounds.
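
As an illustration of two quantities mentioned above, the following minimal sketch shows a per-pixel differentiable binarisation step and the F-measure computation. It assumes the sigmoid form commonly used for differentiable binarisation in text detection, B = 1 / (1 + exp(-k (P - T))), with an amplification factor k = 50; the abstract does not state the exact formulation or the value of k used in STEF, and the function names and toy maps are hypothetical.

```python
import math


def differentiable_binarisation(prob_map, thresh_map, k=50.0):
    """Soft per-pixel binarisation: B = 1 / (1 + exp(-k * (P - T))).

    prob_map and thresh_map are equally sized 2-D lists with values in [0, 1].
    k = 50 is an assumed amplification factor, not taken from the abstract.
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Toy 2x2 maps (hypothetical values): each pixel has its own threshold.
prob_map = [[0.90, 0.20], [0.55, 0.40]]
thresh_map = [[0.50, 0.50], [0.60, 0.30]]
binary_map = differentiable_binarisation(prob_map, thresh_map)

# The reported precision and recall are consistent with the reported F-measure.
reported_f = round(f_measure(88.65, 88.88), 2)
```

Because the threshold varies per pixel, a probability of 0.55 is pushed towards background where the local threshold is 0.60, while 0.40 is pushed towards foreground where the local threshold is 0.30. The soft step is differentiable, so the threshold map can in principle be learned jointly with the probability map.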