With the growing integration of drones into various civilian applications, the demand for effective automatic drone identification (ADI) technology has become essential to monitor malicious drone flights and mitigate potential threats. While numerous convolutional neural network (CNN)-based methods have been proposed for ADI tasks, the inherent local connectivity of the convolution operator in CNN models severely constrains RF signal identification performance. In this paper, we propose an innovative hybrid transformer model featuring a CNN-based tokenization method that is capable of generating T-F tokens enriched with significant local context information, and complemented by an efficient gated self-attention mechanism to capture global time/frequency correlations among these T-F tokens. Furthermore, we underscore the substantial impact of incorporating phase information into the input of the SignalFormer model. We evaluated the proposed method on two public datasets under Gaussian white noise and co-frequency signal interference conditions, The SignalFormer model achieved impressive identification accuracy of 97.57% and 98.03% for coarse-grained identification tasks, and 97.48% and 98.16% for fine-grained identification tasks. Furthermore, we introduced a class-incremental learning evaluation to demonstrate SignalFormer's competence in handling previously unseen categories of drone signals. The above results collectively demonstrate that the proposed method is a promising solution for supporting the ADI task in reliable ways.