Detecting digital audio tampering is essential for ensuring judicial fairness and societal security. Traditional methods, primarily based on the Electric Network Frequency (ENF), rely on singular, static features and overlook the temporal dynamics inherent in ENF data, resulting in suboptimal detection accuracy. To address these limitations, this paper introduces ENFformer, a novel transformer model that detects digital audio tampering by leveraging both short- and long-term temporal features of ENF data. The model first extracts zero-order and first-order phase features via the discrete Fourier transform (DFT0 and DFT1), together with frequency features obtained through the Hilbert transform, and converts these features into temporal sequences with a frame-based algorithm. A two-layer one-dimensional Convolutional Long Short-Term Memory (ConvLSTM) network then captures short-term temporal features, followed by a Bidirectional Long Short-Term Memory (BiLSTM) network that integrates long-term features. A branch attention mechanism fuses these long-term features, which a transformer module further refines for accurate tampered-audio identification. Empirical evaluations on the Carioca and ENF-EDIT1 databases show that ENFformer achieves detection accuracies of 97.33% and 93.50%, respectively, surpassing existing state-of-the-art methods. These results confirm the effectiveness of our approach, which advances digital audio tampering detection by incorporating a comprehensive analysis of temporal information in ENF features. The source code of this study is publicly available at https://github.com/CCNUZFW/ENFformer.
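As a minimal illustration of the Hilbert-transform frequency feature mentioned above, the sketch below estimates the instantaneous frequency of a synthetic 50 Hz tone from its analytic signal. This is an assumption-laden sketch, not the paper's implementation: the sampling rate, signal, and FFT-based Hilbert construction are hypothetical choices for demonstration only.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via FFT-based Hilbert transform:
    keep DC, double positive frequencies, zero negative ones."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0      # Nyquist bin for even-length signals
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def instantaneous_frequency(x, fs):
    """Instantaneous frequency (Hz) from the unwrapped phase
    of the analytic signal."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    return np.diff(phase) * fs / (2.0 * np.pi)

# Hypothetical example: a clean 50 Hz ENF-like tone sampled at 1 kHz.
fs = 1000
t = np.arange(0, 2, 1 / fs)
enf = np.sin(2 * np.pi * 50 * t)
f_inst = instantaneous_frequency(enf, fs)
# Trim edge samples, where the finite-window Hilbert transform is least reliable.
print(round(float(np.mean(f_inst[100:-100])), 2))
```

In practice, a real pipeline would band-pass filter the recording around the nominal grid frequency (50/60 Hz) before this step; discontinuities in the resulting frequency track are what tampering detectors look for.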