Birds play a pivotal role in ecosystem and biodiversity research, and accurate bird identification contributes to the monitoring of biodiversity, understanding of ecosystem functionality, and development of effective conservation strategies. Current methods for bird sound recognition often involve processing bird songs into various acoustic features or fusion features for identification, which can result in information loss and complicate the recognition process. At the same time, the recognition method based on raw bird audio has not received widespread attention. Therefore, this study proposes a bird sound recognition method that utilizes multiple one-dimensional convolutional neural networks to directly learn feature representations from raw audio data, simplifying the feature extraction process. We also apply positional embedding convolution and multiple Transformer modules to enhance feature processing and improve accuracy. Additionally, we introduce a trainable weight array to control the importance of each Transformer module for better generalization of the model. Experimental results demonstrate our model’s effectiveness, with an accuracy rate of 99.58% for the public dataset Birds_data, as well as 98.77% for the Birdsonund1 dataset, and 99.03% for the UrbanSound8K environment sound dataset.