Real-time estimation of fish biomass plays a crucial role in fishery production, as it informs feeding strategies and other management decisions. In this paper, a dense fish counting network called Swin-CSRNet is proposed. Specifically, the VGG16 layers in the front-end are replaced with a Swin Transformer to extract image features more efficiently. Additionally, a squeeze-and-excitation (SE) module is introduced to enhance feature representation by dynamically reweighting each channel through “squeeze” and “excitation” operations, making the extracted features more focused and effective. Finally, a multi-scale fusion (MSF) module is added after the back-end to exploit multi-scale feature information and improve the model’s ability to capture details at different scales. Experiments show that Swin-CSRNet achieved an MAE of 11.22, an RMSE of 15.32, a MAPE of 5.18%, and a correlation coefficient R² of 0.954. Meanwhile, compared with the original network, the parameter size and computational complexity of Swin-CSRNet were reduced by 70.17% and 79.05%, respectively. The proposed method therefore counts fish with higher speed and accuracy and contributes to advancing the automation of aquaculture.
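For readers unfamiliar with the four reported metrics, the following is a minimal sketch of how they are commonly computed from per-image ground-truth and predicted counts. The function name `counting_metrics` and the sample counts are hypothetical, and R² is assumed here to be the coefficient of determination (1 − SS_res/SS_tot); the abstract does not state which formula was used, so the sketch should not be read as the authors' evaluation code.

```python
import math


def counting_metrics(true_counts, predicted_counts):
    """Return MAE, RMSE, MAPE (in percent) and R^2 for two count sequences.

    Assumption: R^2 is the coefficient of determination; the source
    abstract does not specify the exact definition it uses.
    """
    if len(true_counts) != len(predicted_counts) or len(true_counts) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in true_counts):
        raise ValueError("MAPE is undefined when a true count is zero")

    n = len(true_counts)
    errors = [p - t for t, p in zip(true_counts, predicted_counts)]

    # Mean absolute error: average magnitude of the counting error.
    mae = sum(abs(e) for e in errors) / n
    # Root mean squared error: penalizes large errors more heavily.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # Mean absolute percentage error: error relative to the true count.
    mape = 100.0 * sum(abs(e) / t for e, t in zip(errors, true_counts)) / n

    # Coefficient of determination: 1 - residual / total sum of squares.
    mean_true = sum(true_counts) / n
    ss_tot = sum((t - mean_true) ** 2 for t in true_counts)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true counts are equal")
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2}


if __name__ == "__main__":
    # Hypothetical counts for four images, for illustration only.
    metrics = counting_metrics([100, 200, 300, 400], [110, 190, 310, 390])
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

Because MAPE divides by the true count, images with zero fish would need separate handling; the sketch simply rejects them.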