Environmental noise and dim lighting in aquaculture environments degrade unimodal fish behavior recognition methods that rely on either sound or visual cues alone. To address this, this paper proposes Mul-SEResNet50, a fish behavior recognition model based on the fusion of audio and visual information. Because image blurring and indistinct sounds in aquaculture environments hinder effective multimodal fusion and weaken the complementarity between modalities, a multimodal interaction fusion (MIF) module is introduced. This module integrates the audio and visual modalities at multiple stages to obtain a more comprehensive joint feature representation. To enhance complementarity during fusion, a U-shaped bilinear fusion structure is designed to fully utilize multimodal information, capture cross-modal associations, and extract high-level features. Furthermore, to mitigate the potential loss of key features, a temporal aggregation and pooling (TAP) layer is introduced, which preserves more fine-grained features by extracting both the maximum and average values within each pooling region. Ablation and comparative experiments are conducted to validate the proposed model. The results show that Mul-SEResNet50 achieves a 5.04% accuracy improvement over SEResNet50 without sacrificing detection speed. Compared with the state-of-the-art U-FusionNet-ResNet50+SENet model, Mul-SEResNet50 improves accuracy and F1 score by 0.47% and 1.32%, respectively. These findings confirm that the proposed model recognizes fish behavior accurately and can support precise monitoring of fish behavior.
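The idea of keeping both the maximum and the average value of each pooling region, as described for the TAP layer, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `max_avg_pool_1d`, the non-overlapping temporal windows, and the choice to concatenate the two statistics along the channel axis are all assumptions made for illustration, since the abstract does not specify how the two values are combined.

```python
import numpy as np


def max_avg_pool_1d(features, window):
    """Pool a (time, channels) array over non-overlapping temporal windows.

    Hypothetical helper: for every window it keeps both the per-channel
    maximum and the per-channel average, then concatenates them
    (concatenation is an assumption, not taken from the paper).
    """
    features = np.asarray(features, dtype=float)
    num_steps, num_channels = features.shape
    num_regions = num_steps // window
    if num_regions == 0:
        raise ValueError("window is longer than the time axis")

    # Drop trailing steps that do not fill a whole window.
    trimmed = features[: num_regions * window]
    # Shape: (regions, window, channels)
    regions = trimmed.reshape(num_regions, window, num_channels)

    region_max = regions.max(axis=1)   # strongest response per region
    region_avg = regions.mean(axis=1)  # overall context per region

    # Output shape: (regions, 2 * channels)
    return np.concatenate([region_max, region_avg], axis=1)


example = np.array([[1, 2], [3, 0], [5, 6], [7, 4]])
pooled = max_avg_pool_1d(example, window=2)
```

For the four-step, two-channel `example` above, each output row holds the two channel maxima followed by the two channel averages of one window. Max pooling alone would discard the averages, and average pooling alone would smooth away the peaks, which is the loss of fine-grained information that the TAP layer is meant to reduce.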