Accurately identifying fish feeding intensity plays a vital role in aquaculture. Traditional methods are limited to a single modality (e.g., water quality, vision, or audio) and often lack a comprehensive representation, leading to low identification accuracy. In contrast, multimodal fusion methods combine features from different modalities to obtain richer target features, thereby significantly enhancing the performance of fish feeding intensity assessment (FFIA). In this work, a multimodal dataset called MRS-FFIA was introduced. The MRS-FFIA dataset consists of 7611 labelled audio, video, and acoustic samples, divided into four feeding intensity classes (strong, medium, weak, and none). To address the limitations of single-modality methods, a Multimodal Fusion of Fish Feeding Intensity (MFFFI) model was proposed. The MFFFI model first extracts deep features from three modalities: audio (Mel), video (RGB), and acoustic (SI). Then, image stitching techniques are employed to fuse these extracted features. Finally, the fused features are passed through a classifier to obtain the results. The test results show that the fused multimodal information achieves an accuracy of 99.26%, an improvement of 12.80%, 13.77%, and 2.86%, respectively, over the best single-modality results (audio, video, and acoustic). This result demonstrates that the proposed method classifies fish feeding intensity more accurately than any single modality alone. In addition, compared with mainstream single-modality approaches, the model improves accuracy by 1.5%–10.8% and is markedly more lightweight. The multimodal fusion method can therefore optimise feeding decisions effectively, providing technical support for the development of intelligent feeding systems.
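The three-stage pipeline described above (per-modality features, stitching-based fusion, classification) can be illustrated with a minimal sketch. The abstract does not specify the feature extractors, the stitching layout, or the classifier, so everything here is an assumption made for illustration: random arrays stand in for the extracted Mel, RGB, and SI feature maps, the maps are resized to a common size and placed side by side, and a randomly initialised linear layer with softmax stands in for the classifier. The names `resize_nearest`, `stitch_features`, and `classify` are hypothetical and do not come from the MFFFI model.

```python
import numpy as np

# The four feeding-intensity classes named in the abstract.
CLASSES = ["strong", "medium", "weak", "none"]


def resize_nearest(feature_map, height, width):
    """Nearest-neighbour resize of a 2-D map to (height, width)."""
    rows = np.arange(height) * feature_map.shape[0] // height
    cols = np.arange(width) * feature_map.shape[1] // width
    return feature_map[np.ix_(rows, cols)]


def stitch_features(mel, rgb, si, size=(32, 32)):
    """Assumed stitching layout: three equally sized tiles side by side."""
    tiles = [resize_nearest(m, *size) for m in (mel, rgb, si)]
    return np.concatenate(tiles, axis=1)  # shape: (H, 3 * W)


def classify(fused, weights, bias):
    """Placeholder classifier: one linear layer followed by softmax."""
    logits = fused.reshape(-1) @ weights + bias
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return CLASSES[int(np.argmax(probs))], probs


# Random stand-ins for the per-modality feature maps (not real data).
rng = np.random.default_rng(0)
mel = rng.random((64, 100))  # audio (Mel)
rgb = rng.random((56, 56))   # video (RGB), reduced to one channel here
si = rng.random((48, 80))    # acoustic (SI)

fused = stitch_features(mel, rgb, si)  # fused.shape == (32, 96)

# Untrained, randomly initialised weights: the prediction is meaningless.
weights = rng.normal(scale=0.01, size=(fused.size, len(CLASSES)))
bias = np.zeros(len(CLASSES))
label, probs = classify(fused, weights, bias)
print(fused.shape, label)
```

Stitching the modalities into a single image lets one image classifier consume all three inputs at once, which is one plausible reading of why the fused model can remain lightweight. In practice the placeholder linear layer would be replaced by a trained network.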