Fish and shrimp species classification is a practical need in aquaculture. Traditional classification methods extract single-modality features from images and train on downstream datasets, but they require manually annotated image data and significant training time. To address these issues, this paper introduces CLIP-FSSC (Contrastive Language–Image Pre-training for Fish and Shrimp Species Classification), a method for zero-shot prediction with a pre-trained model. The proposed method classifies fish and shrimp species in aquaculture using a multimodal pre-trained model that takes semantic text descriptions as the supervision signal for images during transfer learning. On the downstream fish dataset, we use natural-language labels for three fish species: grass carp, common carp, and silver carp. We extract text category features with a Transformer and compare three CLIP image backbones: Vision Transformer (ViT), ResNet50, and ResNet101, against strong previous methods. Zero-shot prediction on samples of the three fish species achieves classification accuracy similar to, or better than, models trained on the downstream fish datasets: our experiments reach an accuracy of 98.77% with no new training required. This shows that using the semantic text modality as the label for the image modality can effectively classify fish species. To demonstrate the method's effectiveness on other aquaculture species, we collected two shrimp datasets, prawn and cambarus, and through zero-shot prediction achieve a best classification accuracy of 92.00% on them. Overall, our results demonstrate that a multimodal pre-trained model using semantic text descriptions as an image supervision signal for transfer learning can classify fish and shrimp species with high accuracy, while reducing the need for manual annotation and training time.
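
For illustration, the sketch below shows the kind of zero-shot pipeline described above, using the open-source CLIP package: class names are embedded as text, a query image is embedded by the vision backbone, and the most similar text embedding gives the prediction. The prompt template "a photo of a ...", the ViT-B/32 checkpoint, and the image path are assumptions for this example, not details taken from the paper.

```python
import torch
import clip
from PIL import Image

# Load a pre-trained CLIP model; the paper also evaluates RN50 and RN101 backbones.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Natural-language labels for the three fish classes. The prompt template
# is an assumption for illustration, not necessarily the paper's wording.
class_names = ["grass carp", "common carp", "silver carp"]
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

# Encode one query image; "fish.jpg" is a placeholder path.
image = preprocess(Image.open("fish.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity between the image and each class description,
    # converted to probabilities; the most similar text is the prediction.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])
```

No gradient updates occur anywhere in this pipeline, which is why the approach avoids the annotation and training costs of conventional supervised classifiers; adapting it to the shrimp datasets only requires swapping the class-name list.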