Abstract Computer vision (CV) has been proposed as a powerful technology to collect individual measurements of livestock animals, such as body weight and body condition score. In all these tasks, the first step of image processing is semantic segmentation (SS), i.e., the use of deep neural networks to locate the pixels that belong to the animal body while removing the background, which may add noise and information not needed to compute body biometrics. Despite the generalization abilities of CV, SS models often need to be retrained on the specific dataset at hand to maximize performance, and thus require labor-intensive annotations. With the rise of foundation models such as GPT-4, LLaMA, and DALL-E, we aimed to explore whether a foundation model for SS, called SegGPT, performs well in a highly specific agricultural scenario compared with a model trained on domain-specific data, i.e., a U-net model trained on our datasets. Our evaluation was carried out over 9 different datasets of top-down depth images of the bodies of calves and cows. The combined datasets amount to a total of 4,328 images (average count = 541) from 485 animals (average count = 54) from 2 wk to 7 yr of age, and they were collected with a mix of RealSense D435 and Kinect V2 sensors in several farm settings (entering/exiting the milking parlor, in a chute, or during weighing on a scale or weighing cart). We investigated the performance of these two models under several scenarios: training on a single dataset and validating within the same dataset (Same Dataset Internal Validation; SDIV), training on a single dataset and validating on an external dataset (Single Dataset External Validation; SDEV), and training on multiple datasets and validating on an external dataset (Multiple Datasets External Validation; MDEV). For the U-net trained under the SDIV approach, we used a 70/30 train/test split and trained for 100 epochs.
SegGPT and U-net presented intersection over union (IoU) values of 0.84 and 0.95 (SDIV), 0.73 and 0.59 (SDEV), and 0.84 and 0.91 (MDEV), respectively. U-net was not able to segment the animal body with performance similar to SegGPT on a new dataset when trained with a single dataset, but it outperformed SegGPT when trained with multiple datasets. Although U-net performed slightly better in those scenarios, it is important to highlight that SegGPT received only one prompt (one depth image with its annotation) per dataset and performed reasonably well on unseen (validation) datasets. In conclusion, foundation models for image processing tasks can be an alternative to training domain-specific deep neural networks, which demand labor, annotation, and large datasets to achieve satisfactory results. However, high performance in similar agricultural scenarios still requires domain-specific models.
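For readers unfamiliar with the evaluation metric reported above, the following is a minimal sketch of how intersection over union (IoU) is typically computed for binary segmentation masks; the function name and toy masks are illustrative assumptions, not code from the study.

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over union for boolean masks (True = animal-body pixel)."""
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    # Convention: two empty masks are a perfect match.
    return float(intersection / union) if union > 0 else 1.0

# Toy example: the masks agree on 1 of the 3 pixels marked in either mask,
# so IoU = 1/3.
pred = np.array([[True, True], [False, False]])
truth = np.array([[True, False], [True, False]])
print(round(iou(pred, truth), 4))  # → 0.3333
```

A higher IoU indicates closer agreement between the predicted mask and the annotated ground truth, with 1.0 meaning a pixel-perfect segmentation.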