Abstract

Automatic non-contact estimation of pig weight avoids stressing the animals and helps prevent the spread of swine fever. Many recent works employ convolutional neural networks to extract deep features for regressing pig weight from a single modality, either RGB or depth images. However, a single modality may be insufficient for pig-weight estimation, since the two modalities complement each other in representing the spatial body information of pigs. In this paper, we propose a two-stream cross-attention vision Transformer for regressing pig weight from both RGB and depth images. Specifically, we employ two separate Swin Transformers to extract texture appearance information and spatial structure information from the RGB and depth images, respectively. Meanwhile, we design cross-attention blocks to learn mutual-modal representations across the two modalities. Finally, we construct a feature fusion layer that combines the features from both streams for regressing pig weight. For the experiments, we collect a new dataset of paired RGB-D pig images, containing 10,263 RGB-D pairs for training and 5,203 for testing. Comprehensive comparative results show that the proposed method yields the best performance on this dataset, with a mean absolute error of 3.237.
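
To make the two-stream design concrete, below is a minimal PyTorch sketch of the architecture described in the abstract. It assumes torchvision's swin_t as the Swin Transformer backbone (the abstract does not specify the variant); the single cross-attention block per stream, the mean-pooling of tokens, the depth-channel replication, and the head sizes are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
from torchvision.models import swin_t

class CrossAttentionBlock(nn.Module):
    """One modality's tokens query the other's tokens (keys/values)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, q_tokens, kv_tokens):
        q = self.norm_q(q_tokens)
        kv = self.norm_kv(kv_tokens)
        out, _ = self.attn(q, kv, kv)
        return q_tokens + out  # residual connection

class TwoStreamWeightRegressor(nn.Module):
    def __init__(self, dim: int = 768):  # 768 = final Swin-T channel width
        super().__init__()
        # Two separate Swin backbones: one for RGB, one for depth.
        self.rgb_backbone = swin_t(weights=None).features
        self.depth_backbone = swin_t(weights=None).features
        # Cross-attention in both directions between the streams.
        self.rgb_from_depth = CrossAttentionBlock(dim)
        self.depth_from_rgb = CrossAttentionBlock(dim)
        # Fusion layer: concatenate pooled features, regress a scalar weight.
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, 1))

    def forward(self, rgb, depth):
        # torchvision Swin features are channels-last (B, H, W, C);
        # flatten the spatial grid into a token sequence (B, H*W, C).
        rgb_tok = self.rgb_backbone(rgb).flatten(1, 2)
        # Replicate the 1-channel depth map to 3 channels to reuse the stem.
        depth_tok = self.depth_backbone(depth.expand(-1, 3, -1, -1)).flatten(1, 2)
        # Each stream attends over the other modality's tokens.
        rgb_fused = self.rgb_from_depth(rgb_tok, depth_tok)
        depth_fused = self.depth_from_rgb(depth_tok, rgb_tok)
        fused = torch.cat([rgb_fused.mean(dim=1), depth_fused.mean(dim=1)], dim=-1)
        return self.head(fused).squeeze(-1)  # one predicted weight per pig

model = TwoStreamWeightRegressor()
rgb = torch.randn(2, 3, 224, 224)    # batch of RGB images
depth = torch.randn(2, 1, 224, 224)  # paired single-channel depth maps
print(model(rgb, depth).shape)       # torch.Size([2])

Training such a model as a regressor would typically minimize an L1 loss against ground-truth weights, which directly matches the mean-absolute-error metric reported above.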
