Abstract

Automatic non-contact estimation of pig weight avoids stressing the animals and helps prevent the spread of swine fever. Many recent works employ convolutional neural networks to extract deep features for regressing pig weight from a single modality, either RGB or depth images. However, a single modality may be insufficient for pig-weight estimation, since the two modalities complement each other in representing the spatial body information of pigs. In this paper, we propose a two-stream cross-attention vision Transformer for regressing pig weight from both RGB and depth images. Specifically, we employ two separate Swin Transformers to extract texture appearance information and spatial structure information from the RGB and depth images, respectively. Meanwhile, we design cross-attention blocks to learn mutual-modal representations across the two modalities. Finally, we construct a feature fusion layer that combines the features from both streams for regressing pig weight. For the experiments, we collect a new dataset of paired RGB-D pig images, containing 10,263 RGB-D pairs for training and 5,203 for testing. Comprehensive comparative results show that the proposed method yields the best performance on this dataset, with a mean absolute error of 3.237.
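
To make the two-stream design concrete, below is a minimal PyTorch sketch of the architecture described in the abstract. It assumes torchvision's swin_t as the Swin Transformer backbone (the abstract does not specify the variant); the single cross-attention block per stream, the mean-pooling of tokens, the depth-channel replication, and the head sizes are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
from torchvision.models import swin_t

class CrossAttentionBlock(nn.Module):
    """One modality's tokens query the other's tokens (keys/values)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, q_tokens, kv_tokens):
        q = self.norm_q(q_tokens)
        kv = self.norm_kv(kv_tokens)
        out, _ = self.attn(q, kv, kv)
        return q_tokens + out  # residual connection

class TwoStreamWeightRegressor(nn.Module):
    def __init__(self, dim: int = 768):  # 768 = final Swin-T channel width
        super().__init__()
        # Two separate Swin backbones: one for RGB, one for depth.
        self.rgb_backbone = swin_t(weights=None).features
        self.depth_backbone = swin_t(weights=None).features
        # Cross-attention in both directions between the streams.
        self.rgb_from_depth = CrossAttentionBlock(dim)
        self.depth_from_rgb = CrossAttentionBlock(dim)
        # Fusion layer: concatenate pooled features, regress a scalar weight.
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, 1))

    def forward(self, rgb, depth):
        # torchvision Swin features are channels-last (B, H, W, C);
        # flatten the spatial grid into a token sequence (B, H*W, C).
        rgb_tok = self.rgb_backbone(rgb).flatten(1, 2)
        # Replicate the 1-channel depth map to 3 channels to reuse the stem.
        depth_tok = self.depth_backbone(depth.expand(-1, 3, -1, -1)).flatten(1, 2)
        # Each stream attends over the other modality's tokens.
        rgb_fused = self.rgb_from_depth(rgb_tok, depth_tok)
        depth_fused = self.depth_from_rgb(depth_tok, rgb_tok)
        fused = torch.cat([rgb_fused.mean(dim=1), depth_fused.mean(dim=1)], dim=-1)
        return self.head(fused).squeeze(-1)  # one predicted weight per pig

model = TwoStreamWeightRegressor()
rgb = torch.randn(2, 3, 224, 224)    # batch of RGB images
depth = torch.randn(2, 1, 224, 224)  # paired single-channel depth maps
print(model(rgb, depth).shape)       # torch.Size([2])

Training such a model as a regressor would typically minimize an L1 loss against ground-truth weights, which directly matches the mean-absolute-error metric reported above.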
