Rotation-equivariant transformer for oriented person detection of overhead fisheye images

You Zhou, Yong Bai, Yongqing Chen

doi:10.1007/s40747-023-01176-3

Abstract

Overhead fisheye images can be used for person detection in intelligent monitoring systems. Unlike in horizontal-view images, people in overhead fisheye images can appear in any orientation. When an object is rotated, the feature maps produced by convolutional neural networks vary nonlinearly and lose many orientation features. A transformer can learn the orientation relationships between features. However, a transformer cannot directly extract orientation features, and its effectiveness in detecting small objects needs to be improved. In this paper, we propose a novel rotation-equivariant transformer backbone network, which combines group-equivariant convolution with the Swin Transformer to solve these problems. In our proposed model, the rotation-equivariant feature map extracted by group-equivariant convolution contains a large number of orientation features in multiple directions. Before window self-attention is computed, features in different directions are aggregated to enhance the communication of orientation features. We propose the equivariant-group relation module to evaluate the similarity of the equivariant-group and calculate the aggregation weights. Our network architecture uses a multi-level receptive field structure that expands the local receptive field to enhance the detection of small objects. Experiments validate that our model achieves state-of-the-art performance on the fisheye image datasets MW-R, HABBOF, and CEPDOF.
Compared with the Swin Transformer, the accuracy of our model is improved by 0.3%, 0.5%, and 1.3%, and the accuracy of small object detection in the CEPDOF dataset is improved by 0.73%.
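
As a rough illustration of the two ideas named in the abstract, the following minimal sketch lifts an image to four orientation channels with a group-equivariant correlation and then aggregates those channels with similarity-based weights. It is not the authors' implementation. The choice of the four-rotation group C4, the cosine-similarity score, the softmax weighting, and all function names (`lift_c4`, `aggregation_weights`, `aggregate`) are assumptions made only for this example.

```python
import math

def rot90(grid):
    """Rotate a square 2D list by 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[j][n - 1 - i] for j in range(n)] for i in range(n)]

def correlate(image, kernel):
    """'Valid' 2D cross-correlation of a square image with a square kernel."""
    k = len(kernel)
    m = len(image) - k + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(m)]
            for i in range(m)]

def lift_c4(image, kernel):
    """Lifting correlation over C4 (assumed group): one map per rotated kernel."""
    maps = []
    for _ in range(4):
        maps.append(correlate(image, kernel))
        kernel = rot90(kernel)
    return maps

def cosine(u, v):
    """Cosine similarity of two flat vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def aggregation_weights(maps):
    """Hypothetical relation score: softmax of each map's mean cosine similarity."""
    flat = [[x for row in m for x in row] for m in maps]
    scores = [sum(cosine(f, g) for g in flat) / len(flat) for f in flat]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(maps):
    """Weighted sum of the orientation maps using the weights above."""
    w = aggregation_weights(maps)
    size = len(maps[0])
    return [[sum(w[r] * maps[r][i][j] for r in range(len(maps)))
             for j in range(size)]
            for i in range(size)]

# Toy data, chosen arbitrarily; the kernel has no rotational symmetry.
image = [[1, 2, 0, 3, 1],
         [0, 1, 4, 1, 2],
         [2, 0, 1, 0, 3],
         [1, 3, 2, 1, 0],
         [0, 1, 1, 2, 4]]
kernel = [[1, 2, 0],
          [0, 1, 0],
          [0, 0, -1]]

maps = lift_c4(image, kernel)             # features of the original image
maps_rot = lift_c4(rot90(image), kernel)  # features of the rotated image
```

In this sketch, rotating the input by 90 degrees rotates each orientation map and cyclically shifts the orientation channels, so `maps_rot[r]` equals `rot90(maps[(r - 1) % 4])`. Because the weights depend only on similarities between channels, they shift in the same way, and the aggregated map of the rotated image equals the rotated aggregated map up to floating-point error.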

Full Text