Abstract

Scene perception technology helps robots identify the target areas that people refer to, contributing to human–robot interaction, semantic navigation, and related tasks. Currently, methods based purely on semantic features are insufficient to fully describe diverse indoor scenes, resulting in a high confusion rate and inconsistent performance across classes. To overcome these problems, style information is combined with the common semantic feature to form a more elaborate description of a scene, and a corresponding network is proposed. First, a convolutional network is adopted to extract base features. Then, the high-level feature maps are taken out and divided into overlapping units to preserve more appropriate neighbour correlations. Next, two branches are proposed to acquire the style and semantic information, respectively. In the style branch, the Gram matrix is applied to each unit. In the semantic branch, the units of the high-level feature maps are used directly. In both branches, batch normalization and vector embedding are applied to the flattened elements of the unit sets to unify feature strength and introduce a compressed representation. The two compressed representations are then combined into a compound expression that describes a scene. A multi-head self-attention structure is introduced to correlate and reinforce the multi-divided information, further forming a dominant and refined stylized semantic description. Finally, scene classification is implemented by a multi-layer fully connected network. The experiments adopt a paradigm of once learning and cross-environment inference, which is closer to practical applications. The proposed method outperforms several popular methods in the field of robotics and, in particular, exhibits the smallest classification bias. In addition, the necessity of the main components of the method is evaluated, and a semantic explanation is presented.
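The two core operations the abstract describes — dividing high-level feature maps into overlapping units and computing a Gram matrix per unit as the style descriptor — can be sketched as follows. This is a minimal NumPy illustration under assumed settings (square units, a chosen stride smaller than the unit size to create overlap, and mean normalization of the Gram matrix); the paper's actual unit size, stride, and normalization are not specified in the abstract.

```python
import numpy as np

def divide_into_units(feats, unit, stride):
    """Split C x H x W feature maps into overlapping spatial units.

    feats: array of shape (C, H, W).
    unit: spatial size of each square unit (assumed square here).
    stride: step between unit origins; stride < unit yields overlap.
    Returns a list of (C, unit, unit) blocks.
    """
    C, H, W = feats.shape
    units = []
    for i in range(0, H - unit + 1, stride):
        for j in range(0, W - unit + 1, stride):
            units.append(feats[:, i:i + unit, j:j + unit])
    return units

def gram_matrix(unit_feats):
    """Style descriptor of one unit: channel-wise Gram matrix."""
    C = unit_feats.shape[0]
    F = unit_feats.reshape(C, -1)   # C x (unit*unit)
    return F @ F.T / F.shape[1]     # C x C, normalized by spatial size

# Toy example: 8-channel 6x6 feature maps, 4x4 units, stride 2 (overlapped).
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 6, 6))
units = divide_into_units(feats, unit=4, stride=2)
grams = [gram_matrix(u) for u in units]
print(len(units), grams[0].shape)  # 4 units, each with an 8x8 Gram matrix
```

In the full method these per-unit Gram matrices (style branch) and the raw units (semantic branch) would each be flattened, batch-normalized, and embedded before being combined and fed to the self-attention stage; those stages are omitted here.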
