Abstract
When robots build indoor environmental semantic maps with visual SLAM (VSLAM), label classification accuracy and mapping precision degrade when image feature points are sparse. To address this, this paper proposes an indoor three-dimensional semantic VSLAM algorithm based on the Mask Region-based Convolutional Neural Network (Mask RCNN). First, the Oriented FAST and Rotated BRIEF (ORB) algorithm is used to extract image feature points. Second, the Random Sample Consensus (RANSAC) algorithm is employed to eliminate mismatched points and estimate the camera's pose change. Then, the Mask RCNN is adapted by partially adjusting its hyperparameters and fine-tuned on a self-made dataset through transfer learning, enabling real-time object detection and instance segmentation of the scene. A three-dimensional semantic map is then constructed in combination with the VSLAM algorithm. The semantic information in the environment not only improves the accuracy of map construction and localization, but also reduces the impact of object movement on mapping by marking movable objects. Meanwhile, the VSLAM algorithm is used to compute positional constraints between objects, improving the accuracy of semantic understanding. Finally, comparisons with other methods show that the proposed approach is more accurate and effective, and verify that it can correctly interpret the semantic information in the environment to construct three-dimensional semantic maps.
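As a rough illustration of the geometric front end described above, the sketch below chains ORB feature extraction, descriptor matching, and RANSAC-based essential-matrix fitting to recover the relative camera pose between two frames with OpenCV. The specific settings (nfeatures=1000, the 0.75 ratio-test threshold, the RANSAC parameters) are illustrative assumptions, not the paper's exact configuration.

```python
import cv2
import numpy as np

def estimate_pose(img1, img2, K):
    """Relative pose between two frames: ORB features -> matching ->
    RANSAC outlier rejection -> essential-matrix pose recovery.
    K is the 3x3 camera intrinsic matrix."""
    # 1. Extract ORB keypoints and descriptors (Oriented FAST + Rotated BRIEF).
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # 2. Match binary descriptors with Hamming distance and keep pairs
    #    that pass a ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # 3. RANSAC rejects mismatched pairs while fitting the essential matrix.
    E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K,
                                          method=cv2.RANSAC,
                                          prob=0.999, threshold=1.0)

    # 4. Recover the relative rotation R and unit-scale translation t
    #    from the inlier correspondences.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
    return R, t, inlier_mask
```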
Highlights
The problem of a robot incrementally building a map of an unknown environment while simultaneously using that map to estimate and update its own position is called the Simultaneous Localization and Mapping (SLAM) problem
Visual Simultaneous Localization and Mapping (VSLAM) is the SLAM problem in which a camera is the only sensor
Whelan et al. proposed Semantic Fusion, a convolutional-neural-network-based method for constructing dense three-dimensional semantic maps; it relies on the Elastic Fusion SLAM algorithm for inter-frame pose estimation from indoor RGB-D video and uses a convolutional neural network to predict pixel-level object category labels, combining a Bayesian update strategy with a conditional random field model
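The per-element Bayesian update mentioned in this highlight can be sketched as follows: each map element keeps a class-probability vector that is multiplied by the CNN prediction from the current viewpoint and renormalised. The array layout and the toy class set are illustrative assumptions, not the cited system's implementation.

```python
import numpy as np

def bayesian_label_update(map_probs, frame_probs):
    """Fuse a new per-pixel CNN prediction into the class distributions
    stored for the corresponding map elements (one row per element).

    map_probs   : (N, C) current class probabilities held in the map
    frame_probs : (N, C) CNN softmax output for the pixels observing them
    """
    # Recursive Bayesian update: posterior is proportional to prior * likelihood.
    fused = map_probs * frame_probs
    # Renormalise each element's distribution so it sums to one.
    fused /= fused.sum(axis=1, keepdims=True) + 1e-12
    return fused

# Toy example: two map elements, three classes ("wall", "chair", "floor").
prior = np.array([[0.4, 0.4, 0.2],
                  [0.3, 0.3, 0.4]])
cnn   = np.array([[0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
print(bayesian_label_update(prior, cnn))
```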
Summary
The problem of a robot incrementally building a map of an unknown environment while simultaneously using that map to estimate and update its own position is called the Simultaneous Localization and Mapping (SLAM) problem. Whelan et al. proposed Semantic Fusion, a convolutional-neural-network-based method for constructing dense three-dimensional semantic maps; it relies on the Elastic Fusion SLAM algorithm for inter-frame pose estimation from indoor RGB-D video and uses a convolutional neural network to predict pixel-level object category labels, combining a Bayesian update strategy with a conditional random field model. By probabilistically fusing the CNN predictions obtained from different viewpoints, it generates a dense three-dimensional map containing semantic information [10]. AP stands for Average Precision, an indicator of the accuracy of an object detection algorithm
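Since the summary introduces AP, the sketch below shows how Average Precision is commonly computed as the area under a precision-recall curve with all-point interpolation (as in PASCAL VOC 2010+). The toy recall/precision values are illustrative and are not results from the paper.

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve,
    using all-point interpolation."""
    # Pad the curve so it spans recall 0..1.
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum precision * recall-step at the points where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy example: detections sorted by confidence yield this PR curve.
recalls    = np.array([0.2, 0.4, 0.4, 0.6, 0.8])
precisions = np.array([1.0, 1.0, 0.67, 0.75, 0.8])
print(round(average_precision(recalls, precisions), 3))
```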