In this article, we propose augmented reality semiautomatic labeling (ARS), a semiautomatic method that leverages a 2-D camera moved by a robot, providing precise camera tracking, and an augmented reality pen (ARP) to define the initial object bounding box, in order to create large labeled data sets with minimal human intervention. By removing from humans the burden of generating annotated data, we make deep learning applied to computer vision, which typically requires very large data sets, truly automated and reliable. With the ARS pipeline, we effortlessly created two novel data sets, one on electromechanical components (industrial scenario) and the other on fruits (daily-living scenario), and robustly trained two state-of-the-art object detectors based on convolutional neural networks, namely you only look once (YOLO) and single shot detector (SSD). Whereas conventional manual annotation of 1000 frames took us slightly more than 10 h, the proposed ARS approach annotates 9 sequences of about 35 000 frames in less than 1 h, a gain factor of about 450. Moreover, both the precision and the recall of object detection are increased by about 15% with respect to manual labeling. All our software is available as a robot operating system (ROS) package in a public repository, alongside the novel annotated data sets.

Note to Practitioners—This article was motivated by the lack of a simple and effective solution for generating data sets usable to train a data-driven model, such as a modern deep neural network, so as to make such models accessible in an industrial environment. Specifically, a deep learning vision system for robot guidance would require such a large number of manually labeled images that it would be too expensive and impractical for a real use case, where system reconfigurability is a fundamental requirement. With our system, on the other hand, especially in the field of industrial robotics, the cost of image labeling can be reduced, for the first time, to nearly zero, thus paving the way for self-reconfiguring systems with very high performance (as demonstrated by our experimental results). One limitation of this approach is the need for a manual method to detect the objects of interest in the preliminary stage of the pipeline (ARP or graphical interface). A feasible extension, related to the field of collaborative robotics, would be to exploit the robot itself, manually guided by the user, for this preliminary stage as well, so as to eliminate this source of inaccuracy.
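The abstract does not detail how the initial bounding box is propagated to the remaining frames, but the underlying idea (projecting an object region defined once in the world frame into every image using the robot-tracked camera pose) can be sketched as follows. This is a minimal illustration assuming a calibrated pinhole camera; the function names, intrinsics, and pose values are hypothetical and are not taken from the ARS ROS package.

```python
# Minimal sketch: propagate a 3-D bounding box, defined once in the world frame
# (e.g., via the ARP), into each camera frame using the robot-tracked pose.
# Names and numbers are illustrative, not the actual ARS/ROS API.
import numpy as np

def project_points(points_world, T_world_cam, K):
    """Project 3-D points (N, 3) into pixel coordinates given pose and intrinsics."""
    T_cam_world = np.linalg.inv(T_world_cam)            # camera <- world transform
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (T_cam_world @ pts_h.T)[:3]               # 3 x N, in the camera frame
    pts_img = K @ pts_cam                                # pinhole projection
    return (pts_img[:2] / pts_img[2]).T                  # (N, 2) pixel coordinates

def box_2d_label(corners_world, T_world_cam, K):
    """Axis-aligned 2-D box (x_min, y_min, x_max, y_max) for one tracked frame."""
    uv = project_points(corners_world, T_world_cam, K)
    return uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()

# Example: a 10 cm cube placed once in the world frame, camera 0.5 m in front of it.
corners = np.array([[x, y, z] for x in (0.0, 0.1) for y in (0.0, 0.1) for z in (0.0, 0.1)])
K = np.array([[600., 0., 320.], [0., 600., 240.], [0., 0., 1.]])
T_world_cam = np.eye(4)
T_world_cam[2, 3] = -0.5                                 # camera position along world z
print(box_2d_label(corners, T_world_cam, K))
```

In a setup of this kind, only the initial 3-D region must be specified by hand; every subsequent frame gets its 2-D label from the robot's pose estimate, which is what makes annotating tens of thousands of frames in under an hour plausible.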