The automatic detection of transmittance surfaces (transparency and translucency) and identification of their material class (glass, plastic, and so on) from an image is crucial for applications including domestic service robotics and the handling of fragile instruments in experimental or industrial settings. However, this is a difficult task because such materials lack expressive textures. Although previous approaches for detecting transmittance surfaces propose handcrafted features, they disregard global material information and are limited to a single item (usually glass). Material- or object-recognition algorithms based on convolutional neural networks (CNNs) focus on optimizing performance, which requires large labeled datasets. These algorithms are better suited to opaque objects because large architectures tend to generalize from the texture of an image's background layer. Furthermore, creating a large-scale off-the-shelf dataset labeled with transmittance surfaces is an extremely tedious task. Thus, we propose a simple yet effective and efficient learning model that combines a shallow multitask vision transformer (ViT) with scale-invariant feature transform (SIFT) features and a shallow CNN as the backbone network. SIFT and ViT capture local and global discriminative features, respectively, and their fusion increases performance on a small dataset. The backbone network introduces the desirable properties of CNNs. Using limited training data, this model detects transmittance surfaces and determines their material type. We conducted tests on a new benchmark dataset containing a variety of transmissive materials and on three publicly available datasets for robust comparison. Results show that the proposed model achieved a new state-of-the-art accuracy of 81.78% and 77.21% on our new dataset for transmittance surface recognition and material class identification, respectively.
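The fusion of local (SIFT) and global (ViT) features with two task heads can be illustrated with a minimal NumPy sketch. Note that this is our reading of the abstract, not the paper's implementation: the random vectors stand in for real SIFT descriptors and the ViT embedding, and all dimensions, the mean-pooling step, and the linear heads are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for real features (assumed shapes, not from the paper):
# SIFT yields a variable number of 128-D local descriptors per image;
# the shallow ViT yields one global embedding (192-D assumed here).
sift_descriptors = rng.standard_normal((37, 128))  # e.g., 37 keypoints found
vit_embedding = rng.standard_normal(192)

# Local branch: order-invariant pooling over the variable keypoint set.
local_feat = sift_descriptors.mean(axis=0)           # shape (128,)

# Fuse local (SIFT) and global (ViT) features by concatenation.
fused = np.concatenate([local_feat, vit_embedding])  # shape (320,)

# Two task heads on the shared fused representation (multitask setup):
# 1) binary transmittance-surface detection, 2) material class (5 assumed).
W_det = rng.standard_normal((2, fused.size)) * 0.01
W_mat = rng.standard_normal((5, fused.size)) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

det_probs = softmax(W_det @ fused)  # transmittance vs. opaque
mat_probs = softmax(W_mat @ fused)  # glass, plastic, ...

print(det_probs.shape, mat_probs.shape)
```

The point of the sketch is the late fusion: pooled local descriptors and the global embedding live in one vector, so both task heads can draw on fine texture cues and scene-level context at once.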