A multi-modal deep network for the segmentation of RGB-D images of clothes

A group of researchers working in Belgium present a deep-learning-based method for segmenting the clothing worn by human models in RGB-D images. The dataset for the method comprises over 50,000 samples of characters wearing different clothing styles, shown in different poses and environments and annotated with a total of 9 semantic classes. A novel multi-modal encoder-decoder convolutional neural network (CNN) is used to segment and identify the clothing.

In computer vision, clothing segmentation assigns each pixel of an image to a specific garment category. This is useful for a variety of applications, including automatic product tagging and 'virtual changing rooms', both of which can enhance the customer experience through better product suggestions and visualisation of how those products would look when worn by the customer.

Figure: Block schematic of the proposed network architecture.
Figure: Samples from the proposed RGB-D dataset.

There have already been several works focusing on the segmentation of clothing, and it is well known that CNNs provide excellent performance on labelling tasks involving large volumes of information. However, until now, these methods have all relied on RGB images only, made up additively of red, green and blue light. The work presented in Electronics Letters finds its originality in being an image segmentation method using RGB-D, a combination of an RGB image with its corresponding depth image. By incorporating depth into their methodology, the authors aim to increase segmentation accuracy in image regions containing difficult textures or shadows.

A multitude of clothing segmentation datasets is already available, but unfortunately these could not be used as they are RGB only, without the corresponding depth images. The research team therefore first had to develop a new dataset made entirely of RGB-D images, yielding one of the two contributions reported in the Letter. Joukovsky et al. developed a data-generation pipeline enabling low-cost production of RGB, depth and ground-truth label maps on a large scale. More than 50,000 3D-rendered samples of characters in various poses, environments and randomised clothing styles have been produced, with 9 categories defined.

This data-generation pipeline was then coupled with the team's second contribution in the Letter, the Multi-modal Deeplab v3+, a deep neural network that performs multi-modal semantic segmentation and behaves as an RGB-D extension of the pre-existing Deeplab v3+ baseline. The network is formed of a 2-encoder-1-decoder set-up with feature fusion modules. The model was trained and quantitatively evaluated using data produced by the previously mentioned generation pipeline.

The network uses multi-modal fusion (MMF) combined with an atrous spatial pyramid pooling (ASPP) module. This so-called MMF-ASPP efficiently merges the RGB and depth modalities, giving rise to a 3.4% mean intersection-over-union score improvement over MMF alone, showing that the inclusion of image depth can lead to improvements in performance. The network was also tested on data gathered from a Kinect v2, where the depth input had been pre-processed using a depth inpainting method. The network demonstrated good generalisation to multi-subject inputs and varying postures, even when trained on single-subject images. This could prove to be interesting for future works.
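To illustrate the general 2-encoder-1-decoder idea described above, the following is a minimal sketch in PyTorch. It is not the authors' Multi-modal Deeplab v3+ (in particular, it omits the ASPP and MMF-ASPP modules); the layer sizes, the concatenation-based fusion and the class name TwoStreamSegNet are illustrative assumptions.

```python
# Minimal sketch (not the authors' Multi-modal Deeplab v3+): a generic
# two-encoder / one-decoder network that fuses RGB and depth features
# before predicting per-pixel logits for 9 clothing classes.
# Layer sizes and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions, the first with stride-2 downsampling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class TwoStreamSegNet(nn.Module):
    def __init__(self, num_classes=9):
        super().__init__()
        # Separate encoders for the RGB image (3 channels) and depth map (1 channel).
        self.rgb_encoder = nn.Sequential(conv_block(3, 64), conv_block(64, 128))
        self.depth_encoder = nn.Sequential(conv_block(1, 64), conv_block(64, 128))
        # Simple fusion: concatenate both feature maps and mix with a 1x1 conv.
        self.fuse = nn.Conv2d(256, 128, kernel_size=1)
        # A single decoder upsamples the fused features back to input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, rgb, depth):
        f = torch.cat([self.rgb_encoder(rgb), self.depth_encoder(depth)], dim=1)
        return self.decoder(self.fuse(f))  # (B, num_classes, H, W) logits


# Example: one 256x256 RGB-D sample.
logits = TwoStreamSegNet()(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
print(logits.shape)  # torch.Size([1, 9, 256, 256])
```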
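The quoted 3.4% gain is in mean intersection-over-union (mIoU). As a reminder of what that metric measures, here is a small illustration that averages per-class IoU between a predicted and a ground-truth label map; the authors' exact evaluation protocol may differ.

```python
# Illustration of the mean intersection-over-union (mIoU) metric quoted in the
# Letter; the authors' exact evaluation protocol may differ.
import numpy as np


def mean_iou(pred, target, num_classes=9):
    """Average per-class IoU between integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # ignore classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))


# Toy example with two 4x4 label maps and 2 classes.
pred = np.array([[0, 0, 1, 1]] * 4)
target = np.array([[0, 1, 1, 1]] * 4)
print(mean_iou(pred, target, num_classes=2))  # ~0.58
```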
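The Letter does not specify which depth-inpainting method was applied to the Kinect v2 depth maps before segmentation. The sketch below shows one common option, filling missing (zero-valued) depth pixels with OpenCV's Telea inpainting; the function name inpaint_depth and the scaling choices are assumptions.

```python
# One common way (an assumption, not the Letter's method) to pre-process raw
# Kinect v2 depth maps: fill zero-valued (missing) pixels via Telea inpainting.
import cv2
import numpy as np


def inpaint_depth(depth_mm):
    """Fill holes (zeros) in a uint16 Kinect-style depth map given in millimetres."""
    holes = (depth_mm == 0).astype(np.uint8)  # mask of missing pixels
    # cv2.inpaint works on 8-bit images, so scale depth to [0, 255] first.
    scale = max(int(depth_mm.max()), 1)
    depth_8u = cv2.convertScaleAbs(depth_mm, alpha=255.0 / scale)
    filled_8u = cv2.inpaint(depth_8u, holes, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
    # Rescale back to the original millimetre range.
    return filled_8u.astype(np.float32) * scale / 255.0


depth = np.random.randint(500, 4500, (480, 640), dtype=np.uint16)
depth[200:220, 300:340] = 0  # simulate sensor dropout
print(inpaint_depth(depth).shape)  # (480, 640)
```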
The research group has shown that their neural network can segment clothing on a humanoid figure more efficiently and effectively by combining depth data with the RGB image. They have also produced a data-generation pipeline that future research groups can use to create RGB-D datasets more easily. The pipeline could be improved by extending the proposed dataset with images containing occlusions and noisy depth data, which could then be used to train the neural network more extensively. With some additions, the data-generation pipeline could also be used to generate pose-estimation datasets for further research.