Abstract

First-person videos and games are the central paradigm of camera positioning when using Head-Mounted Displays (HMDs). In these situations, the user's hands and arms play a fundamental role in the sense of self-presence and in the interface. While rendering them is straightforward on Augmented Reality devices or with depth cameras attached to the HMD, it is not trivial with regular HMDs, such as those based on smartphones. This work proposes the use of semantic image segmentation with Fully Convolutional Networks to extract the user's hands and arms from raw images captured by regular cameras positioned in a first-person visual scheme. We first create a training dataset of 4041 images and a validation dataset of 322 images, both labeled with arm and non-arm pixels and focused on the egocentric view. Then, based on two important semantic segmentation architectures, PSPNet and DeepLab, we propose a specific calibration for the particular scenario of hands and arms captured from an HMD perspective. Our results show that PSPNet produces more detailed segmentation, while DeepLab achieves the best inference-time performance. Training with our egocentric dataset yields better arm segmentation than using images from different, more general perspectives.
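
As a rough illustration of the binary arm/non-arm inference step described above, the sketch below runs an off-the-shelf DeepLabV3 model from torchvision with two output classes. This is an assumption-laden stand-in, not the authors' calibrated PSPNet/DeepLab networks; the model choice, preprocessing, and class index are hypothetical.

```python
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

# Hypothetical sketch: a generic DeepLabV3 backbone with 2 classes
# (0 = background, 1 = arm/hand). The paper's actual networks are
# PSPNet and DeepLab variants calibrated on the egocentric dataset.
model = deeplabv3_resnet50(num_classes=2)
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # standard ImageNet stats (assumed)
                         std=[0.229, 0.224, 0.225]),
])

def segment_arms(pil_image):
    """Return a boolean mask where True marks predicted arm/hand pixels."""
    x = preprocess(pil_image).unsqueeze(0)        # shape (1, 3, H, W)
    with torch.no_grad():
        logits = model(x)["out"]                  # shape (1, 2, H, W)
    return logits.argmax(dim=1).squeeze(0) == 1   # class 1 assumed to be "arm"
```

In a first-person HMD pipeline, a mask like this would be used to composite the camera pixels of the user's hands and arms over the rendered virtual scene.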
