Abstract

Explicit structural inference is a key means of improving the accuracy of scene parsing, while adversarial training can reinforce spatial contiguity in output segmentations. To exploit the advantages of structural learning and adversarial training simultaneously, we propose a novel deep network architecture, Structural Inference Embedded Adversarial Networks (SIEANs), for pixel-wise scene labeling. The generator of SIEANs, a newly designed scene parsing network, combines convolutional neural networks with long short-term memory networks to learn the global contextual information of objects in four different directions from RGB(-D) images, describing the (three-dimensional) spatial distributions of objects more comprehensively and accurately. To further improve performance, we adopt adversarial training to optimize the generator together with a discriminator, which not only detects and corrects higher-order inconsistencies between the predicted segmentations and the corresponding ground truths, but also exploits the full capacity of the generator by fine-tuning its parameters toward higher consistency. Experimental results demonstrate that the proposed SIEANs achieve better performance than most state-of-the-art methods on the PASCAL VOC 2012, SIFT FLOW, PASCAL Person-Part, Cityscapes, Stanford Background, NYUDv2, and SUN-RGBD datasets.
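The four-directional structural inference described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy feature grid, the exponential-decay recurrence (standing in for the learned LSTM gates), and the additive fusion are all assumptions chosen to show how context from four sweep directions reaches every pixel.

```python
# Sketch of four-directional context aggregation over a feature grid.
# A simple decayed recurrence replaces the learned LSTM recurrences.

def directional_context(grid, decay=0.5):
    """Return per-pixel context gathered from four sweep directions."""
    h, w = len(grid), len(grid[0])
    # One context map per direction: left->right, right->left,
    # top->bottom, bottom->top.
    ctx = [[[0.0] * w for _ in range(h)] for _ in range(4)]
    for i in range(h):
        for j in range(1, w):                      # left -> right
            ctx[0][i][j] = decay * (ctx[0][i][j - 1] + grid[i][j - 1])
        for j in range(w - 2, -1, -1):             # right -> left
            ctx[1][i][j] = decay * (ctx[1][i][j + 1] + grid[i][j + 1])
    for j in range(w):
        for i in range(1, h):                      # top -> bottom
            ctx[2][i][j] = decay * (ctx[2][i - 1][j] + grid[i - 1][j])
        for i in range(h - 2, -1, -1):             # bottom -> top
            ctx[3][i][j] = decay * (ctx[3][i + 1][j] + grid[i + 1][j])
    # Fuse: each pixel sees its own feature plus context from every direction.
    return [[grid[i][j] + sum(c[i][j] for c in ctx)
             for j in range(w)] for i in range(h)]

feats = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 1.0],
         [0.0, 1.0, 0.0]]
fused = directional_context(feats)
```

After fusion the center pixel, whose own feature is zero, receives context from all four neighbors, which is the intuition behind sweeping the grid in four directions rather than relying on local convolutions alone.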

Highlights

  • Scene parsing, one of the most fundamental tasks in computer vision, aims at predicting a class label for every pixel of input images, which can be beneficial to a wide scope of intelligent applications, including image-to-caption generation [1], robot task planning [2], action recognition [3], self-driving cars [4], and automatic photo adjustment [5].

  • For the PASCAL VOC 2012 dataset, we measure the performance of the Structural Inference Embedded Adversarial Networks (SIEANs) by the mean intersection over union (IoU) [11].

  • In the table, ‘Convolutional Neural Networks (CNNs)’ denotes the scene parsing accuracy achieved by the feature learning layer under standard training; ‘CNNs+long short-term memory networks (LSTMs)’ denotes the accuracy achieved by the structural learning layer under standard training; ‘SIEANs_STD’ denotes the accuracy achieved by the SIEANs under standard training; and ‘SIEANs’ denotes the accuracy achieved by the SIEANs under adversarial training.


Summary

Introduction

Scene parsing, one of the most fundamental tasks in computer vision, aims at predicting a class label for every pixel of input images, which can be beneficial to a wide scope of intelligent applications, including image-to-caption generation [1], robot task planning [2], action recognition [3], self-driving cars [4], and automatic photo adjustment [5]. A real scene always contains multiple categories of objects, and the appearances of objects are diverse. Scene parsing is therefore a challenging pixel-level multi-label classification task, which must attend to the visual appearances of objects while taking into account the spatial dependencies among them. Two key issues affect the accuracy of scene parsing in recent research: (1) how to extract effective representations from the input.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
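The adversarial training described in the abstract pairs a per-pixel segmentation loss with a discriminator term that rewards predictions the discriminator cannot tell apart from ground truth. The following toy computation is only a sketch of that objective: binary cross-entropy stands in for both terms, the flattened probability lists stand in for segmentation maps, and the weight `lam` is a hypothetical balancing hyperparameter, none of which come from the paper.

```python
import math

def bce(p, y):
    """Binary cross-entropy of one predicted probability p against label y."""
    eps = 1e-12  # guard against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def generator_loss(pred_probs, labels, disc_score, lam=0.1):
    """Segmentation loss plus an adversarial term.

    The adversarial term is small when the discriminator scores the
    predicted segmentation as 'real' (disc_score close to 1), so
    minimizing it pushes the generator toward fooling the discriminator.
    """
    seg = sum(bce(p, y) for p, y in zip(pred_probs, labels)) / len(labels)
    adv = bce(disc_score, 1.0)
    return seg + lam * adv
```

With perfect predictions and a fully fooled discriminator the loss is essentially zero, and for fixed predictions the loss shrinks as the discriminator score rises, which is the pressure toward higher-order consistency the abstract refers to.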

