Abstract

Semantic segmentation plays a vital role in the analysis of remotely sensed imagery. Currently, this task is mainly solved using supervised pre-training, where very Deep Convolutional Neural Networks (DCNNs) are trained on large annotated datasets, mostly to solve a classification problem. These networks are useful for many visual recognition tasks but depend heavily on the amount and quality of the annotations to learn a mapping function for predicting on new data. Motivated by the plethora of data generated every day, researchers have developed alternatives such as Self-Supervised Learning (SSL). These methods play a decisive role in advancing deep learning without the need for expensive labeling. They explore the data itself, derive supervision signals from it, and solve a challenge known as a Pretext Task (PTT) to learn robust representations. The learned features are then transferred to solve a so-called Downstream Task (DST), which can be any of a group of computer vision applications such as classification or object detection. The present work explores the design of a DCNN and a training strategy that jointly predict on multiple Pretext Tasks in order to learn general visual representations that can lead to more accurate semantic segmentations. The first Pretext Task is Image Colorization (IC), which identifies the different objects and object parts present in a grayscale image in order to paint those areas with the right colors. The second task is Spatial Context Prediction (SCP), which captures visual similarity across images to discover the spatial configuration of patches generated from an image. The DCNN architecture is designed around the particular objective of each Pretext Task. It is then trained, and its acquired knowledge is transferred into an SSL trunk network, on top of which a Fully Convolutional Network (FCN) is built.
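As a concrete illustration of the Spatial Context Prediction setup described above, the following minimal NumPy sketch (hypothetical, not the paper's implementation; the function name and patch-grid layout are assumptions) cuts a 3x3 grid of patches from an image and pairs the center patch with each of its eight neighbors, labeled by relative position:

```python
import numpy as np

def spatial_context_pairs(image, patch_size):
    """Cut a 3x3 grid of patches from the top-left corner of `image`
    and pair the center patch with each of its 8 neighbors.
    Returns (center, neighbor, position_label) tuples, with labels
    0..7 scanning the grid row by row and skipping the center."""
    p = patch_size
    grid = [[image[r * p:(r + 1) * p, c * p:(c + 1) * p] for c in range(3)]
            for r in range(3)]
    center = grid[1][1]
    pairs, label = [], 0
    for r in range(3):
        for c in range(3):
            if (r, c) == (1, 1):
                continue  # the center is the query patch, not a neighbor
            pairs.append((center, grid[r][c], label))
            label += 1
    return pairs

# Example: a 96x96 single-channel image yields 8 labeled patch pairs.
img = np.arange(96 * 96, dtype=np.float32).reshape(96, 96)
pairs = spatial_context_pairs(img, patch_size=32)
```

A network pre-trained on this task would receive a (center, neighbor) pair and classify the relative position label, which forces it to learn spatial structure without any manual annotation.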
Through fine-tuning, the FCN with the SSL trunk learns a combination of features to ultimately predict the semantic segmentation. To evaluate the quality of the learned representations, the performance of the trained model is compared with the inference results of an FCN-ResNet101 architecture pre-trained on ImageNet, using the F1-Score as the quality metric. Experiments show that the method achieves general feature representations that can indeed be employed for semantic segmentation.
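On the evaluation side, the per-class F1-Score over segmentation label maps can be sketched as follows (a minimal NumPy example assuming integer-valued prediction and ground-truth masks; the function name is illustrative):

```python
import numpy as np

def f1_score(pred, target, cls):
    """Per-class F1 = 2*TP / (2*TP + FP + FN), computed over
    flattened integer label maps for one class id `cls`."""
    p, t = (pred == cls), (target == cls)
    tp = np.logical_and(p, t).sum()   # predicted cls, truly cls
    fp = np.logical_and(p, ~t).sum()  # predicted cls, truly other
    fn = np.logical_and(~p, t).sum()  # missed pixels of cls
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

pred   = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
score = f1_score(pred, target, cls=1)  # TP=2, FP=1, FN=0 -> 0.8
```

Averaging this quantity over all classes gives a single score with which the SSL-pre-trained model and the ImageNet-pre-trained baseline can be compared.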
