SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation.

Abstract

We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1]. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well known DeepLab-LargeFOV [3] and DeconvNet [4] architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and the most memory-efficient inference compared with other architectures.
We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet.
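
The index-based upsampling described in the abstract can be illustrated with a small sketch. This is not the authors' Caffe implementation: it is a minimal pure-Python illustration on a single 2-D feature map, assuming 2x2 non-overlapping pooling windows and even input dimensions. The function names `max_pool_2x2_with_indices` and `max_unpool_2x2` are hypothetical and chosen only for this example.

```python
def max_pool_2x2_with_indices(fmap):
    """2x2 max pooling with stride 2 on a 2-D list of numbers.

    Returns the pooled map and, for every pooled value, the (row, col)
    position it came from in the input. Assumes even height and width.
    """
    rows, cols = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for r in range(0, rows, 2):
        pooled_row, index_row = [], []
        for c in range(0, cols, 2):
            # Collect the four values of the window with their positions.
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, out_rows, out_cols):
    """Place each pooled value back at its remembered position.

    All other positions stay zero, so the result is sparse. No parameters
    are learned in this step; a later trainable convolution would densify it.
    """
    upsampled = [[0.0] * out_cols for _ in range(out_rows)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            upsampled[r][c] = value
    return upsampled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 0, 3, 1],
]
pooled, indices = max_pool_2x2_with_indices(feature_map)
sparse = max_unpool_2x2(pooled, indices, 4, 4)
# pooled is [[4, 5], [6, 7]]
# sparse is [[0, 0, 0, 0], [4, 0, 0, 5], [0, 0, 7, 0], [6, 0, 0, 0]]
```

Only the positions of the maxima need to be stored per pooling window, rather than the full encoder feature map, which is consistent with the memory efficiency the abstract emphasizes.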

Similar Papers
  • Conference Article
  • Citations: 10
  • 10.1109/iccsnt.2016.8070266
SCNet: A simplified encoder-decoder CNN for semantic segmentation
  • Dec 1, 2016
  • Robail Yasrab + 2 more

We present a simplified and novel fully convolutional neural network (CNN) architecture for semantic pixel-wise segmentation named SCNet. Unlike current CNN pipelines, the proposed network uses only convolution layers with no pooling layer. The key objective of this model is to offer a simpler CNN model with equal benchmark performance and results. It is an encoder-decoder based fully convolutional network model. The encoder network is based on the VGG 16-layer network, while the decoder network uses upsampling and deconvolution units followed by a pixel-wise classification layer. The proposed network is simple and offers a reduced search space for segmentation by using low-resolution encoder feature maps. It also offers a large reduction in trainable parameters by reusing the encoder layers' sparse feature maps. The proposed model offers outstanding performance and enhanced results in terms of architectural simplicity, number of trainable parameters and computational time.

  • Conference Article
  • Citations: 11
  • 10.1109/icoei51242.2021.9453022
Seg-Net: Automatic Lung Infection Segmentation of COVID-19 from CT images
  • Jun 3, 2021
  • G Rajakumar + 5 more

COVID-19 is a deadly disease which causes infection in both animals and human beings. It is a zoonotic disease that scatters worldwide in the beginning of the year 2020. COVID-19 is termed as Coronavirus Disease 2019 that makes the whole world to suffer from this existential infection. The lung contamination is found automatically by chest Computed Tomography images that help to tackle COVID-19. During the separation of the diseased portion from the X-ray slices, it produces lots of demands which include huge difference in the disease attribute and low intensity difference in the middle of infected tissue and usual tissues. The collection of huge quantity of information is impossible in a short period of time and pedagogy of the deep model. For overcoming the Lung disease separation of COVID-19 by using Seg-Net is suggested to analyze the affected portions automatically from chest X-ray scan. Here, Convolutional Neural Network (CNN) architecture for semantic pixel-wise segmentation named as Semantic Network is utilized. Seg-Net segmentation is a core trainable engine that contains an encoder network and also a corresponding decoder network that is continued by a pixel-wise classification layer. The structure of the encoder network is physiographic and it is equal with the 13 convolutional layers in the Visual Geometry Group 16 network. The originality of the semantic network is located in this method of decoder up samples with the lower resolution input map features. Exactly, the pooling was applied by the decoder that indicates max pooling process in the corresponding encoder for behaving like the non-linear up sampling. Comprehensive observation in COVID-19 real CT volumes and the SemiSeg are determined and it is suggested that the Semantic network performs the cutting-edge segmentation models, and then it promotes the state in the art presentation.

  • Conference Article
  • Citations: 15
  • 10.31256/ukras19.12
Underwater Scene Segmentation by Deep Neural Network
  • Jan 24, 2019
  • Journal of robotics & autonomous systems
  • Yang Zhou + 5 more

A deep neural network architecture is proposed in this paper for underwater scene semantic segmentation. The architecture consists of encoder and decoder networks. Pretrained VGG-16 network is used as a feature extractor, while the decoder learns to expand the lower resolution feature maps. The network applies max un-pooling operator to avoid large number of learnable parameters, and, in order to make use of the feature maps in encoder network, it concatenates the feature maps with decoder and encoder for lower resolution feature maps. Our architecture shows capabilities of faster convergence and better accuracy. To get a clear view of underwater scene, an underwater enhancement neural network architecture is described in this paper and applied for training. It speeds up the training process and convergence rate in training.

  • Research Article
  • Citations: 32
  • 10.54097/rfa5x119
SegNet Network Architecture for Deep Learning Image Segmentation and Its Integrated Applications and Prospects
  • Feb 26, 2024
  • Academic Journal of Science and Technology
  • Chenwei Zhang + 4 more

Semantic image segmentation is a crucial task in computer vision, with applications ranging from autonomous driving to medical image analysis. In recent years, deep learning has revolutionized this field, leading to the development of various neural network models aimed at improving segmentation accuracy. One such architecture is SegNet, which we explore in this article. SegNet's architecture consists of an encoder network, a corresponding decoder network, and a pixel-wise classification layer. The encoder network, resembling VGG16 with 13 convolutional layers, extracts high-level features from input images. The innovation lies in the decoder network's approach to upsampling, utilizing pooled indices from the encoder's maximum pooling step to perform non-linear upsampling. This eliminates the need for additional learning during upsampling, making SegNet efficient in both storage and computation. SegNet represents an exciting advancement in deep learning image segmentation. Its efficient architecture, memory-conscious design, and potential for real-time applications make it a valuable tool in the field of computer vision with promising integrated applications and prospects.

  • Conference Article
  • Citations: 5
  • 10.1109/iccvw.2019.00241
Cross-Granularity Attention Network for Semantic Segmentation
  • Oct 1, 2019
  • Lingyu Zhu + 3 more

Despite the remarkable progress of semantic segmentation in recent years, much remains to be addressed in order to achieve better semantic coherence and boundary delineation. In this paper, we propose a novel convolutional neural network (CNN) architecture for semantic segmentation which explicitly addresses these two issues. Specifically, we propose a categorical attention mechanism to propagate consistent category-oriented information across multi-granularity contextual interpretations to close the semantic gap residing in CNN feature hierarchy. This novel design alleviates the semantic information loss during the feature combination and transformation process in decoder network. We further integrate a contour branch in our architecture to enhance the boundary awareness of the semantic feature derived in the form of a novel element-wise contour attention at each level of feature hierarchy. Additionally, we introduce a cross-granularity contour enhancement mechanism to propagate rich boundary cues from early layers to deep layers. We perform extensive quantitative evaluations in close proximity to object boundaries which confirms its superior effectiveness in boundary delineation. These novel mechanisms which boost the essentials in segmentation, i.e., region-wise semantic coherence and accurate object contour localization, allow our architecture MeshNet to obtain state-of-the-art performance on two challenging datasets, i.e., PASCAL VOC 2012 and Cityscapes.

  • Conference Article
  • 10.1109/icinfa.2017.8079023
Three-Skips CNN for road scene semantic segmentation
  • Jul 1, 2017
  • Jing Tang + 1 more

In this paper we propose a deep learning architecture to make the best use of global and local information for pixel-wise semantic segmentation. The three-skips CNN architecture is built from the convolutional layers of the VGG16 network and their mirrored convolutional layers. Our architecture targets road scene understanding. In order to save memory and computational time, we use unpooling layers to map low resolution feature maps to the input resolution. We introduce three skip architectures which combine local information and global information to produce accurate and detailed segmentations. In addition, we present the median balance method to deal with the class imbalance problem in road scene datasets. Thorough evaluations on the CamVid dataset demonstrate that our approach has state-of-the-art performance and requires less computational time.
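
The abstract above does not define its "median balance method". A commonly used scheme with a similar name is median frequency balancing, in which each class weight is the median class frequency divided by that class's frequency, so rare classes receive larger loss weights. The sketch below assumes that scheme and may differ from what the paper actually does. The function name `median_frequency_weights` and the input layout (one flat list of class ids per image) are hypothetical.

```python
from collections import Counter
from statistics import median


def median_frequency_weights(label_images):
    """Per-class loss weights via median frequency balancing (assumed scheme).

    label_images: iterable of images, each a flat list of integer class ids.
    freq(c) = pixels of class c / total pixels in images where c appears.
    weight(c) = median of all freq values / freq(c).
    """
    pixel_count = Counter()   # number of pixels labeled with each class
    image_pixels = Counter()  # total pixels of the images containing each class
    for labels in label_images:
        counts = Counter(labels)
        for cls, n in counts.items():
            pixel_count[cls] += n
            image_pixels[cls] += len(labels)
    freq = {cls: pixel_count[cls] / image_pixels[cls] for cls in pixel_count}
    med = median(freq.values())
    return {cls: med / f for cls, f in freq.items()}


# Two tiny "images" of four pixels each; class 0 dominates.
weights = median_frequency_weights([[0, 0, 0, 1], [0, 0, 2, 2]])
# Class 0 (frequent) gets a weight below 1, class 1 (rare) a weight above 1.
```

The resulting weights would typically multiply the per-pixel cross-entropy loss, so that large classes such as road or sky do not dominate training.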

  • Book Chapter
  • Citations: 16
  • 10.1007/978-3-319-75238-9_23
Residual Encoder and Convolutional Decoder Neural Network for Glioma Segmentation
  • Jan 1, 2018
  • Kamlesh Pawar + 3 more

A deep learning approach to glioma segmentation is presented. An encoder and decoder pair deep learning network is designed which takes T1, T2, T1-CE (contrast enhanced) and T2-Flair (fluid attenuation inversion recovery) images as input and outputs the segmented labels. The encoder is a 49 layer deep residual learning architecture that encodes the \(240 \times 240 \times 4\) input images into \(8 \times 8 \times 2048\) feature maps. The decoder network takes these feature maps and extracts the segmented labels. The decoder network is a fully convolutional network consisting of convolutional and upsampling layers. Additionally, the input images are downsampled using bilinear interpolation and are inserted into the decoder network through concatenation. This concatenation step provides spatial information of the tumor to the decoder, which was lost due to pooling/downsampling during encoding. The network is trained on the BRATS-17 training dataset and validated on the validation dataset. The dice score, sensitivity and specificity of the segmented whole tumor, core tumor and enhancing tumor are computed on the validation dataset. The mean dice scores for whole tumor, core tumor and enhancing tumor on the validation dataset were 0.824, 0.627 and 0.575, respectively.
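
For readers unfamiliar with the Dice score reported above, it is defined for a predicted mask A and a reference mask B as 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch for binary masks stored as flat lists of 0/1 values. The function name `dice_score` is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the abstract.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# One overlapping pixel, two foreground pixels in each mask.
score = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
# score is 0.5
```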

  • Research Article
  • Citations: 8984
  • 10.1007/978-3-030-00889-5_1
UNet++: A Nested U-Net Architecture for Medical Image Segmentation.
  • Jan 1, 2018
  • Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support : 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, held in conjunction with MICCAI 2018, Granada, Spain, S...
  • Zongwei Zhou + 3 more

In this paper, we present UNet++, a new, more powerful architecture for medical image segmentation. Our architecture is essentially a deeply-supervised encoder-decoder network where the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways. The re-designed skip pathways aim at reducing the semantic gap between the feature maps of the encoder and decoder sub-networks. We argue that the optimizer would deal with an easier learning task when the feature maps from the decoder and encoder networks are semantically similar. We have evaluated UNet++ in comparison with U-Net and wide U-Net architectures across multiple medical image segmentation tasks: nodule segmentation in the low-dose CT scans of chest, nuclei segmentation in the microscopy images, liver segmentation in abdominal CT scans, and polyp segmentation in colonoscopy videos. Our experiments demonstrate that UNet++ with deep supervision achieves an average IoU gain of 3.9 and 3.4 points over U-Net and wide U-Net, respectively.

  • Book Chapter
  • Citations: 1
  • 10.1007/978-3-030-73689-7_29
Delving into Feature Maps: An Explanatory Analysis to Evaluate Weight Initialization
  • Jan 1, 2021
  • Meenal Narkhede + 2 more

Convolutional neural networks have delivered exceptional performance in various areas of computer vision. There has been growing research to develop deeper architectures with the availability of large datasets. Training such deep networks on large datasets is a tedious process as it involves optimizing a loss function by updating the parameters of the network. Weight initialization is a vital step before training neural networks as the correct choice of network weights ensures that the optimization converges to global minima in the least time. The weight initialization strategies in the literature can be categorized as (1) Initialization without pre-training, and (2) Initialization with pre-training. This paper presents a comparative analysis of the convergence performance of some widely used weight initialization techniques in these categories. This analysis is based on the diversity insights measured in terms of mean standard deviation captured from the feature maps. The experimentation has been carried out by training the AlexNet and VGG16 network on CIFAR-10 and CIFAR-100 datasets. The experimentation results demonstrate that the He initialization technique, which shows the best convergence performance among the others considered for the study, leads the training process such that the diversity of feature maps increases with epochs for both AlexNet and VGG16 network.

  • Conference Article
  • Citations: 194
  • 10.1109/cvpr.2019.01191
Customizable Architecture Search for Semantic Segmentation
  • Jun 1, 2019
  • Yiheng Zhang + 5 more

In this paper, we propose a Customizable Architecture Search (CAS) approach to automatically generate a network architecture for semantic image segmentation. The generated network consists of a sequence of stacked computation cells. A computation cell is represented as a directed acyclic graph, in which each node is a hidden representation (i.e., feature map) and each edge is associated with an operation (e.g., convolution and pooling), which transforms data to a new layer. During the training, the CAS algorithm explores the search space for an optimized computation cell to build a network. The cells of the same type share one architecture but with different weights. In real applications, however, an optimization may need to be conducted under some constraints such as GPU time and model size. To this end, a cost corresponding to the constraint will be assigned to each operation. When an operation is selected during the search, its associated cost will be added to the objective. As a result, our CAS is able to search an optimized architecture with customized constraints. The approach has been thoroughly evaluated on Cityscapes and CamVid datasets, and demonstrates superior performance over several state-of-the-art techniques. More remarkably, our CAS achieves 72.3% mIoU on the Cityscapes dataset with speed of 108 FPS on an Nvidia TitanXp GPU.

  • Conference Article
  • Citations: 168
  • 10.1109/cvpr.2018.00690
Dense Decoder Shortcut Connections for Single-Pass Semantic Segmentation
  • Jun 1, 2018
  • Piotr Bilinski + 1 more

We propose a novel end-to-end trainable, deep, encoder-decoder architecture for single-pass semantic segmentation. Our approach is based on a cascaded architecture with feature-level long-range skip connections. The encoder incorporates the structure of ResNeXt's residual building blocks and adopts the strategy of repeating a building block that aggregates a set of transformations with the same topology. The decoder features a novel architecture, consisting of blocks, that (i) capture context information, (ii) generate semantic features, and (iii) enable fusion between different output resolutions. Crucially, we introduce dense decoder shortcut connections to allow decoder blocks to use semantic feature maps from all previous decoder levels, i.e. from all higher-level feature maps. The dense decoder connections allow for effective information propagation from one decoder block to another, as well as for multi-level feature fusion that significantly improves the accuracy. Importantly, these connections allow our method to obtain state-of-the-art performance on several challenging datasets, without the need of time-consuming multi-scale averaging of previous works.

  • Research Article
  • 10.1049/ipr2.70204
SPWS‐Transformer: A Study of 3D Target Detection Method Based on Lightweight Depth Prediction With Multi‐Scale Fusion
  • Jan 1, 2025
  • IET Image Processing
  • Chang'An Zhang + 4 more

Advanced driver assistance systems (ADAS) mainly consist of three components: environmental perception, decision planning, and motion control. As a fundamental component of the ADAS environmental perception system, 3D object detection enables vehicles to avoid obstacles and ensure driving safety only through accurate and real‐time prediction and localization of three‐dimensional targets such as vehicles and pedestrians in road scenes. Therefore, to improve both the real‐time performance and accuracy of 3D object detection, we propose a lightweight depth prediction‐based 3D object detection model with multi‐scale fusion—SPWS‐Transformer. First, to enhance the model's accuracy, we propose a feature extraction network incorporating multi‐scale feature fusion and depth prediction. By designing a multi‐scale feature fusion module, we effectively combine multi‐scale semantic and fine‐grained information from feature maps of different scales to enhance the network's feature extraction capability. To capture spatial information from the feature maps, we apply convolution, group normalization, and nonlinear activation operations on the fused feature maps to generate depth feature maps. Both the fused feature maps and depth feature maps serve as inputs for subsequent network stages. To further improve accuracy, we leverage the long‐range modelling advantages of Transformers by designing a feature enhancement encoder to strengthen the representation capability of depth feature maps. We incorporate a dilated encoder to perform positional encoding on depth feature maps and utilize multi‐head self‐attention mechanisms to capture contextual relationships within the input scene, thereby enhancing the detection capability of the 3D object detection network. Then, to improve real‐time performance, we design a decoder structure with scale‐aware attention. 
By predefining masks of different scales, we adaptively learn a scale‐aware filter using depth and visual features to enhance object queries. Finally, on the KITTI dataset, the improved algorithm achieves an AP of 24.66% for the car category, with more significant improvements in detection accuracy under the ‘hard’ difficulty level. The model achieves an inference time of 24 ms.

  • Research Article
  • Citations: 5
  • 10.5194/isprs-archives-xlii-5-621-2018
ORTHOSEG: A DEEP MULTIMODAL CONVOLUTONAL NEURAL NETWORK ARCHITECTURE FOR SEMANTIC SEGMENTATION OF ORTHOIMAGERY
  • Nov 19, 2018
  • The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
  • P. Bodani + 2 more

Abstract. This paper addresses the task of semantic segmentation of orthoimagery using multimodal data e.g. optical RGB, infrared and digital surface model. We propose a deep convolutional neural network architecture termed OrthoSeg for semantic segmentation using multimodal, orthorectified and coregistered data. We also propose a training procedure for supervised training of OrthoSeg. The training procedure complements the inherent architectural characteristics of OrthoSeg for preventing complex co-adaptations of learned features, which may arise due to probable high dimensionality and spatial correlation in multimodal and/or multispectral coregistered data. OrthoSeg consists of parallel encoding networks for independent encoding of multimodal feature maps and a decoder designed for efficiently fusing independently encoded multimodal feature maps. A softmax layer at the end of the network uses the features generated by the decoder for pixel-wise classification. The decoder fuses feature maps from the parallel encoders locally as well as contextually at multiple scales to generate per-pixel feature maps for final pixel-wise classification resulting in segmented output. We experimentally show the merits of OrthoSeg by demonstrating state-of-the-art accuracy on the ISPRS Potsdam 2D Semantic Segmentation dataset. Adaptability is one of the key motivations behind OrthoSeg so that it serves as a useful architectural option for a wide range of problems involving the task of semantic segmentation of coregistered multimodal and/or multispectral imagery. Hence, OrthoSeg is designed to enable independent scaling of parallel encoder networks and decoder network to better match application requirements, such as the number of input channels, the effective field-of-view, and model capacity.

  • Research Article
  • Citations: 30
  • 10.1109/tgrs.2022.3157721
Multilevel Deformable Attention-Aggregated Networks for Change Detection in Bitemporal Remote Sensing Imagery
  • Jan 1, 2022
  • IEEE Transactions on Geoscience and Remote Sensing
  • Xiaokang Zhang + 2 more

Deep learning (DL) approaches based on convolutional encoder–decoder networks have shown promising results in bitemporal change detection. However, their performance is limited by insufficient contextual information aggregation because they cannot fully capture the implicit contextual dependency relationships among feature maps at different levels. Moreover, harvesting long-range contextual information typically incurs high computational complexity. To circumvent these challenges, we propose multilevel deformable attention-aggregated networks (MLDANets) to effectively learn long-range dependencies across multiple levels of bitemporal convolutional features for multiscale context aggregation. Specifically, a multilevel change-aware deformable attention (MCDA) module consisting of linear projections with learnable parameters is built based on multihead self-attention (SA) with a deformable sampling strategy. It is applied in the skip connections of an encoder–decoder network taking a bitemporal deep feature hypersequence (BDFH) as input. MCDA can progressively address a set of informative sampling locations in multilevel feature maps for each query element in the BDFH. Simultaneously, MCDA learns to characterize beneficial information from different spatial and feature subspaces of BDFH using multiple attention heads for change perception. As a result, contextual dependencies across multiple levels of bitemporal feature maps can be adaptively aggregated via attention weights to generate multilevel discriminative change-aware representations. Experiments on very-high-resolution (VHR) datasets verify that MLDANets outperform state-of-the-art change detection approaches with dramatically faster training convergence and high computational efficiency.

  • Conference Article
  • 10.1109/icnisc57059.2022.00018
End-to-End Object Detection with Location-Sensitive Cues
  • Sep 1, 2022
  • Chunzhe Wang + 5 more

A successful object detector should be consistent with the predicted results of human vision systems. However, object detectors can't accurately reflect the human perception of objects in the image. We develop an object detection algorithm based on a spatial attention mechanism, combined with location-sensitive cues of objects. Firstly, the feature maps, which exactly describe objects at multiple scales, are extracted from the image to be detected using convolutional neural networks (CNNs). Secondly, a feature map of appropriate scale is selected as the candidate feature map and a spatial attention model is adopted to obtain its weight matrix; a feature map that incorporates object saliency information is then generated from the candidate feature map and its weight matrix through mathematical modeling. Finally, the candidate feature map and other feature maps at different scales are used to predict objects through classification and regression strategies on CNNs. The experimental results demonstrate that the proposed algorithm performs well on PASCAL VOC 2007 and PASCAL VOC 2012. The mean average precision (mAP) of the proposed algorithm is higher than that of attention-based object detection approaches, which indicates that the proposed algorithm can perceive the objects in the image well.
