Abstract
Boundary-pixel blur and class imbalance are common problems in semantic segmentation of urban remote sensing images. Inspired by DenseU-Net, this paper proposes a new end-to-end network, SiameseDenseU-Net. First, the network takes both the true orthophoto (TOP) image and its corresponding normalized digital surface model (nDSM) as inputs. Deep image features are extracted in parallel by downsampling blocks, and shallow texture information and high-level abstract semantic features are fused through the connection channels. The features extracted by the two parallel processing chains are then fused. Finally, a softmax layer performs prediction to generate dense label maps. Experiments on the Vaihingen dataset show that SiameseDenseU-Net improves the F1-score by 8.2% and 7.63% over the Hourglass-ShapeNetwork (HSN) model and the U-Net model, respectively. Regarding boundary pixels, when using the same focal loss function weighted by median frequency balancing, SiameseDenseU-Net improves the F1-score of the small-target "car" category by 0.92% over the original DenseU-Net. The overall accuracy and the average F1-score also improve to varying degrees. The proposed SiameseDenseU-Net is better at identifying small-target categories and boundary pixels, and it is both numerically and visually superior to the comparison models.
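The loss named in the abstract is a focal loss whose per-class weights come from median frequency balancing. The paper provides no code, so the following PyTorch sketch is only an illustrative reading: the helper names are ours and `gamma=2.0` is the common focal-loss default, not a value confirmed by the authors.

```python
import torch
import torch.nn.functional as F

def median_frequency_weights(pixel_counts: torch.Tensor) -> torch.Tensor:
    """Median frequency balancing: w_c = median(freq) / freq_c, where
    freq_c is the fraction of training pixels labelled with class c."""
    freqs = pixel_counts.float() / pixel_counts.sum()
    return freqs.median() / freqs

def balanced_focal_loss(logits, target, class_weights, gamma=2.0):
    """Focal loss with median-frequency class weights.
    logits: (N, C, H, W) raw scores; target: (N, H, W) integer labels."""
    log_p = F.log_softmax(logits, dim=1)
    # per-pixel weighted cross entropy, kept unreduced for the focal term
    ce = F.nll_loss(log_p, target, weight=class_weights, reduction="none")
    p_t = log_p.gather(1, target.unsqueeze(1)).squeeze(1).exp()  # prob. of true class
    return ((1.0 - p_t) ** gamma * ce).mean()
```

In training, `pixel_counts` would be the per-class pixel totals over the training set; a rare class such as "car" receives the largest weight, which is what lets the loss emphasize small targets.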
Highlights
Semantic segmentation is an important problem in the field of computer vision
We further explore the potential of convolutional neural networks (CNNs) for end-to-end semantic segmentation of high-resolution remote sensing images
SiameseDenseU-Net uses two similar parallel DenseU-Nets, each composed of an encoder and a decoder. The encoder consists of five consecutive downsampling blocks that double the number of feature dimensions, while the decoder consists of five consecutive upsampling blocks that halve the number of feature dimensions. The input features pass through the downsampling blocks to capture context and obtain a hierarchical representation, and the upsampling blocks recover the resolution of the extracted features, restoring the spatial position information lost by the encoder (a minimal sketch of this two-branch structure is given below)
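The authors do not release code, so the following PyTorch skeleton is a hedged reading of the highlight above: `base=32`, plain convolution blocks in place of the paper's densely connected blocks, and channel concatenation as the fusion step are all our illustrative assumptions.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Downsampling block: convolution doubles the channels, pooling halves
    the resolution. (Plain convs stand in for the paper's dense blocks.)"""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        f = self.conv(x)
        return f, self.pool(f)          # f is kept for the skip connection

class UpBlock(nn.Module):
    """Upsampling block: transposed conv halves the channels and doubles the
    resolution, then the matching encoder feature is fused via the skip path."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in, c_out, 2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(c_out * 2, c_out, 3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = self.up(x)
        return self.conv(torch.cat([x, skip], dim=1))

class Stream(nn.Module):
    """One encoder-decoder branch: five down blocks, five up blocks."""
    def __init__(self, c_in, base=32):
        super().__init__()
        chans = [base * 2 ** i for i in range(5)]      # channels double per stage
        self.downs = nn.ModuleList(
            DownBlock(ci, co) for ci, co in zip([c_in] + chans[:-1], chans))
        self.bottom = nn.Conv2d(chans[-1], chans[-1] * 2, 3, padding=1)
        self.ups = nn.ModuleList(UpBlock(c * 2, c) for c in reversed(chans))

    def forward(self, x):
        skips = []
        for down in self.downs:
            f, x = down(x)
            skips.append(f)
        x = self.bottom(x)
        for up, skip in zip(self.ups, reversed(skips)):
            x = up(x, skip)
        return x                                       # (N, base, H, W)

class SiameseDenseUNet(nn.Module):
    """Two parallel streams (TOP image and nDSM); their outputs are fused by
    concatenation, and a 1x1 convolution produces per-class scores."""
    def __init__(self, n_classes, base=32):
        super().__init__()
        self.top_stream = Stream(c_in=3, base=base)    # orthophoto branch
        self.ndsm_stream = Stream(c_in=1, base=base)   # nDSM branch
        self.classify = nn.Conv2d(base * 2, n_classes, 1)

    def forward(self, top, ndsm):
        fused = torch.cat([self.top_stream(top), self.ndsm_stream(ndsm)], dim=1)
        return self.classify(fused)   # softmax over dim=1 gives the dense label map
```

As a quick shape check, with inputs whose sides are divisible by 32, `SiameseDenseUNet(n_classes=6)(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256))` returns a `(1, 6, 256, 256)` score map; six classes matches the Vaihingen benchmark.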
Summary
Semantic segmentation is an important problem in the field of computer vision. Image semantic segmentation aims to assign to each pixel in an image the most appropriate class label drawn from a predefined, limited set of labels. In 2012, the AlexNet network proposed by Krizhevsky et al. [1] sparked a new wave of deep learning applications in imaging. Tsogkas and Kokkinos [2] combined a convolutional neural network (CNN) with a fully connected conditional random field (CRF) to learn the lost prior information. Long et al. [5] proposed the fully convolutional network (FCN) to classify images at the pixel level. Unlike a classic CNN, an FCN can accept an input image of any size and restore the output to the same size as the input, as the toy sketch below illustrates.
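To make the FCN idea concrete: replacing fully connected layers with a 1x1 convolutional classifier and a final upsampling step is what lets the network handle arbitrary input sizes. The toy module below is only a minimal illustration of that mechanism, not Long et al.'s actual architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Toy fully convolutional head: with no fully connected layers, any input
    size works, and the score map is upsampled back to the input resolution."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2))
        self.score = nn.Conv2d(32, n_classes, 1)  # 1x1 conv replaces the FC classifier

    def forward(self, x):
        h, w = x.shape[-2:]
        y = self.score(self.features(x))          # (N, C, h/4, w/4)
        return F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False)
```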