The context information of images is lost due to the low resolution of feature maps produced by repeated combinations of max-pooling and down-sampling layers. When feature extraction is performed with a convolutional network, the resulting semantic segmentation loses sensitivity to object location. A semantic image segmentation method based on a feature fusion model that combines context features layer by layer is proposed. Firstly, the original images are pre-processed with a Gaussian kernel function to generate a series of images at different resolutions, forming an image pyramid. Secondly, the image pyramid is fed into a network structure in which multiple fully convolutional networks are combined in parallel: a set of initial features at different granularities is obtained by expanding the receptive fields with atrous convolutions, and these features are fused layer by layer in a top-down manner. Finally, the score map of the feature fusion model is computed and passed to a fully connected conditional random field, which models class correlations between pixels of the original image; the spatial position information and color vector information of the pixels are used jointly to optimize the result. Experiments on the PASCAL VOC 2012 and PASCAL Context datasets achieve a better mean Intersection over Union (mIoU) than state-of-the-art methods; the proposed method improves on conventional methods by about 6.3%.
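As a rough illustration of the pipeline the abstract describes, the following is a minimal PyTorch sketch, not the authors' implementation: the module names (FusionNet, PyramidBranch, gaussian_pyramid), the branch widths, the dilation rates, and the use of average pooling as a stand-in for Gaussian-kernel smoothing are all assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidBranch(nn.Module):
    """One fully convolutional branch; the dilation rate widens the
    receptive field without additional down-sampling (atrous convolution)."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.conv(x)

def gaussian_pyramid(img, levels):
    """Blur-and-subsample pyramid; avg_pool2d stands in for the
    Gaussian-kernel smoothing described in the paper (an assumption)."""
    pyr = [img]
    for _ in range(levels - 1):
        img = F.avg_pool2d(img, kernel_size=2)
        pyr.append(img)
    return pyr

class FusionNet(nn.Module):
    def __init__(self, num_classes=21, width=64):
        super().__init__()
        # One parallel branch per pyramid level, with growing dilation rates.
        self.branches = nn.ModuleList(
            PyramidBranch(3, width, d) for d in (1, 2, 4)
        )
        self.classifier = nn.Conv2d(width, num_classes, 1)

    def forward(self, img):
        pyr = gaussian_pyramid(img, levels=len(self.branches))
        feats = [b(x) for b, x in zip(self.branches, pyr)]
        # Top-down fusion: upsample the coarsest features and add them
        # into the next finer level, layer by layer.
        fused = feats[-1]
        for f in reversed(feats[:-1]):
            fused = f + F.interpolate(fused, size=f.shape[-2:],
                                      mode="bilinear", align_corners=False)
        return self.classifier(fused)  # score map (per-class logits)

scores = FusionNet()(torch.randn(1, 3, 128, 128))
print(scores.shape)  # torch.Size([1, 21, 128, 128])
```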
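The final refinement step, which couples spatial position with pixel color vectors over the network's score map, can be realized with a fully connected CRF such as the one in the pydensecrf library. The sketch below assumes that library; the kernel parameters (sxy, srgb, compat) and the iteration count are illustrative values, not the paper's settings.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, softmax, iters=5):
    """Refine per-pixel class probabilities with a fully connected CRF.

    image   : H x W x 3 uint8 RGB array (the original image)
    softmax : C x H x W float32 class probabilities (the score map)
    """
    c, h, w = softmax.shape
    d = dcrf.DenseCRF2D(w, h, c)
    d.setUnaryEnergy(np.ascontiguousarray(unary_from_softmax(softmax)))
    # Spatial-position kernel: nearby pixels are encouraged to agree.
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Bilateral kernel: joins position with the RGB color vector, so
    # pixels that are both close and similar in color share a label.
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(iters)  # mean-field inference
    return np.argmax(np.array(q), axis=0).reshape(h, w)
```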