Dense Prediction Tasks Research Articles

Although fully convolutional networks (FCNs) have dominated dense prediction tasks (e.g., semantic segmentation, depth estimation and object detection) for decades, their stacks of local kernels are inherently limited in capturing long-range structured relationships. Recent Transformer-based models have proven extremely successful in computer vision by capturing global representations, but they can degrade dense prediction results by over-smoothing regions that contain fine details (e.g., boundaries and small objects). To this end, we offer an alternative perspective by rethinking local and global feature representation for dense prediction. Specifically, we propose a Dual-Stream Convolution-Transformer architecture, called DSCT, which exploits both convolution and the Transformer to learn a rich feature representation and pairs it with a task decoder to form a powerful dense prediction model. DSCT extracts high-resolution local feature representations from convolution layers and global feature representations from Transformer layers. With local and global context modeled explicitly in every layer, the two streams can be combined with a decoder to perform semantic segmentation, monocular depth estimation or object detection. Extensive experiments show that DSCT achieves superior performance on all three tasks. For semantic segmentation, DSCT sets a new state of the art on the Cityscapes validation set (83.31% mIoU) with only 80,000 training iterations and achieves appealing performance (49.27% mIoU) on the ADE20K validation set, outperforming most of the alternatives. For monocular depth estimation, our model achieves 2.423 RMSE on the KITTI Eigen split, superior to most convolution or Transformer counterparts. For object detection, without using FPN, DSCT achieves 44.5% AP^b on the COCO dataset with Faster R-CNN, which is higher than Conformer.
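The contrast between the two streams can be made concrete with a toy example. The sketch below is not DSCT itself: it assumes a 1-D signal with scalar features, a fixed smoothing kernel standing in for a convolution layer, and unparameterised dot-product attention standing in for a Transformer layer. All function names are hypothetical.

```python
import math

def local_stream(x, kernel=(0.25, 0.5, 0.25)):
    """Convolution-like stream: each output mixes only neighbouring positions."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            # Replicate padding: clamp the index at the borders.
            j = min(max(i + k - radius, 0), len(x) - 1)
            acc += w * x[j]
        out.append(acc)
    return out

def global_stream(x):
    """Attention-like stream: each output is a softmax-weighted mix of ALL positions."""
    out = []
    for q in x:
        scores = [q * k for k in x]           # dot-product similarity (scalar features)
        m = max(scores)                       # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / z)
    return out

def dual_stream_features(x):
    """Pair the local and global feature at every position (input to a decoder)."""
    return list(zip(local_stream(x), global_stream(x)))

signal = [0.0, 0.0, 1.0, 0.0, 0.0]
features = dual_stream_features(signal)
```

In this toy setting the local stream only spreads the spike to its immediate neighbours, while the global stream gives every position a value that depends on the whole signal. A real decoder would consume both per-position features.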

Recently proposed improvements in computer vision focus on enhancing the feature processing capabilities of Single-Task Convolutional Neural Networks. A typical Single-Task network consists of a backbone and a head, where the backbone (the feature extractor) is usually optimised using the gradient provided by the head. Inevitably, the backbone specialises for the given task. This approach does not scale well to learning multiple tasks at once from the same input. In response, there is increasing interest in Multi-Task formulations. Since most Multi-Task architectures employ a single shared backbone, propagating gradients from different tasks back to it can oversaturate it. This problem may be addressed with Multi-Backbone feature extractors. To compensate for these shortcomings, we introduce MBMT-Net, a Multi-Backbone-Multi-Task-Network architecture based on a development strategy that infuses backbones with more diverse and specialised processing capabilities. MBMT-Net consists of parallel pre-trained backbones whose outputs are concatenated and passed to the Multi-Task heads, which benefit from richer and more diverse features with fewer network parameters than traditional Multi-Task architectures. Our strategy is architecture independent and can be applied to different types of backbones and parsing heads, which greatly extends the domain of configurable features, enhancing existing Single- and Multi-Task model building strategies and outperforming them when using the Multi-Backbone design. As a result, despite having 12.16M fewer parameters, MBMT-Net reaches state-of-the-art performance and surpasses the previously best semantic segmentation Multi-Task model in terms of Mean Intersection over Union when evaluated on the NYUv2 data set.
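The data flow described in this abstract (parallel backbones, concatenated outputs, several task heads) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the MBMT-Net implementation: the two "backbones" are hand-written feature functions rather than pre-trained CNNs, the heads are fixed linear maps, and every name (`backbone_a`, `backbone_b`, `task_a`, `task_b`) is hypothetical.

```python
def backbone_a(image):
    """Hypothetical backbone A: brightness statistics (mean, max)."""
    flat = [p for row in image for p in row]
    return [sum(flat) / len(flat), max(flat)]

def backbone_b(image):
    """Hypothetical backbone B: mean horizontal contrast."""
    diffs = [abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)]
    return [sum(diffs) / len(diffs)]

def linear_head(weights, bias):
    """Build a task head that applies a fixed linear map to the shared features."""
    def head(features):
        return sum(w * f for w, f in zip(weights, features)) + bias
    return head

def multi_backbone_multi_task(image, backbones, heads):
    """Run all backbones on the same input, concatenate, then feed every head."""
    shared = []
    for backbone in backbones:
        shared.extend(backbone(image))        # feature concatenation
    outputs = {name: head(shared) for name, head in heads.items()}
    return outputs, shared

image = [[0.0, 1.0], [1.0, 0.0]]
heads = {
    "task_a": linear_head([1.0, 0.0, 0.0], 0.0),   # placeholder task names
    "task_b": linear_head([0.0, 0.0, 2.0], 0.5),
}
outputs, shared = multi_backbone_multi_task(image, [backbone_a, backbone_b], heads)
```

Each head sees the full concatenated feature vector, so a head can draw on whichever backbone's features suit its task. How MBMT-Net trains the backbones and heads is not specified here and is left out of the sketch.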

Articles published on Dense Prediction Tasks

Tripartite Feature Enhanced Pyramid Network for Dense Prediction.

Changer: Feature Interaction is What You Need for Change Detection

Adversarial Dense Contrastive Learning for Semi-supervised Semantic Segmentation.

Retain and Recover: Delving into Information Loss for Few-Shot Segmentation.

Centralized Feature Pyramid for Object Detection.

LPCL: Localized prominence contrastive learning for self-supervised dense visual pre-training

Rethinking Local and Global Feature Representation for Dense Prediction

Cattle Segmentation and Contour Detection Based on Solo for Precision Livestock Husbandry

DenseCL: A simple framework for self-supervised dense visual pre-training

Vision transformers for dense prediction: A survey

Less Is More: Pay Less Attention in Vision Transformers

How Useful Is Image-Based Active Learning for Plant Organ Segmentation?

Index Networks.

MBMT-Net: A Multi-Task Learning Based Convolutional Neural Network Architecture for Dense Prediction Tasks

Heterogeneous Contrastive Learning: Encoding Spatial Information for Compact Visual Representations

A 3-D-Swin Transformer-Based Hierarchical Contrastive Learning Method for Hyperspectral Image Classification

STransUNet: A Siamese TransUNet-Based Remote Sensing Image Change Detection Network

Self-Supervised SAR-Optical Data Fusion of Sentinel-1/-2 Images

Smoothed dilated convolutions for improved dense prediction

Polynomial approximation based spectral dual graph convolution for scene parsing and segmentation

