Abstract

Using light-weight architectures or reasoning on low-resolution images, recent methods achieve very fast scene parsing, even running at more than 100 FPS on a single GPU. However, there is still a significant performance gap between these real-time methods and models based on dilation backbones. To this end, we propose a family of deep dual-resolution networks (DDRNets) for real-time and accurate semantic segmentation, which consist of deep dual-resolution backbones and enhanced low-resolution contextual information extractors. The two deep branches and multiple bilateral fusions of the backbones generate higher-quality details than existing two-pathway methods. The enhanced contextual information extractor, named Deep Aggregation Pyramid Pooling Module (DAPPM), enlarges effective receptive fields and fuses multi-scale context from low-resolution feature maps at little time cost. Our method achieves a new state-of-the-art trade-off between accuracy and speed on both the Cityscapes and CamVid datasets. For full-resolution input, on a single 2080Ti GPU without hardware acceleration, DDRNet-23-slim yields 77.4% mIoU at 102 FPS on the Cityscapes test set and 74.7% mIoU at 230 FPS on the CamVid test set. With widely used test augmentation, our method is superior to most state-of-the-art models while requiring much less computation. Code and trained models are available at https://github.com/ydhongHIT/DDRNet.
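The core idea behind DAPPM, as described above, is to pool low-resolution feature maps at several rates, upsample the results, and fuse them to enlarge the effective receptive field. The toy sketch below illustrates that general pyramid-pooling pattern in plain NumPy; it is a simplified illustration under assumed pooling rates and summation fusion, not the authors' exact DAPPM (which uses hierarchical aggregation with learned convolutions).

```python
import numpy as np

def avg_pool(x, k):
    """Average-pool a 2-D feature map with kernel size and stride k (toy, no padding)."""
    h, w = x.shape
    h2, w2 = h // k, w // k
    return x[:h2 * k, :w2 * k].reshape(h2, k, w2, k).mean(axis=(1, 3))

def upsample_nearest(x, k):
    """Nearest-neighbour upsampling by an integer factor k."""
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def pyramid_pool(x, scales=(2, 4, 8)):
    """Fuse multi-scale context: pool at several rates, upsample back,
    and sum with the input map (summation stands in for learned fusion)."""
    out = x.copy()
    for k in scales:
        out += upsample_nearest(avg_pool(x, k), k)
    return out

# Example: a 16x16 low-resolution feature map
feat = np.random.rand(16, 16)
fused = pyramid_pool(feat)
print(fused.shape)  # (16, 16) — same spatial size, enlarged receptive field
```

In the real module each branch is followed by convolutions and the scales are aggregated hierarchically; the sketch only conveys why pooling on already-low-resolution maps adds little time cost: each pooled branch operates on progressively fewer elements.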
