Abstract

Downsampling input images is a simple trick to speed up visual object-detection algorithms, especially on robotic vision and applied mobile vision systems. However, this trick comes with a significant decline in accuracy. In this paper, dual-resolution dual-path Convolutional Neural Networks (CNNs), named DualNets, are proposed to bump up the accuracy of those detection applications. In contrast to previous methods that simply downsample the input images, DualNets explicitly take dual inputs at different resolutions and extract complementary visual features from them using dual CNN paths. The two paths in a DualNet are a backbone path and an auxiliary path that accepts larger inputs and then rapidly downsamples them to relatively small feature maps. With the help of the carefully designed auxiliary CNN paths in DualNets, auxiliary features are extracted from the larger input with controllable computation. Auxiliary features are then fused with the backbone features using a proposed progressive residual fusion strategy to enrich the feature representation. This architecture, as the feature extractor, is further integrated with the Single Shot Detector (SSD) to accomplish latency-sensitive visual object-detection tasks. We evaluate the resulting detection pipeline on the Pascal VOC and MS COCO benchmarks. Results show that the proposed DualNets can raise the accuracy of CNN detection applications that are sensitive to computation payloads.
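
For intuition, the following is a minimal sketch, in PyTorch, of how a dual-resolution dual-path feature extractor with residual fusion could be wired. The module names (AuxiliaryPath, DualNetSketch) and layer widths are illustrative assumptions, not the authors' released configuration, and only a single fusion step is shown, whereas the paper fuses features progressively at several stages.

    # Sketch of a dual-resolution, dual-path feature extractor with residual
    # fusion. Layer widths and module names are hypothetical; DualNets fuse
    # features progressively at multiple stages rather than once as shown here.
    import torch.nn as nn
    import torch.nn.functional as F

    class AuxiliaryPath(nn.Module):
        """Takes the larger input and rapidly downsamples it to small feature maps."""
        def __init__(self, out_channels=128):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, out_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )

        def forward(self, x_large):
            return self.layers(x_large)

    class DualNetSketch(nn.Module):
        def __init__(self, backbone, aux_channels=128, backbone_channels=128):
            super().__init__()
            self.backbone = backbone                # e.g. a MobileNet trunk on the small input
            self.aux = AuxiliaryPath(aux_channels)  # shallow path on the large input
            self.project = nn.Conv2d(aux_channels, backbone_channels, 1)

        def forward(self, x_small, x_large):
            f_backbone = self.backbone(x_small)
            f_aux = self.project(self.aux(x_large))
            # Match the backbone's spatial size, then fuse residually.
            f_aux = F.interpolate(f_aux, size=f_backbone.shape[-2:],
                                  mode='bilinear', align_corners=False)
            return f_backbone + f_aux               # SSD detection heads attach downstream

The point of this design is that the auxiliary path sees the high-resolution input but spends little computation on it, because it downsamples aggressively and stacks few layers.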

Highlights

  • There is a trend of integrating robotics with human beings and their environments

  • Results show that the proposed DualNets can raise the accuracy of Convolutional Neural Network (CNN) detection applications that are sensitive to computation payloads

  • To diminish existing performance gaps, in this paper we propose DualNets, dual-resolution dual-path CNNs, to bump up the accuracy of object-detection applications that are sensitive to computation payloads, such as those deployed on embedded devices

Summary

Introduction

There is a trend of integrating robotics with human beings and their environments, which calls for visual perception that runs in real time on resource-limited hardware. Because the computational cost of a CNN grows roughly quadratically with input resolution (doubling both the input width and height quadruples the number of pixels to process), it is not feasible to accept large inputs on all applied systems. To address this problem, efficient CNN models were proposed for embedded devices and have achieved high inference speed, but with a non-negligible accuracy drop [10,11,12]. Depthwise separable convolutions were proposed in MobileNets [10] to bring down the computational cost. These mobile-oriented CNN architectures, as feature extractors, are used in conjunction with the detectors mentioned above, resulting in CNN-based object-detection pipelines with high inference speed but limited detection accuracy. In DualNets, auxiliary features are extracted from the large inputs by the designed auxiliary path, which has fewer stacked layers to reduce computational costs. Fusing the complementary features from both paths with a progressive fusion strategy helps to improve the detection results. By applying the fusion strategy to complementary features extracted by the dual paths, DualNets can raise the accuracy of mobile-oriented CNN detectors.
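
For reference, the depthwise separable convolution from MobileNets [10] cited above can be written as a depthwise 3x3 convolution followed by a pointwise 1x1 convolution. The sketch below (PyTorch, with an illustrative function name) shows the building block and a rough cost comparison.

    # MobileNet-style depthwise separable convolution: a per-channel 3x3
    # depthwise convolution followed by a 1x1 pointwise convolution, replacing
    # a standard 3x3 convolution at a fraction of the multiply-add cost.
    import torch.nn as nn

    def depthwise_separable(in_ch, out_ch, stride=1):
        return nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                      groups=in_ch, bias=False),             # depthwise: groups == in_ch
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),          # pointwise
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    # Rough per-position cost with C_in = C_out = 256:
    #   standard 3x3:  3*3*256*256 ≈ 590k multiply-adds
    #   separable:     3*3*256 + 256*256 ≈ 68k, roughly an 8-9x reduction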

Related Work
CNN-Based Object Detection
Fast Inference Using Small CNN Models
Dual-Path Models
DualNets
Brief Review of MobileNets and SSD
Dual Inputs and Dual Paths
Progressive Residual Fusion
Experiments
Ablation Study on DualNet-300
Weight Sharing
Initializing from a Pretrained Model
Fusion Strategy
Results
Conclusions