Abstract
Downsampling input images is a simple trick to speed up visual object-detection algorithms, especially in robotic vision and applied mobile vision systems. However, this trick comes with a significant decline in accuracy. In this paper, dual-resolution dual-path Convolutional Neural Networks (CNNs), named DualNets, are proposed to improve the accuracy of those detection applications. In contrast to previous methods that simply downsample the input images, DualNets explicitly take dual inputs at different resolutions and extract complementary visual features from them using dual CNN paths. The two paths in a DualNet are a backbone path and an auxiliary path; the auxiliary path accepts larger inputs and rapidly downsamples them to relatively small feature maps. With the help of the carefully designed auxiliary CNN paths in DualNets, auxiliary features are extracted from the larger input at a controllable computational cost. The auxiliary features are then fused with the backbone features using a proposed progressive residual fusion strategy to enrich the feature representation. This architecture, as the feature extractor, is further integrated with the Single Shot Detector (SSD) to accomplish latency-sensitive visual object-detection tasks. We evaluate the resulting detection pipeline on the Pascal VOC and MS COCO benchmarks. Results show that the proposed DualNets can raise the accuracy of CNN detection applications that are sensitive to computational load.
Highlights
In robotic applications, there is a trend of integrating robotics with human beings and their environments.
Results show that the proposed DualNets can raise the accuracy of Convolutional Neural Network (CNN) detection applications that are sensitive to computational load.
To narrow existing performance gaps, in this paper we propose DualNets, dual-resolution dual-path CNNs, to improve the accuracy of object-detection applications that are sensitive to computational load, such as those deployed on embedded devices.
Summary
There is a trend of integrating robotics with human beings and their environments. Because the computational cost roughly quadruples when the input width is doubled at a fixed aspect ratio, it is not feasible to accept large inputs on all applied systems. To address this problem, efficient CNN models were proposed for embedded devices and have achieved high inference speed with a non-negligible accuracy drop [10,11,12]. Depthwise separable convolutions were proposed in MobileNets [10] to bring down the computational cost. These mobile-oriented CNN architectures, as feature extractors, are used in conjunction with the detectors mentioned above, resulting in CNN-based object-detection pipelines with high inference speed but limited detection accuracy. In DualNets, auxiliary features are extracted from the large inputs by the designed auxiliary path, which has fewer stacked layers to reduce computational costs. Fusing the complementary features from both paths with a progressive fusion strategy helps to improve the detection results. By applying the fusion strategy to the complementary features extracted by the dual paths, DualNets can raise the accuracy of mobile-oriented CNN detectors.
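The two cost arguments in the summary (how convolution cost scales with input size, and why depthwise separable convolutions are cheaper) can be checked with simple arithmetic. The sketch below counts multiply-accumulate operations for a single convolution layer. The function names, the 300 and 600 pixel input sizes, the 64 and 128 channel counts, and the 3x3 kernel are illustrative assumptions, not values taken from the paper.

```python
def conv_macs(height, width, c_in, c_out, kernel=3, stride=1):
    """Multiply-accumulates of a standard convolution with 'same' padding."""
    out_h = -(-height // stride)  # ceiling division
    out_w = -(-width // stride)
    return out_h * out_w * c_in * c_out * kernel * kernel


def depthwise_separable_macs(height, width, c_in, c_out, kernel=3, stride=1):
    """Multiply-accumulates of a depthwise conv followed by a 1x1 pointwise conv."""
    out_h = -(-height // stride)
    out_w = -(-width // stride)
    depthwise = out_h * out_w * c_in * kernel * kernel  # one filter per input channel
    pointwise = out_h * out_w * c_in * c_out            # 1x1 conv mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 64 -> 128 channels, 3x3 kernel (assumed, not from the paper).
small = conv_macs(300, 300, 64, 128)
large = conv_macs(600, 600, 64, 128)
input_scaling = large / small  # doubling width and height multiplies the cost by 4

separable = depthwise_separable_macs(300, 300, 64, 128)
separable_ratio = separable / small  # equals 1/c_out + 1/kernel**2

print(f"cost ratio for 2x larger input: {input_scaling:.1f}")
print(f"separable / standard cost:      {separable_ratio:.3f}")
```

Under these assumed sizes, doubling the input side length quadruples the layer cost, while replacing the standard convolution with a depthwise separable one reduces it to roughly 12% of the original. This is consistent with the motivation for an auxiliary path that downsamples a large input quickly through only a few layers.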