Abstract

How to properly incorporate text characteristics such as multiple scales, arbitrary orientations, and large aspect ratios into detection network design has become a hot topic in computer vision. The Feature Pyramid Network (FPN) is a typical method for achieving robust text detection: its low-level and high-level feature maps retain spatial structure and global semantic information, respectively. However, its strict hierarchical structure fails to fuse low-level and high-level information to improve the discriminative ability of the feature maps. To address this problem, we propose a novel feature fusion pyramid network for end-to-end scene text detection that fuses multi-modal information. By dividing the pyramid structure into high-level and low-level layers, channel and spatial attention modules are adopted to enhance the high-level and low-level feature representations by encoding channel-wise and spatial-wise context information, respectively. To reduce the information loss caused by layer-to-layer transmission, a dedicated residual network provides a shortcut between high-level and low-level features, realizing multi-modal feature fusion. Experiments show that the precision and recall of the proposed method on the ICDAR2015, ICDAR2017-MLT and MSRA-TD500 datasets reach 88.7%/82.1%, 77.0%/60.3% and 85.3%/74.8%, respectively.
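
The following is a minimal sketch of the kind of fusion the abstract describes: channel attention applied to a high-level feature map, spatial attention applied to a low-level feature map, and a residual shortcut that combines the two. It assumes a PyTorch implementation; the module names, channel sizes, and exact attention formulations are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed form)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Global average pooling -> per-channel weights in [0, 1].
        w = self.fc(x.mean(dim=(2, 3)))
        return x * w.unsqueeze(-1).unsqueeze(-1)


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (assumed form)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Channel-wise mean and max maps -> single-channel spatial mask.
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.max(dim=1, keepdim=True).values
        mask = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * mask


class ResidualFusion(nn.Module):
    """Fuses an attended high-level map with an attended low-level map
    through a residual shortcut (assumed layout)."""
    def __init__(self, channels):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, low, high):
        # Upsample the high-level map to the low-level resolution.
        high = F.interpolate(high, size=low.shape[2:], mode="nearest")
        high = self.channel_att(high)   # emphasize semantic channels
        low = self.spatial_att(low)     # emphasize spatial structure
        fused = self.conv(low + high)
        return fused + low              # shortcut to limit information loss


if __name__ == "__main__":
    low = torch.randn(1, 256, 64, 64)   # low-level feature map
    high = torch.randn(1, 256, 32, 32)  # high-level feature map
    out = ResidualFusion(256)(low, high)
    print(out.shape)  # torch.Size([1, 256, 64, 64])
```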
