ABSTRACT Text detection in natural images is a challenging problem due to variations in text size, aspect ratio, alignment, and background complexity. This paper proposes a multiscale feature fusion convolutional neural network method to detect cursive and multi-language text in natural images. The proposed method combines VGG-16 features at multiple scales and from multiple layers, producing a new convolutional feature map that fuses shallow and deep layers. On top of this feature map, a vertical text proposal generation method produces fixed-size text proposals. A recurrent layer takes the convolutional features of each window as sequential input and updates its recurrent state internally in the hidden layers. The output of the recurrent layer is fed to two fully connected layers, which classify each region proposal as text or non-text and regress its bounding box. The model is evaluated on a custom-developed Urdu scene text dataset and the ICDAR-MLT17 Arabic text image dataset.
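The vertical proposal step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `vertical_proposals`, the stride of 16 pixels (the downsampling factor of the last VGG-16 convolutional block, which may differ for the fused feature map), and the candidate heights are all assumptions made for illustration. The sketch only shows the general idea of placing proposals of fixed width at every feature-map cell and varying their vertical extent.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in image pixel coordinates
Box = Tuple[float, float, float, float]


def vertical_proposals(
    feat_h: int,
    feat_w: int,
    stride: int = 16,                          # assumed feature-map stride
    heights: Tuple[int, ...] = (16, 32, 64),   # assumed candidate heights
) -> List[Box]:
    """Return fixed-width proposals centred on every feature-map cell.

    Hypothetical helper: each cell yields one proposal per candidate
    height, all sharing the same width (one stride).
    """
    boxes: List[Box] = []
    for row in range(feat_h):
        for col in range(feat_w):
            x_min = col * stride
            x_max = x_min + stride                 # width fixed to one stride
            y_centre = row * stride + stride / 2   # vertical centre of the cell
            for h in heights:
                boxes.append((x_min, y_centre - h / 2, x_max, y_centre + h / 2))
    return boxes


if __name__ == "__main__":
    proposals = vertical_proposals(feat_h=2, feat_w=3)
    print(len(proposals))   # 2 rows * 3 columns * 3 heights = 18
```

In a full detector, each proposal would then be scored as text or non-text and refined by the regression branch; those stages are omitted here.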