Abstract

Motivated by the success of deep architectures such as VGGNet and ResNet, trained on ImageNet, for object detection in images, we propose a novel architecture for text detection in video. Detecting text candidates is challenging because text differs from ordinary objects in its contours, connectivity, size, and scale, and is further degraded by motion, occlusion, low color contrast, and poor illumination. Moreover, existing architectures cannot be applied directly to the proposed design, as their targets and parameters are incompatible; working with video therefore requires a different path of learning and validation. The proposed architecture reads temporal data to train a sequence of learned features. These features are fed to a periodic connectionist (recurrent) network that learns successive features to obtain text candidates. The feature representations are then passed to a region proposal network, which compares them with the ground-truth data to obtain regions of interest; the text regions are pooled with bounding boxes and the probability of their occurrence is computed. Evaluated on the ICDAR 2013 "Text in Video" dataset of indoor and outdoor videos, the proposed architecture achieves high detection rates and outperforms approaches based on labeled features.
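To make the described pipeline concrete, the following is a minimal PyTorch sketch of the stages named in the abstract: per-frame convolutional features, a recurrent stage (one plausible reading of the "periodic connectionist" component), and a region-proposal-style head that scores text regions and regresses bounding boxes. The backbone layers, feature dimensions, anchor count, and all names here are illustrative assumptions; the abstract does not specify the paper's actual configuration.

```python
# A minimal sketch of the pipeline described in the abstract.
# Assumptions (not from the paper): a small generic CNN backbone, an LSTM
# as the recurrent ("periodic connectionist") stage, and an RPN-style head
# with 9 anchors. Layer sizes are illustrative only.
import torch
import torch.nn as nn

class VideoTextDetector(nn.Module):
    def __init__(self, feat_dim=256, num_anchors=9):
        super().__init__()
        # Per-frame convolutional feature extractor (hypothetical backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Recurrent stage: learns successive features across frames.
        self.rnn = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        # RPN-style head: per-anchor text score and bounding-box offsets.
        self.rpn_conv = nn.Conv2d(feat_dim, feat_dim, 3, padding=1)
        self.cls_head = nn.Conv2d(feat_dim, num_anchors, 1)      # text / not text
        self.reg_head = nn.Conv2d(feat_dim, num_anchors * 4, 1)  # box deltas

    def forward(self, clip):
        # clip: (batch, time, 3, H, W)
        b, t, c, h, w = clip.shape
        feats = self.backbone(clip.reshape(b * t, c, h, w))        # (b*t, d, h', w')
        d, hp, wp = feats.shape[1:]
        # Run the LSTM over time independently at each spatial location.
        seq = feats.reshape(b, t, d, hp * wp).permute(0, 3, 1, 2)  # (b, h'*w', t, d)
        seq = seq.reshape(b * hp * wp, t, d)
        out, _ = self.rnn(seq)
        last = out[:, -1].reshape(b, hp, wp, d).permute(0, 3, 1, 2)
        x = torch.relu(self.rpn_conv(last))
        # Probability of text occurrence per anchor, plus box regressions.
        return torch.sigmoid(self.cls_head(x)), self.reg_head(x)

# Usage: a 5-frame clip of 128x128 RGB frames.
scores, boxes = VideoTextDetector()(torch.randn(1, 5, 3, 128, 128))
print(scores.shape, boxes.shape)  # (1, 9, 32, 32) and (1, 36, 32, 32)
```

In this sketch the recurrent stage runs over time at every spatial location of the feature map, so the detection head sees temporally aggregated features rather than a single frame; whether the paper shares weights this way is an assumption.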
