<p><span lang="EN-GB">The detection and recognition of text in video frames have received considerable attention recently, as they enable many computer vision-based applications. In this study, we modify the image mask and the original detection of the mask region convolutional neural network so that detection operates at three levels: holistic, sequence, and pixel. Semantics at the pixel and holistic levels can be used to detect text and determine its shape. With masking and detection, character and word instances are separated and recognised. In addition, text is detected from the results of instance segmentation in a 2-D feature space. Moreover, we explore text recognition using an attention-based optical character recognition (OCR) method with mask</span><span lang="EN-US"> r</span><span lang="EN-GB">egion convolutional neural networks (R-CNN) to address the problem of small and blurry text at the sequence level. Using feature maps of the word instances, the OCR method predicts the character sequence in a sequence-to-sequence manner. Finally, a fine-grained learning strategy is proposed to build word-level models from the annotated datasets, resulting in a more precise and reliable model. The proposed method is evaluated on the well-known ICDAR 2013 and ICDAR 2015 benchmark datasets.</span></p>