Abstract
With the rapid development of Internet technology, breakthroughs have been made in all branches of computer vision. In particular, deep learning techniques such as convolutional neural networks have achieved excellent results in image detection and target tracking. To explore the applicability of machine learning in video text recognition and extraction, a YOLOv3 network based on multiscale feature transformation and migration fusion is proposed to improve the accuracy of English text detection in natural scenes in video. First, to address multiscale target detection in video key frames, the scale-transfer module of the STDN algorithm is applied on top of the YOLOv3 network to reduce the low-level feature maps, and a backbone network with feature reuse is constructed to extract features. Then, the scale-transfer module is used to enlarge the high-level feature maps, and a feature pyramid network (FPN) is built to predict targets. Finally, the improved YOLOv3 network is verified on extracting key text from images. The experimental results show that the improved YOLOv3 network effectively reduces the false detections and missed detections caused by occlusion and small targets, and the accuracy of English text extraction is markedly improved.
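The scale-transfer operations described above can be illustrated with a short, hedged sketch. The following PyTorch snippet is our own illustration, not code from the paper; the channel sizes and the scale factor r are assumptions. It shows how a pixel-shuffle style scale-transfer layer enlarges a high-level feature map and how its inverse reduces a low-level one:

import torch
import torch.nn as nn

r = 2  # scale factor (assumed)

# Enlarging a high-level feature map: pixel shuffle rearranges
# a (C*r^2, H, W) tensor into (C, H*r, W*r) without adding parameters.
upscale = nn.PixelShuffle(upscale_factor=r)

# Reducing a low-level feature map: the inverse rearrangement,
# (C, H*r, W*r) -> (C*r^2, H, W).
downscale = nn.PixelUnshuffle(downscale_factor=r)

high = torch.randn(1, 256, 13, 13)  # deep, semantically strong map (assumed shape)
low = torch.randn(1, 64, 52, 52)    # shallow, spatially detailed map (assumed shape)

print(upscale(high).shape)    # torch.Size([1, 64, 26, 26])
print(downscale(low).shape)   # torch.Size([1, 256, 26, 26])

Because both operations merely rearrange tensor elements, they add no learnable parameters, which is the usual motivation for scale-transfer layers over deconvolution.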
Highlights
As the mainstream medium of today's media industry, images and videos are rich in information and easy to understand, which makes them an indispensable part of daily life
Conventional object detection methods from the visual field (SSD, You Only Look Once (YOLO), Faster R-CNN, etc.) are not ideal when applied directly to English text detection tasks. The main reason is that, compared with conventional objects, the length of text lines and their length-to-width ratios vary widely. Therefore, after analyzing the YOLO series of networks, we propose a new multiscale feature fusion method, which improves the performance of the YOLOv3 network
To construct features with multiscale characteristics and rich expressive ability, we introduce a feature scale transformation and migration fusion method to improve the traditional YOLOv3 network, as sketched below
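As a rough illustration of the fusion step (a minimal sketch under assumed layer widths; the FusionNeck class and all channel counts are hypothetical, not taken from the paper), the rescaled low-level and high-level maps can be concatenated with a mid-level map and fused by a 1x1 convolution before a YOLO-style prediction head:

import torch
import torch.nn as nn

class FusionNeck(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.PixelUnshuffle(2)  # low-level: (64, 52, 52) -> (256, 26, 26)
        self.up = nn.PixelShuffle(2)      # high-level: (256, 13, 13) -> (64, 26, 26)
        # a 1x1 convolution fuses the concatenated maps into one prediction feature
        self.fuse = nn.Conv2d(256 + 128 + 64, 256, kernel_size=1)

    def forward(self, low, mid, high):
        # bring all three maps to a common 26x26 resolution, then concatenate
        x = torch.cat([self.down(low), mid, self.up(high)], dim=1)
        return self.fuse(x)

neck = FusionNeck()
fused = neck(torch.randn(1, 64, 52, 52),
             torch.randn(1, 128, 26, 26),
             torch.randn(1, 256, 13, 13))
print(fused.shape)  # torch.Size([1, 256, 26, 26])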
Summary
As the mainstream part of today's media industry, images and videos are rich in information and easy to understand, which makes them an indispensable part of life. Character recognition has great application value in many scenarios, such as vehicle license plate detection, image-text conversion, image content translation, and image search. Because the precision of current text recognition technology is not ideal, its application scenarios remain relatively simple, such as content search in images [1,2,3,4,5,6]. Characters are easy to extract and have strong descriptive ability, so understanding the semantic information of characters in images is an urgent problem to be solved. Text in images falls into two kinds: one kind is text in the natural scene of the image [13,14,15], such as license plate numbers and bus stop sign text [16], and the other kind is artificially added text, such as movie subtitles, advertising information, and medical image analysis text [17]. Therefore, all words should be extracted except those repeated within a short time delay. The final result of the recognition system can only be determined if the images and characters have good detection performance, so text detection is the key research content of this paper