Abstract

Text extraction is a key step in character recognition, and its accuracy depends heavily on how well the text region is located. In this paper, we propose a new method that locates text automatically, addressing region-level problems in low-contrast image text extraction such as incomplete regions, false positions, and orientation deviation. First, we pre-process the original image using a color space transform, contrast-limited adaptive histogram equalization, the Sobel edge detector, morphological operations, and the eight-neighborhood processing method (ENPM), among others, and compare the results of the different methods. Second, we apply connected component analysis (CCA) to obtain connected and non-connected parts, then apply morphological operations and CCA again to the non-connected parts to erode noise and obtain a further set of connected and non-connected parts. Third, we compute edge features for all connected areas and use a Support Vector Machine (SVM) to classify the real text regions and obtain their location coordinates. Finally, we use the text region coordinates to extract the blocks containing text, then binarize, cluster, and recognize the text. We evaluate the method on more than 200 images by calculating the precision rate and recall rate. The experiments show that the proposed method is robust for low-contrast text images with variations in font size, font color, language, and lighting conditions such as gloomy environments.
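As a rough illustration of the connected component step, the sketch below labels foreground regions in a binary mask and returns their bounding boxes. It is a minimal pure-Python example, not the implementation used in the paper: the function name `connected_component_boxes`, the list-of-lists mask format, and the choice of 8-connectivity are all assumptions made for illustration.

```python
from collections import deque


def connected_component_boxes(mask):
    """Return (top, left, height, width) for each 8-connected foreground region.

    `mask` is a list of rows; truthy entries are foreground (hypothetical format).
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, bottom, left, right = r, r, c, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                # Visit all eight neighbours (assumed connectivity).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom - top + 1, right - left + 1))
    return boxes
```

Each returned box gives the row and column extent of one candidate region, which is the kind of measurement a later size or shape filter could use.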

Highlights

  • 1) First Layer Judge (FLJ): After reading 200 text images and using the above method to obtain a large number of connected component regions, we observed the differences between text regions and non-text regions. We found that regions whose number of rows or columns is less than 22, or whose row-to-column ratio is more than 10 or less than 0.1, are generally non-text regions
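The First Layer Judge rule can be sketched as a simple predicate on a region's size. This is a minimal illustration under stated assumptions: "row" and "column" are read as the region's height and width in pixels, and the function name `is_non_text_by_flj` and its parameter names are hypothetical. Only the thresholds 22, 10, and 0.1 come from the text.

```python
def is_non_text_by_flj(rows, cols, min_side=22, max_ratio=10.0, min_ratio=0.1):
    """Return True if a region is rejected as non-text by the size/ratio rule.

    rows, cols: region height and width in pixels (assumed interpretation).
    """
    # Regions that are too small in either dimension are rejected.
    if rows < min_side or cols < min_side:
        return True
    # Regions that are extremely tall or extremely wide are rejected.
    ratio = rows / cols
    return ratio > max_ratio or ratio < min_ratio


# Usage: keep only candidate regions that pass the first-layer check.
candidates = [(10, 100), (30, 200), (30, 400), (300, 25)]
kept = [size for size in candidates if not is_non_text_by_flj(*size)]
```

Regions that pass this check are only candidates; according to the abstract, the final decision on text regions uses edge features and an SVM.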


Summary

Introduction

There are many possible sources of variation when extracting text from a shaded or textured background, from low-contrast or complex images, or from images with variations in font size, style, color, orientation, and alignment. In [4], Wei Fan presented a novel text segmentation method that is independent of variations in text font style, size, intensity, and polarity, and of string orientation, by separating the pixels of a document image into four categories: “dark text/lines”, “bright text/lines”, “dark figure/graphics”, and “white background”. This method is only valid for text embedded in a simple background. To solve the problem of text recognition in low-contrast color images, we propose a new method for positioning the text region automatically. This method includes an image color space transform, edge detection, image enhancement, morphological operations, connected component analysis, SVM classification, etc.

Text Region Positioning
Step 1
Step 2
Step 3
Step 4
Text Extraction and Recognition
Evaluation
Conclusion