Text extraction is the key step in the character recognition; its accuracy highly relies on the location of the text region. In this paper, we propose a new method which can find the text location automatically to solve some regional problems such as incomplete, false position or orientation deviation occurred in the low-contrast image text extraction. Firstly, we make some pre-processing for the original image, including color space transform, contrast-limited adaptive histogram equalization, Sobel edge detector, morphological method and eight neighborhood processing method (ENPM) etc., to provide some results to compare the different methods. Secondly, we use the connected component analysis (CCA) method to get several connected parts and non-connected parts, then use the morphology method and CCA again for the non-connected part to erode some noises, obtain another connected and non-connected parts. Thirdly, we compute the edge feature for all connected areas, combine Support Vector Machine (SVM) to classify the real text region, obtain the text location coordinates. Finally, we use the text region coordinate to extract the block including the text, then binarize, cluster and recognize all text information. At last, we calculate the precision rate and recall rate to evaluate the method for more than 200 images. The experiments show that the method we proposed is robust for low-contrast text images with the variations in font size and font color, different language, gloomy environment, etc.
Read full abstract