Text Line Images Research Articles

Optical Character Recognition (OCR) is considered one of the fastest methods of data entry. OCR converts a text image, represented as pixel information at x and y coordinates, into text data in a particular language. As a field of pattern recognition and document image understanding, OCR becomes more challenging when the image contains text in a different language. Differences in script pose different challenges for OCR and require entirely different approaches and algorithms: Latin scripts call for one approach, whereas languages that have adopted the Arabic script call for another. Various solutions have therefore been proposed for different languages. Segmentation is one of the important tasks in the OCR process, and good segmentation increases OCR accuracy. It first separates text lines from the text image, then divides the lines into words, and finally divides the words into the characters to be recognized. This study proposes a single segmentation algorithm for multiple scripts: it checks the style, direction and other properties of the script and then adapts the segmentation process to segment the text lines, words and characters of the language. The proposed algorithm segments more than ten languages in three scripts for their OCRs. The segmented images can then be passed on to feature extraction and classification, which makes OCR for the selected languages easier. Experiments were run on multiple scripts, languages and images, and the proposed algorithm successfully segmented 32,833 text line, word and character images. The algorithm achieves 97% accuracy on these images and can be extended to further languages and scripts.
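
The abstract does not say how the line and word splits are computed. As an illustration only, the sketch below uses a plain projection-profile approach, which is a common baseline and not necessarily the method of the paper. It assumes a binarized image given as a list of rows with 1 for ink and 0 for background; the function names and the `min_gap` parameter are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of 0 (background) / 1 (ink); assumed binarized


def projection_profile(image: Image, axis: int) -> List[int]:
    """Count ink pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def find_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) spans of non-empty positions, end exclusive.

    Spans separated by fewer than min_gap empty positions are merged,
    so narrow gaps between characters do not split a word.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def segment_lines(image: Image) -> List[Tuple[int, int]]:
    """Split the page into text lines using the row profile."""
    return find_runs(projection_profile(image, axis=0))


def segment_words(image: Image, line: Tuple[int, int],
                  min_gap: int = 2) -> List[Tuple[int, int]]:
    """Split one text line into words using the column profile."""
    top, bottom = line
    band = image[top:bottom]
    return find_runs(projection_profile(band, axis=1), min_gap=min_gap)


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
lines = segment_lines(page)
words_per_line = [segment_words(page, line) for line in lines]
```

The spans come back in left-to-right column order. For a right-to-left script the same spans would be read in reverse, which is one place where a script-direction check like the one the abstract describes could plug in. Cursive scripts, where characters within a word are connected, generally need more than a column profile for the character-level split.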

Articles published on Text Line Images

Optical Character Recognition of Balochi Script

Generalized Segmentation Algorithm for Dissimilar Script Languages

Efficient CRNN: Towards end-to-end low resource Urdu text recognition using depthwise separable convolutions and gated recurrent units

Discrete representation learning for handwritten text recognition

Content and Style Aware Generation of Text-Line Images for Handwriting Recognition

Textline alignment on the image domain

PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition

Few shots are all you need: A progressive learning approach for low resource handwritten text recognition

Attention-based CNN-ConvLSTM for Handwritten Arabic Word Extraction

New Deep Spatio-Structural Features of Handwritten Text Lines for Document Age Classification

Khmer printed character recognition using attention-based Seq2Seq network

PHTI: Pashto Handwritten Text Imagebase for Deep Learning Applications

Text Line Recognition of Dai Language using Statistical Characteristics of Texture Analysis and Deep Gaussian Process

Zone-based keyword spotting in Bangla and Devanagari documents

Amharic OCR: An End-to-End Learning

Clustering-based word segmentation from off-line handwritten Uyghur text-line images

Text recognition in document images obtained by a smartphone based on deep convolutional and recurrent neural network

Text baseline detection, a single page trained system

SCUT-EPT: New Dataset and Benchmark for Offline Chinese Text Recognition in Examination Paper
