Abstract

Document segmentation is a method of dividing a document into distinct regions. A document is a collection of information and a standard means of conveying that information to others. Extracting data from documents by hand takes a great deal of human effort and time, and can severely limit the use of information systems. Automatic information extraction from documents has therefore become an important problem, and document segmentation has been shown to help address it. This paper proposes a new approach to segment and classify document regions as text, image, drawing, and table. The document image is divided into blocks using the run length smearing algorithm (RLSA), and features are extracted from every block. The Discipulus tool was used to construct a genetic programming based classifier model, which achieved 97.5% classification accuracy.

Highlights

  • Document segmentation is defined as a method of subdividing a document into text and non-text regions

  • Document segmentation is a multiclass classification problem. It is solved by extending binary classification to the multiclass case using the one-against-one method

  • This paper models document segmentation as a classification task and describes a genetic programming approach for classifying the various regions
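The one-against-one extension mentioned in the highlights can be sketched as pairwise voting. In this minimal illustration the binary classifiers are hypothetical stand-ins (simple score comparisons), not the genetic programming models built in Discipulus, and the tie-breaking rule (earlier class in the list wins) is an assumption.

```python
from collections import Counter
from itertools import combinations

CLASSES = ["text", "image", "drawing", "table"]


def build_toy_classifiers(classes):
    """Hypothetical stand-ins for trained binary models: one per pair of classes."""
    def make(a, b):
        # Picks whichever of the two classes has the higher score in the sample.
        return lambda scores: a if scores[a] >= scores[b] else b
    return {(a, b): make(a, b) for a, b in combinations(classes, 2)}


def predict_one_against_one(sample, classifiers, classes):
    """Let every pairwise classifier vote and return the class with the most votes."""
    votes = Counter(clf(sample) for clf in classifiers.values())
    # Ties are broken in favour of the class listed first (an assumption).
    return max(classes, key=lambda c: (votes[c], -classes.index(c)))


if __name__ == "__main__":
    classifiers = build_toy_classifiers(CLASSES)
    sample = {"text": 0.1, "image": 0.3, "drawing": 0.2, "table": 0.9}
    print(len(classifiers), predict_one_against_one(sample, classifiers, CLASSES))
```

With four region classes the scheme needs 4 × 3 / 2 = 6 binary classifiers, each trained only on the samples of its two classes.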

Summary

INTRODUCTION

Document segmentation is defined as a method of subdividing a document into text and non-text regions. This work combines the existing features specified in [4] [6] [8] with a few proposed features that contribute further to document segmentation: perimeter/height ratio, energy, and entropy. A block in a document image is a connected component, defined as a collection of black runs that are 8-connected. Both the perimeter and the height vary from block to block. Energy and entropy also differ from block to block for table, drawing, and image blocks. These new features have a notable influence on document segmentation.
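The three proposed features can be illustrated with a small sketch. The exact definitions are not given in this summary, so the ones used here are assumptions: energy is the sum of squared gray-level probabilities, entropy is the Shannon entropy of the gray-level histogram, perimeter is the number of black pixels with at least one white or out-of-bounds 4-neighbour, and height is the vertical extent of the black pixels in the block.

```python
import math
from collections import Counter


def energy_and_entropy(gray_pixels):
    """Histogram-based energy and Shannon entropy of a block (assumed definitions)."""
    counts = Counter(gray_pixels)
    total = len(gray_pixels)
    probs = [count / total for count in counts.values()]
    energy = sum(p * p for p in probs)
    entropy = -sum(p * math.log2(p) for p in probs)
    return energy, entropy


def perimeter_height_ratio(mask):
    """Boundary pixel count divided by block height, for a binary mask (1 = black)."""
    rows, cols = len(mask), len(mask[0])

    def is_black(r, c):
        return 0 <= r < rows and 0 <= c < cols and mask[r][c] == 1

    perimeter = 0
    black_rows = set()
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1:
                continue
            black_rows.add(r)
            neighbours = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            # A black pixel is on the boundary if any 4-neighbour is not black.
            if not all(is_black(nr, nc) for nr, nc in neighbours):
                perimeter += 1
    if not black_rows:
        return 0.0
    height = max(black_rows) - min(black_rows) + 1
    return perimeter / height


if __name__ == "__main__":
    print(energy_and_entropy([0, 0, 255, 255]))
    print(perimeter_height_ratio([[1, 1, 1], [1, 1, 1], [1, 1, 1]]))
```

Under these definitions a block with a single gray level has energy 1 and entropy 0, while a block with many gray levels has low energy and high entropy.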

PROPOSED MODEL FOR DOCUMENT SEGMENTATION AND REGION CLASSIFICATION
Preprocessing
Segmentation using RLSA algorithm
Labeling connected components
Feature Extraction
LINEAR GENETIC PROGRAMMING-GENETIC PROGRAMMING MODEL
GP algorithm
GP operators
Classification Using Genetic Programming
Fitness function for classification
Linear Genetic Programming
EXPERIMENT AND RESULTS
One Against One
Training and testing in Discipulus
CONCLUSION
