Abstract

Sentence extraction is a new, challenging, and critical step in processing scanned images of printed documents. This paper proposes an efficient four-layered Sentence Detection and Extraction System (SDES) model designed to detect and extract sentences from machine-printed document images. Its internal details and architecture show how it processes an image to find the underlying sentences. The basic idea is to first preprocess the document image for noise removal and skew correction, and then to detect and segment textual entities at the page, line, and word levels. Horizontal and vertical projection profiles are first taken to segment and separate the lines and words. After skew correction, two-stage Character-Based and Word-Based leveled matching and testing is performed, which verifies and identifies the correct characters and words by searching for similar characters and words in the Character Set Storage (CSS) and the Word Pseudo Thesaurus (WPT). If a word pattern is not matched and identified by the WPT, it is stored in the Unmatched Word Storage (UWS) for future reference. Testing and verifying at two levels increases the percentage accuracy of SDES and thereby reduces errors, which greatly improves system performance. Finally, all the sentences of the document image are extracted. Experimental results are reported at the character, word, and sentence levels. The accuracy percentages obtained at each level are good, indicating high system performance and efficiency.
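
The projection-profile step described above can be illustrated with a minimal sketch. It is not the SDES implementation: it assumes the page has already been binarized, denoised, and deskewed into a list of rows holding 0 (background) or 1 (ink), and the names `runs`, `segment`, and `word_gap` are hypothetical, chosen only for this example. Rows with a non-zero horizontal profile are grouped into text lines, and within each line the vertical profile is split into words wherever a gap of at least `word_gap` empty columns occurs.

```python
def runs(profile, min_gap=1):
    """Return (start, end) spans of non-zero entries, end exclusive.

    Runs of zeros shorter than min_gap are bridged, so small gaps
    (e.g. between characters) do not split a span.
    """
    spans = []
    start = None  # index where the current span began, if any
    gap = 0       # number of consecutive zeros seen inside the span
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The span ended where the run of zeros began.
                spans.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close a span that reaches the end, dropping trailing zeros.
        spans.append((start, len(profile) - gap))
    return spans


def segment(image, word_gap=2):
    """Split a binary image into lines, then each line into words.

    Returns a list of ((top, bottom), [(left, right), ...]) entries,
    with all end coordinates exclusive.
    """
    # Horizontal projection profile: ink count per row.
    h_profile = [sum(row) for row in image]
    result = []
    for top, bottom in runs(h_profile):
        band = image[top:bottom]
        # Vertical projection profile: ink count per column in this line.
        v_profile = [sum(col) for col in zip(*band)]
        result.append(((top, bottom), runs(v_profile, min_gap=word_gap)))
    return result


# Toy 5x8 page: two text lines; the first contains two words.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
]
print(segment(page))
```

On a real scan, the gap threshold would likely need to be derived from the measured character spacing rather than fixed, since a single-column gap separates characters while a wider gap separates words.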
