Abstract
In document layout analysis, the defining conditions for textlines and text regions involve numerical parameters (e.g. inter-character spacing and inter-textline spacing) whose values can only be estimated once textlines and text regions have already been formed. This apparent chicken-and-egg problem can be solved through an adaptive regrouping strategy, which consists of three operations. First, we group basic ingredients into preliminary textlines and text regions according to crude parameter values. Second, we refine our estimates of the parameter values based on the groups thus formed. Third, we form new groups by splitting and merging existing groups based on the newly estimated values. This paper applies the strategy to Chinese documents, whose complexity derives from the coexistence of horizontal and vertical textlines. On a test database of over one thousand document samples covering various layout structures, the accuracy rates for identifying text regions and textlines are above 98%.
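The three operations can be illustrated with a minimal one-dimensional sketch. This is not the paper's implementation: it assumes characters are given as `(start, end)` intervals along a single axis, and the names `group_by_gap`, `estimate_gap`, `adaptive_regroup`, the median estimator, and the `factor` multiplier are all hypothetical choices made for illustration. In one dimension, regrouping from scratch with the refined threshold has the same effect as splitting and merging the existing groups, so the third operation is written that way here.

```python
from statistics import median


def group_by_gap(boxes, max_gap):
    """Group (start, end) intervals; start a new group when the gap exceeds max_gap."""
    groups = []
    for box in sorted(boxes):
        if groups and box[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(box)
        else:
            groups.append([box])
    return groups


def estimate_gap(groups):
    """Median gap between consecutive boxes in the same group (None if no gaps exist)."""
    gaps = [b[0] - a[1] for g in groups for a, b in zip(g, g[1:])]
    return median(gaps) if gaps else None


def adaptive_regroup(boxes, crude_gap, factor=2.0, max_rounds=5):
    # Operation 1: preliminary grouping with a crude parameter value.
    groups = group_by_gap(boxes, crude_gap)
    for _ in range(max_rounds):
        # Operation 2: refine the spacing estimate from the current groups.
        typical = estimate_gap(groups)
        if typical is None:
            break
        # Operation 3: regroup with the refined threshold (assumed: factor * typical).
        new_groups = group_by_gap(boxes, factor * typical)
        if new_groups == groups:
            break
        groups = new_groups
    return groups


# Six boxes of width 10: gaps of 2 inside each run, a gap of 8 between the runs.
boxes = [(0, 10), (12, 22), (24, 34), (42, 52), (54, 64), (66, 76)]
result = adaptive_regroup(boxes, crude_gap=10)
# The crude threshold merges everything; the refined threshold (2 * 2 = 4)
# splits the result into two groups of three boxes each.
```

A real layout analyzer would work in two dimensions and handle horizontal and vertical textlines separately; the sketch only shows how a crude initial threshold can be corrected by estimates taken from the groups it produces.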