Abstract

Most word recognition systems rely on a predefined lexicon to achieve high performance. Recently, the availability of training and testing data has made it possible to include a very large number of words in the lexicon. However, computational cost rises as the lexicon grows. In addition, adding more word classes may increase the burden on classification methods and degrade the recognition rate. In this work, we propose a holistic word descriptor for lexicon reduction in Arabic handwritten documents. The proposed descriptor represents geometrical features of the word shape through three main feature sets, derived from multi-scale convexity/concavity analysis. The first two sets capture the number of convexity/concavity peaks and their intensity levels, respectively. The third set defines region codes for the peaks by analyzing their regions according to their spatial information. Given a query word and a lexicon (reference dataset), the lexicon reduction system first computes the holistic word descriptor for the query word and for each word in the lexicon. The lexicon is then indexed according to the distance between each entry's descriptor and the query word descriptor. Finally, the reduced lexicon is formed from the first k entries of the indexed lexicon. The proposed system has been evaluated on two well-known Arabic datasets, namely Ibn Sina and IFN/ENIT. Reported results show superior performance compared to prior art, with 93.7% and 91.2% reduction efficacy for Ibn Sina and IFN/ENIT, respectively.
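The ranking step described above (compute a descriptor per word, order the lexicon by distance to the query, keep the first k entries) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the distance measure or the descriptor layout, so Euclidean distance and the three-component toy vectors are placeholders, and the names `reduce_lexicon`, `lexicon`, and `query` are hypothetical. The convexity/concavity descriptor itself is not implemented here.

```python
import math
from typing import Dict, List, Sequence, Tuple

Descriptor = Sequence[float]


def reduce_lexicon(
    query: Descriptor, lexicon: Dict[str, Descriptor], k: int
) -> List[Tuple[str, float]]:
    """Return the k lexicon entries closest to the query descriptor.

    Assumption: Euclidean distance stands in for whatever distance
    the paper actually uses between holistic word descriptors.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Distance from the query to every lexicon entry (computed once each).
    scored = [(word, math.dist(query, desc)) for word, desc in lexicon.items()]
    # Index the lexicon by ascending distance, then keep the first k entries.
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy descriptors (hypothetical values, not taken from the paper).
lexicon = {
    "word_a": (3.0, 0.80, 2.0),
    "word_b": (5.0, 0.40, 1.0),
    "word_c": (3.0, 0.70, 2.0),
    "word_d": (1.0, 0.20, 4.0),
}
query = (3.0, 0.72, 2.0)

reduced = reduce_lexicon(query, lexicon, k=2)
for word, distance in reduced:
    print(f"{word}: {distance:.3f}")
```

A smaller k reduces the lexicon more aggressively but raises the risk that the correct word is dropped, so in practice k would be tuned against the recognizer's accuracy.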
