Abstract

Automatically accessing information from unconstrained image documents has important applications in business and government operations. These real-world applications typically combine optical character recognition (OCR) with language and information technologies, such as machine translation (MT) and keyword spotting. OCR output contains errors and presents unique challenges to downstream processing. This paper addresses two of these challenges: (1) translating the output of Arabic handwriting OCR, which lacks reliable sentence boundary markers, and (2) searching for named entities that do not exist in the OCR vocabulary and are therefore completely missing from Arabic handwriting OCR output. We address these challenges by leveraging natural language processing technologies, specifically conditional random field (CRF)-based sentence boundary detection and out-of-vocabulary (OOV) name detection. This approach significantly improves our state-of-the-art MT system and achieves MT scores close to those obtained with human segmentation. The output of OOV name detection was used as a novel feature for discriminative reranking, which significantly reduced the false alarm rate of OOV name search on OCR output. Our experiments also show substantial performance gains from integrating a variety of features from multiple resources, such as linguistic analysis, image layout analysis, and image text recognition.
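To make the CRF-based sentence boundary detection mentioned above concrete, the following is a minimal sketch of token-level boundary tagging with a linear-chain CRF. The feature set, the toy data, and the use of the sklearn-crfsuite toolkit are illustrative assumptions for exposition only, not the feature design or implementation described in this paper.

```python
# Minimal sketch: tag each token as 'B' (sentence boundary follows) or 'O'
# using a linear-chain CRF. Features and toolkit are assumptions, not the
# paper's actual system.
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple per-token features for boundary tagging (assumed feature set)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_punct": tok in ".!?\u061F",  # includes the Arabic question mark
        "rel_position": i / max(len(tokens) - 1, 1),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

def featurize(examples):
    """examples: list of (tokens, labels); labels are 'B' or 'O' per token."""
    X = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in examples]
    y = [labels for _, labels in examples]
    return X, y

# Toy training data standing in for OCR token streams with gold boundary labels.
train = [
    (["he", "arrived", "today", "we", "met", "him"], ["O", "O", "B", "O", "O", "B"]),
    (["she", "left", "early", "they", "stayed", "late"], ["O", "O", "B", "O", "O", "B"]),
]

X_train, y_train = featurize(train)
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)

# Predict boundary tags for a new, unsegmented token stream.
test_tokens = ["we", "saw", "them", "he", "waved"]
print(crf.predict([[token_features(test_tokens, i) for i in range(len(test_tokens))]]))
```

In a full system, the predicted boundary tags would be used to segment the OCR token stream into sentence-like units before they are passed to the MT system.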
