Authors' names extraction from scanned documents

Manabu Ohta, Atsuhiro Takasu, Takayuki Yakushi, Shun Yamasaki

doi:10.1109/icdim.2007.4444202

Abstract

Authors' names are a critical bibliographic element when searching or browsing academic articles stored in digital libraries. However, extracting such bibliographic data from printed documents requires human intervention and is therefore not cost-effective, even with document image-processing techniques such as optical character recognition (OCR). In this paper, we describe an automatic authors' names extraction method for academic articles that have been scanned and marked up by OCR. The proposed method first uses layout analysis to extract authors' blocks, which contain the characters assumed to be author names or delimiters, and then uses a specifically designed hidden Markov model (HMM) to label the unsegmented character strings in each block as belonging to either an author or a delimiter. We applied the proposed method to Japanese academic articles. The experiments showed that, with manual tuning, the proposed method correctly extracted more than 99% of authors' blocks, and that the proposed HMM correctly labeled more than 95% of the author name strings.
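The labeling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: it is a generic two-state HMM decoded with the standard Viterbi algorithm, and the state names, the delimiter set, every probability, and the sample string are hypothetical values chosen by hand for illustration. It assumes the authors' block has already been extracted as a plain character string.

```python
import math

# Hypothetical two-state model: each character is either part of an
# author name or part of a delimiter. All numbers are illustrative only.
STATES = ("author", "delimiter")
DELIMS = set(",，、・")  # assumed delimiter characters

START = {"author": 0.9, "delimiter": 0.1}
TRANS = {
    "author": {"author": 0.8, "delimiter": 0.2},
    "delimiter": {"author": 0.7, "delimiter": 0.3},
}


def emit(state, ch):
    """Toy emission probability: delimiter characters favor the delimiter state."""
    is_delim = ch in DELIMS
    if state == "delimiter":
        return 0.9 if is_delim else 0.1
    return 0.05 if is_delim else 0.95


def viterbi(text):
    """Return the most likely state label for each character of text."""
    if not text:
        return []
    # Log-scores of the best path ending in each state at the first character.
    scores = [{s: math.log(START[s]) + math.log(emit(s, text[0])) for s in STATES}]
    backptrs = []
    for ch in text[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(emit(s, ch))
            ptr[s] = best
        scores.append(cur)
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def extract_names(text):
    """Group consecutive characters labeled 'author' into name strings."""
    names, current = [], []
    for ch, label in zip(text, viterbi(text)):
        if label == "author":
            current.append(ch)
        elif current:
            names.append("".join(current))
            current = []
    if current:
        names.append("".join(current))
    return names


if __name__ == "__main__":
    # Made-up sample block, not taken from the paper's data.
    print(extract_names("山田太郎、鈴木花子"))
```

In this toy setting the delimiter set alone would be enough to split the string; the point of an HMM is that transition and emission probabilities learned from data can resolve cases where a character is ambiguous or misrecognized by OCR, which a fixed delimiter list cannot do.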