Abstract

Convolutional neural networks (CNNs) are effective for image classification, and deeper CNNs are increasingly used to improve classification performance. Indeed, as the need to search vast printed document image collections grows, powerful CNNs have been adopted in place of conventional image processing. However, the better performance of deep CNNs comes at the expense of computational complexity. Is the additional training effort required by deeper CNNs worth the improvement in performance? Or could a shallow CNN coupled with conventional image processing (e.g., binarization and consolidation) outperform deeper CNN-based solutions? We investigate the performance gaps among shallow (LeNet-5, -7, and -9), deep (ResNet-18), and very deep (ResNet-152, MobileNetV2, and EfficientNet) CNNs on noisy printed document images, e.g., historical newspapers and document images in the RVL-CDIP repository. Our investigation considers two classification tasks: (1) identifying poems in historical newspapers and (2) classifying 16 document types in document images. Empirical results show that a shallow CNN coupled with computationally inexpensive preprocessing can achieve robust performance with significantly fewer training samples; that deep CNNs coupled with preprocessing can outperform very deep CNNs effectively and efficiently; and that aggressive preprocessing is not helpful, as it can remove potentially useful information from document images.
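The paper does not include code, but the sketch below illustrates the kind of computationally inexpensive preprocessing the abstract refers to: Otsu binarization followed by a morphological closing as a simple form of consolidation. The function name, kernel size, and use of OpenCV are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: Otsu binarization followed by a morphological
# closing as a simple "consolidation" step. Function name, kernel size, and
# the choice of OpenCV are assumptions, not the authors' implementation.
import cv2
import numpy as np

def preprocess_document(image_path: str, kernel_size: int = 3) -> np.ndarray:
    """Binarize a noisy document image and consolidate fragmented ink strokes."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Otsu's method selects a global threshold from the grayscale histogram.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Closing (dilation then erosion) merges nearby foreground components,
    # a computationally inexpensive way to consolidate broken strokes.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
```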

Highlights

  • Convolutional neural networks (CNNs), inspired by biological visual processes, have been popularly and successfully applied as a type of deep learning network in image-related classification approaches for generic images (e.g., hyperspectral images,[1–3] scenes,[4,5] plant images,[6,7] and graphic images[8–12]) and image-related denoising approaches (e.g., Gaussian noise,[13,14] rain effects,[15] snow effects,[16] and general frameworks[17]).


  • To gain more generalizable insights, we use two classification tasks: (1) a binary poem classification task, in which a CNN is trained to determine whether a document image snippet is a poem, using the Aida-17k[75] dataset, and (2) a 16-class document type classification task,[21] in which a CNN is trained to label document images as one of 16 classes (letter, memo, email, file-folder, form, handwritten, invoice, advertisement, budget, news article, presentation, scientific publication, questionnaire, resume, scientific report, and specification), using the RVL-CDIP[21] dataset; a model sketch for both tasks follows below.
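As an illustration of how the two tasks differ only in the size of the output layer, the following sketch defines a LeNet-5-style shallow CNN in PyTorch and instantiates it once per task. The 32 × 32 grayscale input resolution, channel counts, and layer widths are assumptions for illustration, not the configuration reported in the paper.

```python
# LeNet-5-style shallow CNN sketch (PyTorch). The 1 x 32 x 32 grayscale input,
# channel counts, and layer widths are illustrative assumptions; only the
# output dimension changes between the two tasks.
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # -> 6 x 14 x 14
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # -> 16 x 5 x 5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

poem_model = LeNet5(num_classes=2)      # task 1: poem vs. non-poem snippets
doctype_model = LeNet5(num_classes=16)  # task 2: 16 RVL-CDIP document types
```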



Introduction

Convolutional neural networks (CNNs), inspired by biological visual processes, have been popularly and successfully applied as a type of deep learning network in image-related classification approaches for generic images (e.g., hyperspectral images,[1–3] scenes,[4,5] plant images,[6,7] and graphic images[8–12]) and image-related denoising approaches (e.g., Gaussian noise,[13,14] rain effects,[15] snow effects,[16] and general frameworks[17]). There have been significantly more results and findings for CNN-based approaches on generic images (e.g., picture-based or graphic images) than on document images, facilitated by highly competitive challenges, such as CIFAR,[18] ImageNet,[19] and MNIST,[20] that have comprehensively compared both deep and shallow architectures. Findings on the application of CNNs to generic images do not necessarily generalize to document images[39–41] because these two types of images are very different. The role of color-based visual cues, on which approaches for generic images rely, is diminished in document images. Another property often found in document images is a denser structural layout, which makes document images more susceptible to degradation.
