Electronic document management is an essential workflow within every successful ERP implementation. Feeding documents into their respective processing pipelines (e.g., OCR, data extraction) inside the ERP system usually requires a prior classification step to improve the success rate. Unfortunately, because business documents (e.g., invoices, checks, delivery forms) vary in type, size, and layout, classifying them is a challenging task that may require additional data for model training. This paper investigates the transfer learning paradigm, using different pre-trained deep models to extract useful features from scanned document images. Machine learning classifiers, such as Logistic Regression (LR), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Gaussian Naive Bayes (GNB), then process the extracted features for classification. We compare the performance of the resulting models on various metrics. To address over-fitting and dataset imbalance, we run a cross-validation procedure with different numbers of folds (4, 6, and 8) to assess the models’ generalization ability. We also examine the effect of dimensionality reduction techniques, such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), on overall performance and execution time. The best classification rate, 97.83%, is achieved by combining LR, LDA, and the DenseNet121 deep model. Despite the small dataset (546 images), this excellent performance encourages integrating the approach into an ERP system as a separate document preprocessing module for ERP users.
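The cross-validation step with 4, 6, and 8 folds on an imbalanced set of 546 images can be illustrated with a minimal sketch. It assumes a stratified split, which the abstract does not confirm, and the per-class counts are hypothetical; only the total of 546 and the example document types come from the text. The helper name `stratified_folds` is also an illustration, not the authors' code.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal each class round-robin, continuing where the previous
        # class stopped, so total fold sizes also stay balanced.
        for i, idx in enumerate(members):
            folds[(offset + i) % k].append(idx)
        offset += len(members)
    return folds


# Hypothetical, imbalanced class counts (assumption); they sum to 546.
CLASS_COUNTS = {"invoice": 300, "check": 150, "delivery_form": 96}
labels = [name for name, n in CLASS_COUNTS.items() for _ in range(n)]

for k in (4, 6, 8):
    folds = stratified_folds(labels, k)
    sizes = [len(fold) for fold in folds]
    first_fold_mix = Counter(labels[i] for i in folds[0])
    print(f"k={k}: fold sizes={sizes}, first fold classes={dict(first_fold_mix)}")
```

Each fold serves once as the held-out test set while the remaining folds train the classifier. Keeping the class mix similar across folds is one way to limit the effect of imbalance on the reported scores.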