Abstract
Objectives: To develop a desktop application that automatically classifies a document into the accreditation area to which it belongs. Specifically, the study aims to: a) create a predictive model for the document classification task; b) design and develop an application that classifies documents by accreditation area; and c) evaluate the performance of the automatic document classification.
Methods: We introduce an approach for the automatic classification of accreditation documents. The approach includes scanned or captured documents in the classification task by extracting their text with Optical Character Recognition (OCR); preprocesses the text using TF-IDF (term frequency-inverse document frequency) with stopword removal and n-grams of length 1 to 2; and builds the model with a Naive Bayes classifier using additive (Laplace/Lidstone) smoothing.
Results: Accuracy, precision, recall, and F1 score were computed to evaluate the approach. The results showed 82% accuracy, 84% precision, 82% recall, and an 82% F1 score. These results indicate that combining OCR for text extraction, TF-IDF for text preprocessing, and a Naive Bayes classifier is effective.
Conclusions: Input documents in any form, whether captured images, scanned documents, or plain text documents, were classified using OCR, TF-IDF, and a Naive Bayes classifier. The approach provides an efficient way to classify accreditation documents automatically and addresses limiting factors of previous work, i.e., classification based on personal opinion and time-consuming manual classification.
Keywords: Accreditation Document Classification; Document Classification Objective Evaluation; TF-IDF; Term frequency-inverse document frequency; Multinomial Naive Bayes; OCR; Optical Character Recognition
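The classification step described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the software used, the OCR step is omitted (the input is assumed to be text that has already been extracted), and the stopword list, the smoothed IDF variant, the class name `TfidfNaiveBayes`, the area labels, and the training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption; real lists are far longer).
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def ngrams(text):
    """Lowercase, drop stopwords, and return unigrams followed by bigrams."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


class TfidfNaiveBayes:
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # 1.0 is Laplace smoothing; 0 < alpha < 1 is Lidstone

    def fit(self, docs, labels):
        n_docs = len(docs)
        doc_freq = Counter(term for doc in docs for term in set(ngrams(doc)))
        # Smoothed IDF, ln((1 + N) / (1 + df)) + 1: one common variant.
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1
                    for t, df in doc_freq.items()}
        self.class_counts = Counter(labels)
        self.term_weight = defaultdict(lambda: defaultdict(float))
        self.class_total = defaultdict(float)
        for doc, label in zip(docs, labels):
            for term, weight in self._tfidf(doc).items():
                self.term_weight[label][term] += weight
                self.class_total[label] += weight
        return self

    def _tfidf(self, doc):
        # Terms never seen during training carry no weight and are ignored.
        counts = Counter(t for t in ngrams(doc) if t in self.idf)
        return {t: c * self.idf[t] for t, c in counts.items()}

    def predict(self, doc):
        features = self._tfidf(doc)
        vocab_size = len(self.idf)
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / n_docs)  # log prior of the area
            denom = self.class_total[label] + self.alpha * vocab_size
            for term, weight in features.items():
                seen = self.term_weight[label].get(term, 0.0)
                score += weight * math.log((seen + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training snippets for two made-up accreditation areas.
training = [
    ("faculty qualifications and teaching load", "faculty"),
    ("faculty research publications and seminars", "faculty"),
    ("library holdings and book acquisitions", "library"),
    ("library journals and reading room services", "library"),
]
texts, areas = zip(*training)
model = TfidfNaiveBayes(alpha=1.0).fit(texts, areas)
print(model.predict("faculty teaching seminars"))
```

The sketch is written in plain Python so that each step (stopword removal, 1-2 n-grams, TF-IDF weighting, smoothed class likelihoods) is visible. In practice a library such as scikit-learn offers equivalent building blocks (`TfidfVectorizer` and `MultinomialNB`), although whether the study used them is not stated.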
Highlights
Accreditation is one way for higher education institutions (HEIs) to verify that they meet the required standards.
In each HEI, an accreditation task force is assigned to collect and classify documents. These documents come as Portable Document Format (PDF) files, Word documents (DOC), and scanned PDFs.
A system that automatically classifies documents is invaluable to the assigned accreditation task force.
Summary
Accreditation is one way for higher education institutions (HEIs) to verify that they meet the required standards. Accreditation requires documents showing whether a particular program in the HEI meets those standards. In each HEI, an accreditation task force is assigned to collect and classify documents. These documents come as Portable Document Format (PDF) files, Word documents (DOC), and scanned PDFs. The traditional way of classifying these documents depends on the judgment of the assigned task force and is therefore subjective. Because each decision rests on individual judgment, the process is also time-consuming. A system that automatically classifies documents is invaluable to the assigned accreditation task force.