Abstract

Abstract Digital Humanities (DH) within Coptic Studies, an emerging field of development, will be much aided by the digitization of large quantities of typeset Coptic texts. Until recently, the only Optical Character Recognition (OCR) analysis of printed Coptic texts had been executed by Moheb S. Mekhaiel, who used the Tesseract program to create a text model for liturgical books in the Bohairic dialect of Coptic. However, this model is not suitable for the many scholarly editions of texts in the Sahidic dialect of Coptic which use noticeably different fonts. In the current study, DH and Coptological projects based in Göttingen, Germany, collaborated to develop a new Coptic OCR pipeline suitable for use with all Coptic dialects. The objective of the study was to generate a model which can facilitate digital Coptic Studies and produce Coptic corpora from existing printed texts. First, we compared the two available OCR programs that can recognize Coptic: Tesseract and Ocropy. The results indicated that the neural network model, i.e. Ocropy, performed better at recognizing the letters with supralinear strokes that characterize the published Sahidic texts. After training Ocropy for Coptic using artificial neural networks, the team achieved an accuracy rate of >91% for the OCR analysis of Coptic typeset. We subsequently compared the efficiency of Ocropy to that of manual transcribing and concluded that the use of Ocropy to extract Coptic from digital images of printed texts is highly beneficial to Coptic DH.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call