Abstract

Optical Character Recognition (OCR) plays an important role in the creation of digital language resources. Because OCR solutions are often language-specific, the availability of models for South African languages helps to alleviate the language data scarcity problem. We describe the development of a digitisation pipeline in the context of a multilingual corpus project. We test a recently developed OCR model for the Setswana language against a selection of quality-assured texts, and improve our output using image processing software and a newly developed tool, Ontrafel, for post-processing OCR output in PDF files. Each step in the pipeline is shown to improve output quality as measured by the Character Error Rate (CER). Finally, a qualitative analysis provides insights that may contribute to refining the pipeline steps or improving the existing OCR model. Apart from the creation of new digital language data for Setswana, we hope that our work stimulates and contributes to further research into high-quality digitisation of South African language resources.
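For readers unfamiliar with the metric, the following is a minimal sketch of how a Character Error Rate is commonly computed: the Levenshtein (edit) distance between the OCR output and a reference transcription, divided by the length of the reference. The abstract does not specify the exact formulation or tooling used, so the function names `levenshtein` and `character_error_rate` and the sample strings are illustrative assumptions, not the authors' implementation or data.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the first i-1 reference
    # characters and the first j hypothesis characters.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # ref_char is missing from the hypothesis
                curr[j - 1] + 1,     # hyp_char is an extra character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference.
    Assumed definition; the paper may normalise differently."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only (not taken from the paper's corpus):
    # a typical OCR confusion where "m" is read as "rn".
    print(character_error_rate("dumela", "durnela"))
```

Under this definition a lower value is better, and a pipeline step can be said to improve quality if the CER against the same reference text decreases after the step is applied. Note that the value can exceed 1.0 when the OCR output contains many spurious characters.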
