Abstract

The old manuscripts kept in libraries are among the richest parts of the cultural heritage and legacy of civilizations. Digitization offers a way to preserve this cultural and historical heritage, which is otherwise very difficult for users to handle. The automatic or manual transcription of old Arabic manuscripts is an unavoidable stage in indexing and disseminating their contents, yet the cursive nature of Arabic writing is a handicap for optical character recognition (OCR) software. Transcribing with word-processing software or in HTML is no better a solution, because the resulting content is not structured. The complex structure of Arabic manuscripts can be approximated by a hierarchical model. Indeed, the strong dependence between the description of the data structure and the way the data are stored on physical media provides rigorous structures and access paths while keeping the implementation relatively simple. Creating such a document database according to a hierarchical model requires the coding and cataloging of heritage documents. eXtensible Markup Language (XML) provides a way to structure these documents while ensuring data integrity. In the field of documentary heritage, archivists reference each document with a unique code; this code can serve as the identifier of the manuscript in our document database and so avoid data redundancy. The encoding of documents is validated against XML schemas, which check the format, type and semantics of the data in the XML files. The identification, collection and registration of information are handled by a search engine based on metadata and annotations. These annotations are used to generate XML tags, which facilitate the transcription of Arabic manuscripts and populate our document database.
Transcribing images of heritage documents, in particular old Arabic manuscripts, requires an XML encoding that conforms to the recommendations of the Text Encoding Initiative (TEI). This XML-TEI encoding aims to standardize the coding of these documents and to facilitate their exploitation, exploration and dissemination, online or offline. In this paper we propose a search engine for ancient Arabic manuscripts based on metadata and XML annotations. It supports searches in a database fed by transcribed handwritten documents and retrieves the indexed images corresponding to users' queries. Its rich functionality, intuitive user interface, portability, extensibility and the power of XML technology make the search engine platform well suited to exploring ancient Arabic manuscripts.
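
The idea of using the archivist's unique code as the record identifier and searching on metadata can be illustrated with a minimal sketch. It is not the system described in this paper: the element layout is a simplified, TEI-inspired structure without namespaces, and the sample records, the codes `MS-0001` and `MS-0002`, and the functions `load_index` and `search` are hypothetical names introduced only for illustration. Schema validation is not shown.

```python
import xml.etree.ElementTree as ET

# Invented sample corpus. Element names borrow from TEI (teiHeader, idno,
# title, term), but the layout is deliberately reduced and is an assumption.
CORPUS_XML = """
<corpus>
  <TEI>
    <teiHeader>
      <idno>MS-0001</idno>
      <title>Sample treatise on arithmetic</title>
      <keywords><term>arithmetic</term><term>algebra</term></keywords>
    </teiHeader>
    <text><body><p>Transcribed text of the first folio.</p></body></text>
  </TEI>
  <TEI>
    <teiHeader>
      <idno>MS-0002</idno>
      <title>Sample commentary on grammar</title>
      <keywords><term>grammar</term></keywords>
    </teiHeader>
    <text><body><p>Transcribed text of the first folio.</p></body></text>
  </TEI>
</corpus>
"""


def load_index(xml_text):
    """Build a {code: record} index, rejecting duplicate archive codes."""
    index = {}
    root = ET.fromstring(xml_text.strip())
    for tei in root.iter("TEI"):
        header = tei.find("teiHeader")
        code = header.findtext("idno", default="").strip()
        if not code:
            raise ValueError("record without an archive code")
        if code in index:
            # The unique code is what prevents redundant records.
            raise ValueError(f"duplicate manuscript code: {code}")
        index[code] = {
            "title": header.findtext("title", default="").strip(),
            "terms": [t.text.strip().lower() for t in header.iter("term") if t.text],
            "text": " ".join((p.text or "").strip() for p in tei.iter("p")),
        }
    return index


def search(index, query):
    """Return the sorted codes whose keywords or title match the query."""
    q = query.lower()
    return sorted(
        code
        for code, rec in index.items()
        if q in rec["terms"] or q in rec["title"].lower()
    )


if __name__ == "__main__":
    idx = load_index(CORPUS_XML)
    print(search(idx, "grammar"))
```

Keying the index on the archive code means that a second record with the same code is rejected at load time rather than silently stored twice. A production system would presumably also validate each file against an XML schema before indexing it.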
