Abstract

This paper describes work on the morphological and syntactic annotation of Sumerian cuneiform as a model for low-resource languages in general. Cuneiform texts are invaluable sources for the study of the history, languages, economy, and cultures of Ancient Mesopotamia and its surrounding regions. Assyriology, the discipline dedicated to their study, has vast research potential but lacks modern means for computational processing and analysis. Our project, Machine Translation and Automated Analysis of Cuneiform Languages (MTAAC), aims to fill this gap by bringing together corpus data, lexical data, linguistic annotations and object metadata. The project’s main goal is to build a pipeline for machine translation and annotation of Sumerian Ur III administrative texts. The rich, structured data are then to be made accessible as (Linguistic) Linked Open Data (LLOD), which should open them to a larger research community. Our contribution is two-fold. In terms of language technology, our work represents the first attempt to develop an integrative infrastructure for the annotation of morphology and syntax on the basis of RDF technologies and LLOD resources. With respect to Assyriology, we work towards producing the first syntactically annotated corpus of Sumerian.

Highlights

  • The Sumerian language, an agglutinative isolate, is the earliest known language recorded in writing

  • We adopt a Linked Open Data approach for this purpose: we provide and consult an OWL representation of the Cuneiform Digital Library Initiative (CDLI) annotation scheme and its linking with Universal Dependencies (UD) POS, feature and dependency labels as part of the Ontologies of Linguistic

  • This paper describes work on the morphological and syntactic annotation of Sumerian cuneiform as a model for low-resource languages in general


Summary

Introduction

The Sumerian language, an agglutinative isolate, is the earliest known language recorded in writing. It was spoken in the third millennium BC in southern Iraq, and continued to be written until the late first millennium BC. Assyriologists make a text available for research by first copying and transcribing it from the inscribed artifact. A dozen projects that make various cuneiform corpora available online have emerged, building on digital transcriptions created as early as the 1960s. These initiatives rarely use shared conventions, and the available tool-set is limited. We employ Linguistic Linked Open Data (LLOD) technology to improve interoperability and resource integration for machine translation and linguistic annotation of Sumerian.
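To make the idea of exposing tabular annotations as linked data more concrete, the following minimal Python sketch turns a tab-separated, CoNLL-style token table into subject-predicate-object triples. It is an illustration only and not the project's converter: the namespace `http://example.org/sumerian/`, the column names `ID`, `WORD` and `POS`, the `nextWord` property, and the two toy tokens are all assumptions made for this example. The actual CoNLL-RDF tooling uses its own vocabulary, which is not reproduced here.

```python
from typing import Iterator, Optional, Tuple

# Hypothetical namespace and column layout, chosen for illustration only.
BASE = "http://example.org/sumerian/"
COLUMNS = ["ID", "WORD", "POS"]

Triple = Tuple[str, str, str]


def conll_to_triples(conll: str, sent_id: str = "s1") -> Iterator[Triple]:
    """Yield (subject, predicate, object) triples for each token row."""
    previous: Optional[str] = None
    for line in conll.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        subject = f"{BASE}{sent_id}_{row['ID']}"
        # One triple per annotation column (everything except the ID).
        for column in COLUMNS[1:]:
            yield (subject, f"{BASE}{column}", row[column])
        # Link the previous token to this one to preserve word order.
        if previous is not None:
            yield (previous, f"{BASE}nextWord", subject)
        previous = subject


# Toy input: two tokens with an assumed POS tag.
sample = "# toy input\n1\tlugal\tN\n2\te2\tN\n"
triples = list(conll_to_triples(sample))
for triple in triples:
    print(triple)
```

Once annotations are in triple form, each token has a stable identifier that other resources (a dictionary entry, an ontology concept for a tag) can point to, which is the interoperability benefit the paragraph above describes.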

Linked Open Data for Sumerian
The MTAAC Project
CoNLL Format
CoNLL-RDF
Annotation Workflow
Annotating Morphology
Dictionary-Based Pre-Annotation
Rule-Based Pre-Annotation with SPARQL
Application and Evaluation
Annotating Syntax
RDF-Based Pre-Annotation
Limits of Syntactic Pre-Annotation
Annotating Semantics
Machine Translation
Findings
Summary