Abstract
The main obstacle to the automated translation and processing of dialects is the dearth of linguistic resources. Such resources provide the data that natural language processing (NLP) practitioners need to conduct experiments in dialect recognition, processing, and machine translation. This article highlights the need to resource the Algerian dialects, reviews the use of the available relevant corpora, and describes the construction and distinctive features of the first Oranian-English parallel corpus (OEPC). It is the first parallel corpus, built from scratch, that pairs an Algerian dialect with its English equivalents. In particular, the article presents the criteria and steps for compiling a monolingual corpus of the Oranian dialect (ORN), with reference to data sources and formats. The monolingual ORN corpus reached 8.5K sentences; pairing these with their English equivalents yielded OEPC. This significant linguistic resource was produced under the Empowering and Resourcing Algerian Dialects project, which was launched to supply NLP experts with Algerian mono-, bi-, multi-, and cross-dialectal corpora. The article also explains the data compilation and augmentation mechanism used to extend the project's output.