ArmSpeech-POS: Eastern Armenian Part-of-Speech Tagged Corpus

Varuzhan H Baghdasaryan

doi:10.51542/ijscia.v4i2.19

Abstract

Text chunking, part-of-speech (POS) tagging, and named entity recognition (NER) are fundamental tasks in natural language processing (NLP). POS tagging assigns a grammatical label to each word in a sentence. Armenian is a low-resource language, and there are not enough materials for developing highly accurate POS tagging systems for it. This paper presents a new dataset for POS tagging in Armenian, available in two versions: one follows the naming conventions of the Penn Treebank tagset and the other those of the Universal Dependencies tagset. The dataset consists of 6081 sentences that were automatically annotated and then manually verified. The data was sourced from Armenian news websites, focusing on topics such as culture, medicine, and lifestyle, as well as from 22 Armenian fairytales. The two versions of the POS tagset are provided to ensure compatibility and integration with all natural language processing tools and models that use these standards. Standardizing the tagset also makes it easier to compare and evaluate the effectiveness of different POS tagging models. The paper describes the data collection, cleaning, preprocessing, and processing steps. The ISMA translator was used to annotate the dataset; in addition to machine translation, it performs syntactic and semantic analysis of the text and assigns a POS tag to each word in the sentence. The final corpus contains 13 groups of part-of-speech tags, including a distinction between singular and plural forms, and a total of 57160 tagged tokens.
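
To illustrate how two tagset versions of the same annotation can relate, the following is a minimal sketch in Python that converts Penn Treebank style tags to Universal Dependencies (UPOS) tags. The mapping is a small subset of the commonly used English correspondence and is an assumption for illustration only; it is not the tag inventory or conversion procedure used for this corpus. The names `PTB_TO_UPOS`, `to_upos`, and `is_plural`, and the placeholder tokens, are hypothetical.

```python
# Illustrative subset of a Penn Treebank -> UPOS correspondence.
# Assumption: this is NOT the corpus's actual tag inventory.
PTB_TO_UPOS = {
    "NN": "NOUN",    # singular common noun
    "NNS": "NOUN",   # plural common noun
    "NNP": "PROPN",  # singular proper noun
    "NNPS": "PROPN", # plural proper noun
    "VB": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "PRP": "PRON",
    "DT": "DET",
    "IN": "ADP",
    "CC": "CCONJ",
    "CD": "NUM",
    "UH": "INTJ",
}


def to_upos(tagged, fallback="X"):
    """Convert (token, PTB tag) pairs to (token, UPOS tag) pairs.

    Tags missing from the mapping fall back to UPOS "X" (other).
    """
    return [(token, PTB_TO_UPOS.get(tag, fallback)) for token, tag in tagged]


def is_plural(ptb_tag):
    """Return True for Penn Treebank tags that mark a plural noun."""
    return ptb_tag in {"NNS", "NNPS"}


if __name__ == "__main__":
    # Placeholder tokens; a real sentence would come from the corpus files.
    sample = [("tok1", "DT"), ("tok2", "NNS"), ("tok3", "VB")]
    print(to_upos(sample))
    print([is_plural(tag) for _, tag in sample])
```

In this sketch the singular/plural distinction is carried by the Penn Treebank style tags (`NN` versus `NNS`), while both collapse to `NOUN` in the UPOS version, where number is normally expressed as a separate morphological feature rather than in the tag itself.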

Full Text