Abstract

Text chunking, part-of-speech (POS) tagging, and named entity recognition (NER) are fundamental tasks in natural language processing (NLP). POS tagging assigns grammatical labels to the words in a sentence. Prior research indicates that Armenian is a low-resource language, and there are not enough materials for developing highly accurate POS tagging systems for it. This paper presents a new dataset for POS tagging in Armenian that follows the naming conventions of both the Penn Treebank and Universal Dependencies tagsets, with two versions available. The dataset consists of 6081 sentences that were automatically annotated and then manually verified. The data was sourced from Armenian news websites, focusing on topics such as culture, medicine, and lifestyle, as well as from 22 Armenian fairytales. Two versions of the POS tagset are provided to ensure compatibility and integration with the natural language processing tools and models that use these standards. Standardizing the tagset also makes it easier to compare and evaluate different POS tagging models. The paper also describes the data collection, cleaning, preprocessing, and processing steps. The ISMA translator was used to annotate the dataset; in addition to machine translation, it performs syntactic and semantic analysis of the text and assigns a POS tag to each word in the sentence. The final corpus contains 13 groups of part-of-speech tags and a total of 57160 tagged tokens, including the distinction between singular and plural forms.
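
To illustrate how the two tagset conventions relate, the following is a minimal sketch of a partial Penn Treebank to Universal Dependencies conversion. It is not the mapping used for this dataset, which the abstract does not specify. The table `PTB_TO_UD` and the function `ptb_to_ud` are hypothetical names, and only a few well-known English tag correspondences are included. Note that Penn Treebank encodes the singular/plural distinction in the tag itself, whereas Universal Dependencies records it as a separate `Number` feature.

```python
# Illustrative, partial mapping: Penn Treebank tag -> (UD UPOS tag, Number feature).
# Assumption: this is NOT the dataset's own conversion table, only a sketch.
PTB_TO_UD = {
    "NN": ("NOUN", "Sing"),    # singular common noun
    "NNS": ("NOUN", "Plur"),   # plural common noun
    "NNP": ("PROPN", "Sing"),  # singular proper noun
    "NNPS": ("PROPN", "Plur"), # plural proper noun
    "JJ": ("ADJ", None),       # adjective
    "RB": ("ADV", None),       # adverb
    "CD": ("NUM", None),       # cardinal number
    "UH": ("INTJ", None),      # interjection
}


def ptb_to_ud(tag):
    """Return (UPOS, Number or None) for a Penn Treebank tag.

    Tags missing from the partial table fall back to UD's catch-all tag "X".
    """
    return PTB_TO_UD.get(tag, ("X", None))


def convert_sentence(tagged_tokens):
    """Convert a list of (word, PTB tag) pairs to (word, UPOS, Number) triples."""
    return [(word, *ptb_to_ud(tag)) for word, tag in tagged_tokens]
```

A call such as `convert_sentence([("tales", "NNS")])` would yield the word with the `NOUN` tag and a plural `Number` value, which shows why a single corpus can be published in both conventions without losing the number distinction.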
