Abstract
An important element of Natural Language Processing is part-of-speech tagging. Fine-grained word-class annotations enrich the word forms in a text and can be used in downstream processes, such as dependency parsing. The improved search options that tagged data offers also greatly benefit linguists and lexicographers. Natural language processing research is becoming increasingly popular and important as unsupervised learning methods are developed. Some aspects of the Albanian language make the creation of a part-of-speech tag set challenging. This paper discusses those linguistic phenomena and proposes a part-of-speech tag set that can adequately represent them. The corpus contains more than 250,000 tokens, each annotated with a medium-sized tag set. The Albanian language’s syntagmatic aspects are adequately represented. Additionally, this paper presents morphologically and part-of-speech tagged corpora for the Albanian language, as well as a lemmatizer and a neural morphological tagger trained on these corpora. On the held-out evaluation set, the model achieves 93.65% accuracy on part-of-speech tagging, 85.31% on morphological tagging, and 88.95% on lemmatization. Furthermore, the TF-IDF technique is used to weight terms, and the resulting scores highlight words that carry additional information in the Albanian corpus.
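To make the TF-IDF weighting mentioned above concrete, the following is a minimal sketch in plain Python. It assumes one common variant of the formula (relative term frequency multiplied by the natural logarithm of the inverse document frequency); the abstract does not say which variant the authors used. The function name `tf_idf` and the three toy Albanian sentences are illustrative assumptions, not data from the corpus.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy, hypothetical documents (not taken from the annotated corpus).
docs = [
    "unë lexoj një libër".split(),
    "ajo shkruan një letër".split(),
    "ai lexon një gazetë".split(),
]
weights = tf_idf(docs)
```

Under this variant, a word that occurs in every document (here "një") receives a weight of zero, while words confined to a single document receive the highest weights, which is the sense in which the scores highlight informative terms.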
Highlights
Linguistic data are necessary for many applications to facilitate communication, or at least to build linguistic datasets for use in natural language processing
Our research aims to develop a morphological tagger as the main component of a comprehensive part-of-speech tagger for standardizing the Albanian language by constructing an annotated corpus for it
We obtained our results by processing the Albanian corpus with the Natural Language Toolkit (NLTK)
Summary
Linguistic data are necessary for many applications to facilitate communication, or at least to build linguistic datasets for use in natural language processing. About 7 million native Albanian speakers live in Albania, Kosovo, North Macedonia, and other Balkan countries. The Albanian language has a complex grammar, which makes it an interesting and unique language to study. This complexity makes morphological tagging and lemmatization challenging. Annotations, lemmatization tools, morphological analysis tools, and part-of-speech taggers are not widely accessible for the language. We aim to create a corpus manually annotated with part-of-speech tags, morphological features, and lemmas. Part-of-speech tagging takes a text as input and produces an output text in which every word has an associated grammatical category, such as adjective, noun, verb, number, pronoun, etc.
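The input/output contract of part-of-speech tagging described above can be illustrated with a minimal sketch. It is a most-frequent-tag baseline written in plain Python, not the neural tagger the paper reports. The function names, the toy training sentences, and the coarse tags (PRON, VERB, DET, NOUN) are assumptions made for illustration and do not reflect the paper's tag set.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each word form to the tag it most often carries in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(tokens, model, default="NOUN"):
    """Pair every token with a tag; unseen words get the default tag."""
    return [(token, model.get(token.lower(), default)) for token in tokens]


def accuracy(gold_sentences, model):
    """Token-level accuracy of the model against gold annotations."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        for (_, gold), (_, predicted) in zip(sentence, tag(words, model)):
            correct += gold == predicted
            total += 1
    return correct / total if total else 0.0


# Toy, hypothetical training data (not taken from the annotated corpus).
train = [
    [("Unë", "PRON"), ("lexoj", "VERB"), ("një", "DET"), ("libër", "NOUN")],
    [("Ajo", "PRON"), ("shkruan", "VERB"), ("një", "DET"), ("letër", "NOUN")],
]
model = train_baseline(train)
tagged = tag("Ajo shkruan një gazetë".split(), model)
```

Here `tagged` pairs each input token with a category, and the unseen word "gazetë" falls back to the default tag. A baseline of this kind cannot separate the many inflected forms of a morphologically rich language, which is one reason a manually annotated corpus and a trained morphological tagger are needed.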