Abstract

Morphological annotation is essential and very common linguistic information in corpora, especially for highly inflectional languages. The morphological tagset used in the Slovak National Corpus has been designed with several goals in mind: the tags are compact and easily human-readable, without sacrificing their information content. The tags consist of ASCII letters, numbers and several other characters. In general, they have a variable number of symbols, but the order of the symbols is fixed, and each category or specific feature is assigned a particular character, which can be shared among several parts of speech. The tagset is highly functional and pragmatic, although some allowances had to be made to accommodate the traditional analysis of Slovak morphology and part of speech categories. In particular, function words are classified according to their syntactic (and semantic) roles, which is one reason why the tagset is sometimes described as a morphosyntactic one.

Highlights

  • Morphological annotation is essential and very common linguistic information in corpora, especially for highly inflectional languages

  • The tagset has been influenced by the Slovak National Corpus (SNK) tagset; one of the design goals was to keep automatic conversion from the SNK tagset into the MTE tagset reasonably simple

  • Checking the part of speech category could be expressed in a Python-like pseudolanguage (counting positions from zero) as: if tag[0]=='S': # noun, followed by the code that processes the noun; checking for the value of a grammatical category can be similarly easy
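
The two checks mentioned in the last highlight can be sketched in Python roughly as follows. The noun field order (part of speech, paradigm, gender, number, case) follows the tag-structure outline of the paper; the example tag SSms1, the helper names and the handling of any trailing characters are illustrative assumptions, not taken from this summary.

```python
# Field order for noun tags, as listed in the tag-structure outline.
# Positions are counted from zero, as in the original pseudocode.
NOUN_FIELDS = ("pos", "paradigm", "gender", "number", "case")


def is_noun(tag: str) -> bool:
    """Return True if the first character of the tag is 'S' (noun)."""
    return tag[:1] == "S"


def noun_features(tag: str) -> dict[str, str]:
    """Pair each leading character of a noun tag with its field name.

    Tags have a variable number of symbols, so anything beyond the five
    noun fields is kept under the (hypothetical) key 'rest'.
    """
    if not is_noun(tag):
        raise ValueError(f"not a noun tag: {tag!r}")
    features = dict(zip(NOUN_FIELDS, tag))
    rest = tag[len(NOUN_FIELDS):]
    if rest:
        features["rest"] = rest
    return features


tag = "SSms1"  # hypothetical example tag, assumed for illustration
if is_noun(tag):
    features = noun_features(tag)
    # Checking a grammatical category is a single lookup.
    print(features["case"] == "1")
```

Because the value of a category sits at a known position once the part of speech is known, no parsing beyond indexing is needed in this sketch.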


Summary

Radovan Garabík and Mária Šimková

Morphological annotation constitutes fundamental and very common linguistic information found in corpora, especially for inflectional languages. It comprises the part of speech categorisation of lemmas and the morphological characterisation of a word (token). Tokenisation, lemmatisation and the principles of morphological annotation used in manual tagging of the rmak corpus are described in the user guide (Garabík et al., 2004). This tagset is an adaptation of the Czech language tagset developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague (Hajič, 2004). It is a positional tagset of fixed length, each tag containing 15 ASCII characters.

The codes for the values can be reused across attributes; the tags are of unequal length (a rather important feature is that the value assignment does not depend on the part of speech).
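
The fixed-length property of the Czech-derived positional tagset lends itself to a simple mechanical check. The sketch below assumes nothing beyond what the text states (15 ASCII characters per tag); the function names and the example strings are hypothetical and serve only as illustration.

```python
TAG_LENGTH = 15  # from the text: each tag contains 15 ASCII characters


def is_fixed_length_tag(tag: str) -> bool:
    """Check that a tag is ASCII-only and exactly TAG_LENGTH characters long."""
    return len(tag) == TAG_LENGTH and tag.isascii()


def feature_at(tag: str, position: int) -> str:
    """Return the character at a zero-based position of a validated tag."""
    if not is_fixed_length_tag(tag):
        raise ValueError(f"not a {TAG_LENGTH}-character ASCII tag: {tag!r}")
    return tag[position]


# Illustrative strings only; they are not taken from this summary.
fixed = "NNFS1-----A----"
variable = "SSms1"

print(is_fixed_length_tag(fixed))
print(is_fixed_length_tag(variable))
```

In a fixed-length scheme every position exists in every tag, so a lookup by position never depends on the tag's length; with tags of unequal length, a position has to be interpreted relative to the part of speech first.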

Multext East
Tag Structure
Grammatical Number
Grammatical Gender
Degree of Comparison
Description: part of speech tag, paradigm, gender, number, case
Position Possible values
Description: part of speech tag, form, aspect, number, person, negation
Other Categories