Abstract

In this chapter we focus on the annotation labels used in tagging and on the design of the tagsets they belong to. We start by determining the possible linguistic content of a tagset, that is, which level of linguistic description it operates on (morphology, syntax, semantics, discourse, etc.) and what is described at each level. We then discuss problems related to the treatment of multitoken units, ambiguities and underspecification in wordclass tagging. Finally, we deal with the more formal aspects of the tagset, such as the structuring of the information and the representation of the labels. This includes terminological questions (label names) as well as decisions on the manner of encoding. In addition, we discuss a proposal by the Text Encoding Initiative (TEI) for the encoding of linguistic analyses in corpora using the Standard Generalized Markup Language (SGML).

Keywords: Annotation Scheme, Annotate Corpus, Syntactic Parsing, Possessive Pronoun, Sense Classification
