Tagsets

Jan Cloeren

doi:10.1007/978-94-015-9273-4_4

Abstract

In this chapter we focus on the various aspects of the annotation labels used in tagging and of the design of the tagsets they belong to. We start with the determination of the possible linguistic contents of a tagset, that is to say: which level of linguistic description it operates on (morphology, syntax, semantics, discourse, etc.) and what is described on each level. We then discuss problems related to the treatment of multitoken units, ambiguities and underspecification in wordclass tagging. Finally, we deal with the more formal aspects of the tagset, such as the structuring of the information and the representation of the labels. This includes terminological questions (label names) as well as decisions on the manner of encoding. In addition, we discuss a proposal by the Text Encoding Initiative (TEI) for the encoding of linguistic analyses in corpora using Standard Generalized Markup Language (SGML).

Keywords (added by machine, not by the authors): Annotation Scheme; Annotate Corpus; Syntactic Parsing; Possessive Pronoun; Sense Classification

Full Text