Abstract

The analysis process

Most taggers (the programs that tag an uncoded corpus) make use of several kinds of information. First, they have dictionaries that list the category or categories a particular word can belong to. Some words, such as “the” and “a,” are not ambiguous; they can be automatically identified as the definite and indefinite article. Other words are ambiguous, such as “deal,” which can be a noun or a verb. Dictionaries can also identify fixed expressions (e.g., identifying the sequence “and so forth” as an adverb or “such that” as a subordinator). Finally, dictionaries can have lists of words that take certain grammatical patterns (e.g., the verbs or nouns that can control complement clauses).

For words that are ambiguous, many taggers make use of probabilistic information. This information is derived from previously tagged corpora whose tags are known to be accurate (such as the LOB, for which all the grammatical tags were checked). The probabilistic information tells the tagger how likely it is that a given word belongs to one class or another. “Book,” for instance, can be a verb or a noun, but it has a much higher probability of occurring as a noun. Probabilities can also be applied to a sequence of tags. For example, to disambiguate “respect” in the phrase “in respect of the,” the tagger would compare the probability of a preposition-verb-preposition sequence with that of a preposition-noun-preposition sequence.
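
The interplay of dictionary lookup, word-level probabilities, and tag-sequence probabilities can be illustrated with a minimal Python sketch. It is not the method of any particular tagger: the tag names, the `LEXICON` and `TRANSITIONS` tables, and every probability value are invented for illustration, and the functions `score` and `tag_sequence` are hypothetical helpers. The sketch simply multiplies the probability of each word's tag by the probability of each adjacent tag pair and keeps the highest-scoring sequence.

```python
from itertools import product

# Hypothetical dictionary: each word maps to its possible tags, with
# made-up probabilities P(tag | word). Unambiguous words have one tag.
LEXICON = {
    "in": {"PREP": 1.0},
    "of": {"PREP": 1.0},
    "the": {"ART": 1.0},
    "respect": {"NOUN": 0.7, "VERB": 0.3},
    "book": {"NOUN": 0.9, "VERB": 0.1},
}

# Made-up tag-pair probabilities P(next_tag | previous_tag). In a real
# tagger these would be estimated from an accurately tagged corpus.
TRANSITIONS = {
    ("PREP", "NOUN"): 0.30,
    ("PREP", "VERB"): 0.02,
    ("NOUN", "PREP"): 0.25,
    ("VERB", "PREP"): 0.15,
    ("PREP", "ART"): 0.40,
}
UNSEEN = 1e-6  # small fallback for tag pairs missing from the table


def score(words, tags):
    """Multiply word-level and tag-pair probabilities for one tagging."""
    total = 1.0
    previous = None
    for word, tag in zip(words, tags):
        total *= LEXICON[word][tag]
        if previous is not None:
            total *= TRANSITIONS.get((previous, tag), UNSEEN)
        previous = tag
    return total


def tag_sequence(words):
    """Return the highest-scoring tag tuple among all dictionary options."""
    words = [w.lower() for w in words]
    # Iterating a dict yields its keys, i.e. the candidate tags per word.
    candidates = product(*(LEXICON[w] for w in words))
    return max(candidates, key=lambda tags: score(words, tags))


print(tag_sequence(["in", "respect", "of", "the"]))
print(tag_sequence(["Book"]))
```

With these illustrative numbers, the preposition-noun-preposition reading of “in respect of the” outscores the preposition-verb-preposition reading, and “book” in isolation is tagged as a noun. Enumerating every candidate sequence is only workable for short phrases; it is used here to keep the sketch small.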

