Building a Thai part-of-speech tagged corpus (ORCHID).

Virach Sornlertlamvanich, Hitoshi Isahara, Naoto Takahashi

doi:10.1250/ast.20.189

Abstract

ORCHID (Open linguistic Resources CHanelled toward InterDisciplinary research) is an initiative aimed at building linguistic resources to support research in fields including, but not limited to, natural language processing. Because the project follows an open architecture design, the resources must be fully compatible with similar resources, and software tools must also be made available. This paper presents one result of the project, the construction of a Thai part-of-speech (POS) tagged corpus, which is a preliminary stage in the construction of a Thai speech corpus. The POS-tagged corpus is the result of collaborative research between the Communications Research Laboratory (CRL) in Japan and the National Electronics and Computer Technology Center (NECTEC) in Thailand, with technical support from the Electrotechnical Laboratory (ETL) in Japan. In this paper, we propose a new tagset based on the results of a prior multilingual machine translation project. The corpus is annotated on three levels: paragraph, sentence, and word. Text information is maintained in the form of text information lines and number lines, both of which are used in data retrieval. Both word segmentation and POS tagging were carried out using a probabilistic trigram model. Rules for syllable demarcation were additionally used to reduce the number of candidates when computing tagging probabilities.
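To make the idea of a probabilistic trigram model concrete, the following is a minimal sketch of trigram POS tagging with Viterbi decoding. It is not the ORCHID implementation: the abstract does not give the model's details, so the add-one smoothing, the function names `train` and `tag`, the English toy corpus, and the three-tag tagset are all assumptions made for illustration. The sketch also assumes the input is already segmented into words, whereas the paper applies the trigram model to word segmentation as well.

```python
import math
from collections import defaultdict


def train(tagged_sentences):
    """Count tag trigrams and word emissions from lists of (word, tag) pairs."""
    tri = defaultdict(int)        # (t1, t2, t3) -> count
    bi = defaultdict(int)         # (t1, t2) -> count as a trigram context
    emit = defaultdict(int)       # (tag, word) -> count
    tag_count = defaultdict(int)  # tag -> count
    vocab = set()
    for sent in tagged_sentences:
        tags = ["<s>", "<s>"] + [t for _, t in sent]
        for i in range(2, len(tags)):
            tri[(tags[i - 2], tags[i - 1], tags[i])] += 1
            bi[(tags[i - 2], tags[i - 1])] += 1
        for word, t in sent:
            emit[(t, word)] += 1
            tag_count[t] += 1
            vocab.add(word)
    return tri, bi, emit, tag_count, vocab


def tag(words, model):
    """Return the most probable tag sequence under the trigram model."""
    tri, bi, emit, tag_count, vocab = model
    tags = sorted(tag_count)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unknown words

    def log_trans(t1, t2, t3):
        # Add-one smoothed P(t3 | t1, t2); the smoothing is an assumption.
        return math.log((tri.get((t1, t2, t3), 0) + 1)
                        / (bi.get((t1, t2), 0) + n_tags))

    def log_emit(t, word):
        # Add-one smoothed P(word | t).
        return math.log((emit.get((t, word), 0) + 1)
                        / (tag_count[t] + n_words))

    # Viterbi over tag-pair states: (previous tag, current tag) -> (score, path)
    best = {("<s>", "<s>"): (0.0, [])}
    for word in words:
        nxt = {}
        for (t1, t2), (score, path) in best.items():
            for t3 in tags:
                s = score + log_trans(t1, t2, t3) + log_emit(t3, word)
                key = (t2, t3)
                if key not in nxt or s > nxt[key][0]:
                    nxt[key] = (s, path + [t3])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical toy data; not drawn from the ORCHID corpus or its tagset.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train(toy_corpus)
print(tag(["the", "cat", "barks"], model))  # ['DET', 'NOUN', 'VERB']
```

Each Viterbi state is a pair of adjacent tags, so the work per word grows with the cube of the tagset size. This is one reason pruning the candidate set, as the syllable demarcation rules do in the paper, can matter in practice.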

Full Text