ORCHID (Open linguistic Resources CHanelled toward InterDisciplinary research) is a project that builds linguistic resources to support research in natural language processing and related fields. Following an open architecture design, the resources must be fully compatible with existing ones, and software tools must also be made available. This paper describes the construction of a Thai part-of-speech (POS) tagged corpus, a preliminary stage in the construction of a Thai speech corpus. It also details the development of a POS tagger used to build the POS-tagged corpus, and proposes a new tagset based on the results of a prior multilingual machine translation project. The corpus is annotated on three levels: paragraph, sentence, and word. Text information is kept in text information lines and number lines, both of which are used in data retrieval. Finally, we describe a POS neuro tagger, which consists of a three-layer perceptron with elastic input. Computer experiments show that the neuro tagger has an accuracy of 94.4 per cent for tagging ambiguous words when tested on a small Thai training corpus containing 22,311 ambiguous words. A series of comparative experiments further shows that the neuro tagger outperforms statistical models, including the frequency model (a baseline model), the local n-gram model, and the HMM (Hidden Markov Model).
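
To make the frequency baseline and the "accuracy on ambiguous words" measure concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual reading of a frequency model (each word receives the tag it carried most often in training) and treats a word as ambiguous if it was seen with more than one tag. The function names, the `UNK` fallback, and the toy English data with `NOUN`/`VERB` tags are all hypothetical and are not drawn from the ORCHID corpus or its tagset.

```python
from collections import Counter, defaultdict


def count_tags(tagged_sentences):
    """Count how often each word carries each tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return counts


def train_frequency_model(counts):
    """Map every word to its single most frequent training tag."""
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def ambiguous_words(counts):
    """Words observed with more than one tag (assumed definition of 'ambiguous')."""
    return {word for word, tags in counts.items() if len(tags) > 1}


def ambiguous_accuracy(test_sentences, model, ambiguous, default_tag="UNK"):
    """Accuracy computed over ambiguous tokens only."""
    correct = total = 0
    for sentence in test_sentences:
        for word, gold in sentence:
            if word in ambiguous:
                total += 1
                correct += model.get(word, default_tag) == gold
    return correct / total if total else 0.0


# Hypothetical toy data, for illustration only.
train = [
    [("book", "NOUN"), ("flight", "NOUN")],
    [("book", "VERB"), ("flight", "NOUN")],
    [("book", "NOUN")],
]
test = [
    [("book", "VERB"), ("flight", "NOUN")],
    [("book", "NOUN")],
]

counts = count_tags(train)
model = train_frequency_model(counts)
ambiguous = ambiguous_words(counts)
accuracy = ambiguous_accuracy(test, model, ambiguous)
print(model["book"], sorted(ambiguous), accuracy)
```

Restricting the score to ambiguous tokens keeps unambiguous words, which any tagger gets right by lookup, from inflating the figure. That is presumably why the abstract reports accuracy on ambiguous words only.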