Abstract
In this paper, we present a new approach to automatic tagging that requires neither a machine learning algorithm nor training data. We argue that the critical information required for tagging comes more from word-internal structure than from context, and we show how a well-designed morphological analyzer can assign correct tags and also resolve many cases of tag ambiguity. The crux of the approach lies in the very definition of words. While others simply tokenize a given sentence on spaces and take the resulting tokens to be words, we argue that words need to be motivated by semantic and syntactic considerations, not orthographic conventions. We have worked on Telugu and Kannada; in this paper, we take Telugu as the example and show how high-quality tagging can be achieved with a fine-grained, hierarchical tag set that carries not only morpho-syntactic information but also those aspects of lexical and semantic information that are necessary or useful for syntactic parsing. In fact, entire corpora can be tagged very quickly and with a good degree of assurance of quality. We give details of our experiments and the results obtained. We believe our approach can also be applied to other languages.