Abstract

Programs with a natural-language user interface and text-processing programs require a vocabulary that maps each individual word form onto its lexeme, e.g. “says”, “said”, “saying” → “say”. Examples of such programs are indexing programs for information retrieval and spelling correctors for text-processing systems. Building such a computer vocabulary is an especially difficult lexicographical task for Slavic languages because their morphological structure is complex. An average Czech verb, for example, has 25 forms, and we have identified more than 100 paradigms for verbs. To support the creation of a Czech vocabulary, we have designed a system of programs for paradigm identification and derivation of words. The result of our effort is a vocabulary comprising 110 000 words and 1 250 000 word forms. This vocabulary was used for the PASSAT system in the Czechoslovak Press Agency. It may also be used in a spelling corrector; for such an application, however, the vocabulary must be compressed into a compact form in order to shorten access times. The compression is based on the paradigmatic structure of the morphology, which defines a set of suffixes for each word.
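
The idea of compressing a vocabulary through paradigms can be illustrated with a minimal sketch. The abstract does not describe the actual data structures, so everything below is an assumption made for illustration: the names `PARADIGMS`, `STEMS`, `lookup` and `expand` are hypothetical, and the toy English entries stand in for the Czech data. Each stem is stored once together with the identifier of a shared suffix set, rather than storing every word form separately.

```python
# Hypothetical paradigm table: paradigm id -> suffixes that may follow a stem.
# These toy English paradigms are illustrative only, not taken from the paper.
PARADIGMS = {
    "verb_regular": ("", "s", "ed", "ing"),
    "verb_say": ("y", "ys", "id", "ying"),
}

# Hypothetical compressed vocabulary: stem -> list of (paradigm id, lexeme).
# Each stem is stored once; its word forms are implied by the paradigm.
STEMS = {
    "walk": [("verb_regular", "walk")],
    "talk": [("verb_regular", "talk")],
    "sa": [("verb_say", "say")],
}


def lookup(word_form):
    """Return the set of lexemes whose paradigm generates word_form."""
    lexemes = set()
    # Try every split of the word form into a non-empty stem and a suffix.
    for cut in range(len(word_form), 0, -1):
        stem, suffix = word_form[:cut], word_form[cut:]
        for paradigm_id, lexeme in STEMS.get(stem, ()):
            if suffix in PARADIGMS[paradigm_id]:
                lexemes.add(lexeme)
    return lexemes


def expand(stem):
    """List all word forms implied by a stored stem, in sorted order."""
    return sorted(
        stem + suffix
        for paradigm_id, _ in STEMS[stem]
        for suffix in PARADIGMS[paradigm_id]
    )


if __name__ == "__main__":
    for form in ("says", "said", "saying", "walked", "sayed"):
        print(form, "->", lookup(form))
    print(expand("sa"))
```

In this sketch a spelling corrector would accept a word when `lookup` returns a non-empty set. The storage saving comes from the fact that a stem needs only one entry and a paradigm identifier, while the suffix sets are shared by all words of the same paradigm.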

