Abstract

A universal method for word form recognition and generation is proposed. It includes a model of word formation in natural languages and algorithms for word form generation and recognition. The word formation model abstracts away from the typical notions of natural-language morphology.

Current problems in machine translation, spell checking, text analysis and knowledge extraction from texts, computer-user dialog in a natural language, and query analysis in the information search systems of global information networks all involve multilevel automatic text processing. These levels are morphological, syntactic, and semantic. There are a large number of methods for solving morphological-level problems in text processing (see the short review in [1]). However, they are designed for one language or for several languages that are similar in their word formation. At the same time, given worldwide globalization, there is a pressing need to develop applications for text processing in different languages. The necessity therefore arises to create a method for the morphological processing of texts in natural languages of different groups and families.

This work proposes a universal method for the generation and recognition of word forms that is independent of the formation type and capable of processing the entire paradigm of a word. The method includes a model of word formation and algorithms for word form generation and recognition. The model of word formation in a natural language (hereafter, the model of formation) assumes that the generation of any word form with a given grammatical meaning can be represented as a sequence of a finite number of transformations of the base. The way formation is represented and the generation and recognition algorithms are interrelated. On the one hand, one part governs the other. On the other hand, simplifying one part complicates the other.
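The idea of representing a word form as a finite sequence of transformations of the base can be illustrated with a minimal sketch. It is not the authors' implementation: the suffix-only transformations, the helper names `remove_suffix`, `add_suffix`, and `apply_chain`, and the choice of the Spanish infinitive as the base are all assumptions made for illustration.

```python
from typing import Callable, List

# A transformation maps one string to another (hypothetical representation).
Transformation = Callable[[str], str]


def remove_suffix(suffix: str) -> Transformation:
    """Return a transformation that strips `suffix` from the end of a string."""
    def apply(text: str) -> str:
        if not text.endswith(suffix):
            raise ValueError(f"{text!r} does not end with {suffix!r}")
        return text[: len(text) - len(suffix)]
    return apply


def add_suffix(suffix: str) -> Transformation:
    """Return a transformation that appends `suffix` to a string."""
    return lambda text: text + suffix


def apply_chain(base: str, chain: List[Transformation]) -> str:
    """Apply a finite chain of transformations to the base, in order."""
    form = base
    for transformation in chain:
        form = transformation(form)
    return form


# Assumed example: Spanish "hablar" -> "hablamos" (present tense, 1st person plural).
chain = [remove_suffix("ar"), add_suffix("amos")]
print(apply_chain("hablar", chain))
```

Languages with prefixes, infixes, or stem alternations would need further transformation kinds; the sketch only shows how a chain composes.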
The objective of the word form generation algorithm is to obtain the word form F corresponding to the base S and the grammatical meaning G. The input data are S and G; the output is F. The result of word form generation is always unambiguous. The generation algorithm consists of two stages. In the first stage, the required transformation chain R is found from the input data, that is, the base S and the grammatical meaning G. In the second stage, the chain R is applied to the base S to obtain the required form F.

The word form recognition algorithm recovers the grammatical meaning G and the base S that correspond to the given word form F. The input is F; the output data are S and G. The algorithm iterates over the transformation chains in a loop. At each iteration, a new chain Ri is taken together with its base type Ti and grammatical meaning Gi. The reverse of this chain is then applied to the given word form F to obtain a base Si. If the base Si has the same base type Ti as the current chain, the pair consisting of the obtained base Si and the grammatical meaning Gi is added to the results. The iteration is then complete, and the algorithm moves on to the next one. The number of iterations equals n, the number of transformation chains.

Thus, we have developed a universal method for word form generation and recognition that includes the model of formation and the processing algorithms. The proposed processing algorithms are applicable to natural languages of different groups and families. The algorithms were successfully tested for Russian, German, Spanish, and Finnish. The transformation chains were classified, and an algorithm for the automated construction of transformation chains was proposed on the basis of this classification.
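The two algorithms described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the published implementation: the `Step` and `Rule` classes, the `RULES` table, the tiny `LEXICON` used to look up base types, and the Spanish sample data are all hypothetical, and only suffix replacement is modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    """One reversible transformation: replace the suffix `remove` with `add`."""
    remove: str
    add: str

    def forward(self, text: str) -> str:
        if not text.endswith(self.remove):
            raise ValueError("step not applicable")
        return text[: len(text) - len(self.remove)] + self.add

    def reverse(self, text: str) -> str:
        if not text.endswith(self.add):
            raise ValueError("step not reversible on this string")
        return text[: len(text) - len(self.add)] + self.remove


@dataclass(frozen=True)
class Rule:
    base_type: str            # T_i
    meaning: str              # G_i
    chain: Tuple[Step, ...]   # R_i


# Assumed sample data: two Spanish verb classes, two grammatical meanings.
RULES: List[Rule] = [
    Rule("verb-ar", "present 1sg", (Step("ar", "o"),)),
    Rule("verb-ar", "present 1pl", (Step("ar", "amos"),)),
    Rule("verb-er", "present 1sg", (Step("er", "o"),)),
    Rule("verb-er", "present 1pl", (Step("er", "emos"),)),
]

# Assumed lexicon mapping each known base S to its base type T.
LEXICON: Dict[str, str] = {"hablar": "verb-ar", "comer": "verb-er"}


def generate(base: str, meaning: str) -> str:
    """Stage 1: find the chain R for (S, G). Stage 2: apply R to S."""
    base_type = LEXICON[base]
    for rule in RULES:
        if rule.base_type == base_type and rule.meaning == meaning:
            form = base
            for step in rule.chain:
                form = step.forward(form)
            return form
    raise LookupError(f"no chain for {base!r} with meaning {meaning!r}")


def recognize(form: str) -> List[Tuple[str, str]]:
    """Try the reverse of every chain; keep (S_i, G_i) when the base type matches."""
    results = []
    for rule in RULES:  # n iterations, one per chain
        candidate = form
        try:
            for step in reversed(rule.chain):
                candidate = step.reverse(candidate)
        except ValueError:
            continue  # the reverse chain does not apply to this form
        if LEXICON.get(candidate) == rule.base_type:
            results.append((candidate, rule.meaning))
    return results


print(generate("comer", "present 1pl"))
print(recognize("hablamos"))
```

In this sketch the base-type check is what filters out spurious candidates: reversing the "verb-ar" chain on "como" yields "comar", which is not in the assumed lexicon and is therefore rejected.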
