Inductive learning of formal languages, often called grammatical inference, is an active area in machine learning and computational learning theory. By learning a language we mean finding a grammar for the language when some positive examples (words from the language) and negative examples (words that are not in the language) are given. Learning mechanisms follow the natural language learning model: people master the language used in their environment by analysing positive and negative examples. The problem of inferring context-free grammars (CFGs) has both theoretical and practical motivations. Practical applications include pattern recognition (for example, finding DTDs or XML schemas for XML documents) and speech recognition (the ability to infer context-free grammars for natural languages would enable a speech recognizer to modify its internal grammar on the fly). There have been several attempts to find effective learning methods for context-free languages (for example [1,2,3,4,5]). In particular, Y. Sakakibara [3] introduced an interesting method of finding a context-free grammar in Chomsky normal form with a minimal set of nonterminals. He used a tabular representation similar to the parse table used in the CYK algorithm, together with genetic algorithms. In this paper we present several adjustments to the algorithm suggested by Sakakibara. The adjustments mainly concern the genetic algorithms used and are as follows:
– we introduce a method of creating the initial population which makes use of characteristic features of context-free grammars,
– new genetic operations are used (mutation with a path added, ‘die process’, ‘war/disease process’),
– a different definition of the fitness function is used,
– an effective compression of the structure of an individual in the population is suggested.
These changes speed up the process of grammar generation and, what is more, make it possible to infer richer grammars than those considered in [3].
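To make the role of the CYK parse table concrete, the following is a minimal sketch of the standard CYK membership test for a grammar in Chomsky normal form, used here to check whether a candidate grammar agrees with the given positive and negative examples. It is the textbook algorithm, not the tabular representation of [3] or the method of this paper; the names `cyk_accepts` and `is_consistent`, the rule encoding, and the example grammar for the language a^n b^n are assumptions made for illustration.

```python
def cyk_accepts(word, unary_rules, binary_rules, start="S"):
    """Return True if the CNF grammar derives `word` from `start`.

    unary_rules:  list of (lhs, terminal) pairs, e.g. ("A", "a")
    binary_rules: list of (lhs, (left, right)) pairs, e.g. ("S", ("A", "B"))
    """
    n = len(word)
    if n == 0:
        return False  # assumption: this sketch does not handle the empty word

    # table[i][k] holds the nonterminals that derive word[i : i + k + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]

    # Substrings of length 1 are derived by rules of the form A -> a
    for i, symbol in enumerate(word):
        table[i][0] = {lhs for lhs, terminal in unary_rules if terminal == symbol}

    # Longer substrings are derived by rules A -> B C at every split point
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, (b, c) in binary_rules:
                    if b in left and c in right:
                        table[i][length - 1].add(lhs)

    return start in table[0][n - 1]


def is_consistent(positive, negative, unary_rules, binary_rules, start="S"):
    """True if the grammar accepts every positive and rejects every negative example."""
    accepts = lambda w: cyk_accepts(w, unary_rules, binary_rules, start)
    return all(accepts(w) for w in positive) and not any(accepts(w) for w in negative)


# Hypothetical candidate grammar for { a^n b^n : n >= 1 } in Chomsky normal form:
#   S -> A B | A T,  T -> S B,  A -> a,  B -> b
UNARY = [("A", "a"), ("B", "b")]
BINARY = [("S", ("A", "B")), ("S", ("A", "T")), ("T", ("S", "B"))]

if __name__ == "__main__":
    print(is_consistent(["ab", "aabb"], ["aab", "ba", "abab"], UNARY, BINARY))
```

A consistency check of this kind is one plausible ingredient of a fitness function for a grammar-inference search, but the fitness function actually used in the paper is not specified in this abstract.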