Mining Bilingual Linguistic Patterns with Aligned and Parsed Bilingual Corpus

Bo Wang, Yuexian Hou, Fanqi Meng

doi:10.4156/jcit.vol7.issue12.8

Abstract

Classical grammar for natural languages, as defined by linguists, is widely used in many natural language processing (NLP) tasks, such as information extraction, machine translation and parsing. Classical grammar is well defined, but it is context-free and does not cover complex patterns that span multiple linguistic units. There are also many simple patterns that are absent from classical grammar yet useful in NLP tasks. The recognition of such special linguistic patterns in natural language is therefore an important step in various NLP systems. We propose an unsupervised method that automatically discovers complex monolingual linguistic patterns from a classically parsed and aligned bilingual corpus, with every pattern in one language qualified by the parallel language. A specialized, efficient algorithm mines the frequent bilingual subtrees in the forest of parse trees, and the subtrees it finds are formalized as linguistic patterns.
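
To make the idea of mining frequent subtrees from a forest concrete, the following is a minimal sketch and not the algorithm of this paper. It assumes parse trees are nested tuples of the form `(label, child, ...)` with words as plain strings, it counts only complete (bottom-up) subtrees with the words removed, and it leaves out the bilingual alignment step entirely. The names `skeleton`, `subtrees`, `height`, `frequent_subtrees`, `min_support` and `min_height` are hypothetical and chosen for illustration.

```python
from collections import Counter

def skeleton(tree):
    """Strip the words from a parse tree, keeping only the labels."""
    if isinstance(tree, str):          # a word (leaf token)
        return None
    label, *children = tree
    kids = tuple(s for s in (skeleton(c) for c in children) if s is not None)
    return (label,) + kids

def subtrees(skel):
    """Yield every complete subtree of a label-only tree."""
    yield skel
    for child in skel[1:]:
        yield from subtrees(child)

def height(skel):
    """Number of label levels in a label-only tree."""
    return 1 + max((height(c) for c in skel[1:]), default=0)

def frequent_subtrees(forest, min_support=2, min_height=2):
    """Count in how many trees each subtree occurs and keep the frequent ones."""
    support = Counter()
    for tree in forest:
        # A set ensures each pattern is counted once per tree.
        seen = {s for s in subtrees(skeleton(tree)) if height(s) >= min_height}
        support.update(seen)
    return {p: n for p, n in support.items() if n >= min_support}

# Toy forest (assumed data, for illustration only).
forest = [
    ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VBZ", "sleeps"))),
    ("S", ("NP", ("DT", "a"), ("NN", "dog")), ("VP", ("VBD", "barked"))),
    ("S", ("NP", ("PRP", "it")), ("VP", ("VBZ", "runs"))),
]
patterns = frequent_subtrees(forest)
```

On this toy forest, `patterns` holds two subtrees, each with support 2: the noun phrase `("NP", ("DT",), ("NN",))` and the verb phrase `("VP", ("VBZ",))`. A practical miner would also enumerate partial (induced or embedded) subtrees and, in the bilingual setting described above, would keep a pattern only when the aligned tree in the other language supports it.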
