Subject of Research: The paper considers the problem of automatic term extraction from natural language texts (text mining). One of the first-priority problems in this topic is creation of domain thesaurus. Some well approved methods of terms extraction exist for alphabetic languages, for instance, the latent semantic analysis. Applying of these methods for hieroglyphic texts is challenged because of missing blanks between words. The sentences segmentation task in hieroglyphic languages is usually solved by dictionaries or by statistical methods, particularly, by means of a mutual information approach. Methods of sentences segmentation, as methods of terms extraction, separately, do not reach 100 percent accuracy and fullness, and their consistent applying just increases a number of errors. The aim of this work is improving the fullness and accuracy of domain terms extraction from hieroglyphic texts. Method: The proposed method lies in detection of repeating two, three or four symbol sequences in each sentence and correlation of occurrence frequencies for these sequences in domain and contrast documents collection. According to research carried out it was stated that a trivial ranging of all possible symbol sequences enables to extract satisfactory only frequently using terms. Filtering of symbol sequences by their ratio of frequencies in the domain and contrast collection gave the possibility to extract reliably frequently used terms and find satisfactory rare domain terms. Some results of terms extraction for the Network technologies domain from a Chinese text are presented in this paper. A set of articles from the newspaper Renmin Ribao was used as a contrast collection and some satisfactory results were obtained.
Read full abstract