English corpus and literary analysis based on statistical language model

Bo Huang, Xijun Lan

doi:10.1007/s10586-018-2454-y

Abstract

This paper presents a systematic study of a cross-language retrieval model based on a statistical language model, together with cross-lingual text categorization and cross-lingual text clustering methods. Without relying on cross-lingual resources such as machine translation or bilingual dictionaries, the approach addresses the many-to-many word translation problem in cross-language information retrieval (CLIR) and partially addresses the problem of unregistered (out-of-vocabulary) words. Under a unified framework, a series of topics is extracted from bilingual parallel corpora to form a topic space for each language. The topic space of each language exists independently, and a bilingual topic space is established through bilingual semantic correspondence. The bilingual topic space reflects the semantic correspondence between documents and between words across the two languages, and reveals the inherent structure and internal relations between the languages.

Full Text