Abstract
Processing and understanding Chinese text is difficult for machine learning methods, because the basic unit of Chinese text is the phrase rather than the character, and Chinese text has no natural delimiter to separate phrases. Processing large volumes of Chinese Web text is harder still, because such texts are often short, irregular, sparse, weakly focused on a topic, and lacking in context. This poses a challenge for the mining, clustering, and classification of Chinese Web texts, and the accuracy with which the actual meaning of such texts is recognized is typically low. In this paper, we propose a method that recognizes stable, abstract semantic topics expressing the highly hierarchical relationships underlying Chinese texts from BaiduBaike. Based on these semantic topics, we then establish a discrete distribution model that converts the analysis into a convex optimization problem by means of geometric programming. Our experiments demonstrate that the proposed approach outperforms many conventional machine learning methods, such as KNN, SVM, WIKI, CRFs, and LDA, in recognizing short Chinese Web texts and in settings with very little training data.