On the unsupervised analysis of domain-specific Chinese texts.

Ke Deng, Peter K Bol, Kate J Li, Jun S Liu

doi:10.1073/pnas.1516510113

Abstract

With the growing availability of digitized text data, both public and private, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than those obtained using outputs of a supervised segmentation method.
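To make the segmentation problem concrete, the following is a minimal sketch of how unsegmented text can be split once each candidate word has a usage probability. It is not the TopWORDS implementation: it only illustrates a generic unigram (independent-word) model solved by dynamic programming. The function name `segment`, the `max_len` cap, and the toy probabilities are all assumptions made for illustration; in an unsupervised setting such probabilities would have to be estimated from the texts themselves rather than supplied by hand.

```python
import math


def segment(text, word_prob, max_len=4):
    """Return the most probable split of `text` under a unigram word model.

    `word_prob` maps candidate words to probabilities (hypothetical values).
    `max_len` caps the candidate word length, an illustrative assumption.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_prob.get(text[start:end])
            if p is None or best[start] == -math.inf:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this dictionary")
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Toy dictionary with made-up probabilities (not taken from the paper).
toy_prob = {
    "北京": 0.35, "大学": 0.35, "北京大学": 0.10,
    "北": 0.05, "京": 0.05, "大": 0.05, "学": 0.05,
}
print(segment("北京大学", toy_prob))  # ['北京', '大学']
```

With these toy numbers the two-word split wins because 0.35 × 0.35 = 0.1225 exceeds the 0.10 assigned to the single four-character word; changing the probabilities changes the preferred segmentation, which is why estimating them well matters.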

Journal: Proceedings of the National Academy of Sciences
Publication Date: May 16, 2016
Citations: 22

Similar Papers

A heuristic method based on a statistical approach for Chinese text segmentation
Christopher C Yang ... K W Li
Journal of the American Society for Information Science and Technology | VOL. 56 | 09 Sep 2005

Segmenting Chinese Unknown Words by Heuristic Method
Christopher C Yang ... K W Li
01 Jan 2003

Chinese text segmentation for text retrieval: Achievements and problems
Zimin Wu ... Gwyneth Tseng
Journal of the American Society for Information Science | VOL. 44 | 01 Oct 1993

Error analysis of Chinese text segmentation using statistical approach
Christopher C Yang ... Kar Wing Li
07 Jun 2004
