Abstract
Corpus-based, statistically oriented Chinese word classification can be regarded as a fundamental step for both automatic and non-automatic monolingual natural language processing systems. Word classification alleviates data sparseness and greatly reduces the number of model parameters. Much related work on word classification has been done, all of it based on some similarity metric. We use average mutual information as a global similarity metric for classification. The clustering process is top-down splitting, and a binary tree grows as the splitting proceeds. In natural language, the effects of a word's left neighbours and right neighbours are asymmetric. To exploit this directional information, we introduce a left-to-right and a right-to-left binary tree to represent this property, and we use probability to merge the classes resulting from the two trees. We then apply the resulting classes in a word-class-based language model. Sample classes and the perplexity of the word-class-based language model are presented.
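As a minimal sketch of the average-mutual-information metric named above (not the authors' implementation), the quantity summed over adjacent class pairs is I = Σ p(c₁,c₂) · log₂( p(c₁,c₂) / (p(c₁)·p(c₂)) ); the input format (a list of class-tagged bigrams) is an assumption for illustration:

```python
import math
from collections import Counter

def average_mutual_info(class_bigrams):
    """Average mutual information over adjacent class pairs:
    I = sum over (c1, c2) of p(c1,c2) * log2(p(c1,c2) / (p(c1)*p(c2))).
    `class_bigrams` is a list of (left_class, right_class) pairs observed
    in a class-tagged corpus (a hypothetical input format)."""
    n = len(class_bigrams)
    joint = Counter(class_bigrams)               # counts of (c1, c2) pairs
    left = Counter(c1 for c1, _ in class_bigrams)   # left-position marginals
    right = Counter(c2 for _, c2 in class_bigrams)  # right-position marginals
    ami = 0.0
    for (c1, c2), count in joint.items():
        p_joint = count / n
        p_left = left[c1] / n
        p_right = right[c2] / n
        ami += p_joint * math.log2(p_joint / (p_left * p_right))
    return ami
```

In a splitting-based clustering scheme, a split candidate can be scored by the change in this quantity: when the class of a word fully predicts its neighbour's class the metric is high, and when classes occur independently it is zero.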