Abstract

Existing data clustering methods do not consider the latent similarity among words, which leads to unsatisfactory clustering results. This paper proposes a semantics-based clustering algorithm for Chinese short message texts. It defines Chinese concepts and gives methods for measuring the similarity between words and between Chinese short message texts. Clustering of the short message texts is then carried out by splitting downward and merging pairwise upward. Experimental results show that the algorithm achieves better clustering quality than traditional algorithms.

Text clustering is a form of unsupervised machine learning. By analyzing text content, it divides texts into meaningful clusters such that the similarity within a cluster is as high as possible and the similarity between different clusters is as low as possible. The common text clustering algorithms today are mainly hierarchical methods, represented by the G-HAC algorithm, and flat partitioning methods, represented by the K-means algorithm. Much work on text clustering has been done at home and abroad, for example: a text clustering algorithm based on a semantic filtering model (1); a text clustering algorithm based on fuzzy concepts (2); a Web text clustering algorithm based on swarm intelligence (3); a text clustering algorithm based on semantic inner space (4); and an efficient text clustering algorithm that constructs a primitive concept tree from the hypernym-hyponym (up-down) relationship among primitives and clusters by chain splitting downward and pairwise merging upward (2). Reference (6) proposes a similarity calculation algorithm based on the HowNet model, but that algorithm applies only to the similarity between words and concepts and does not address text similarity.
This article analyzes texts from a semantic perspective: it first performs semantic disambiguation (7), represents each text as a set of keywords, calculates the similarity of words from the similarity of their non-weak primitives, and calculates the similarity of texts from the similarity of their words. Because the algorithm analyzes the similarity among texts semantically, its results better match human intuition.
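
The idea of deriving text similarity from word similarity can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `TOY_WORD_SIM` table stands in for a word similarity that the paper derives from HowNet primitives, English tokens stand in for Chinese keywords, and the averaged best-match formula in `text_similarity` is one common way to lift word similarity to sets of keywords, not necessarily the measure used in the paper.

```python
from typing import Callable, Iterable

# Hypothetical toy word-similarity table. A real system would compute these
# values from HowNet primitives; this sketch does not attempt that.
TOY_WORD_SIM = {
    frozenset(("meeting", "conference")): 0.9,
    frozenset(("tomorrow", "today")): 0.6,
}


def word_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1]; identical words score 1.0."""
    if a == b:
        return 1.0
    return TOY_WORD_SIM.get(frozenset((a, b)), 0.0)


def text_similarity(
    x: Iterable[str],
    y: Iterable[str],
    sim: Callable[[str, str], float] = word_similarity,
) -> float:
    """Symmetric average of each keyword's best match in the other text."""
    xs, ys = set(x), set(y)
    if not xs or not ys:
        return 0.0

    def directed(src: set, dst: set) -> float:
        # For every keyword in src, take its most similar keyword in dst,
        # then average those best-match scores.
        return sum(max(sim(w, v) for v in dst) for w in src) / len(src)

    return 0.5 * (directed(xs, ys) + directed(ys, xs))


if __name__ == "__main__":
    msg_a = ["meeting", "tomorrow"]
    msg_b = ["conference", "today"]
    msg_c = ["lunch"]
    print(text_similarity(msg_a, msg_b))
    print(text_similarity(msg_a, msg_c))
```

With a measure of this kind, two messages that share no literal keyword can still score as similar when their keywords are semantically close, which is the effect the semantic approach aims for.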
