Abstract

Vector representation, which quantifies text, is an important part of document clustering and classification. This paper presents a novel Co-occurrence Latent Semantic Vector Space Model (CLSVSM) and further studies the co-occurrence distribution. The model builds on the Vector Space Model (VSM) and embeds the co-occurrence latent semantics of the documents’ keywords into their vector representations. First, experiments were conducted to test the model's performance, using documents from the China National Knowledge Infrastructure (CNKI). The results showed that the Entropy (E), Purity (P), and F1 values of CLSVSM were 20% better than those of VSM in document clustering tests, which indicates that CLSVSM can improve clustering accuracy while reducing the sparsity of the vectors. Second, which estimator of the latent semantics is best: maximum (MAX), minimum (MIN), average (AVE), or median (MED)? Further experiments were performed to compare the four estimators. The results indicate that MAX and AVE are the preferred methods, while MIN is the worst, which agrees with the discussion. Finally, some essential questions are discussed concerning the trends of co-occurrence frequency, the function of co-occurrence intensity, and its distribution, which reinforce the model.
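For readers unfamiliar with the clustering measures named above, the following is a minimal sketch of how Purity and Entropy are commonly computed from cluster assignments and reference class labels. It is not taken from the paper: the function names `purity` and `entropy` and the toy data are hypothetical, and the paper may use a different normalization (for example, dividing entropy by the logarithm of the number of classes).

```python
import math
from collections import Counter


def _group_labels(clusters, labels):
    """Collect the reference labels of the documents in each cluster."""
    groups = {}
    for cluster_id, label in zip(clusters, labels):
        groups.setdefault(cluster_id, []).append(label)
    return groups


def purity(clusters, labels):
    """Fraction of documents that carry the majority label of their cluster.

    Higher is better; 1.0 means every cluster holds a single class.
    """
    groups = _group_labels(clusters, labels)
    majority_total = sum(max(Counter(g).values()) for g in groups.values())
    return majority_total / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of the label entropy (in bits) of each cluster.

    Lower is better; 0.0 means every cluster holds a single class.
    Assumption: no normalization by the number of classes is applied.
    """
    n = len(labels)
    total = 0.0
    for g in _group_labels(clusters, labels).values():
        size = len(g)
        h = -sum((k / size) * math.log2(k / size) for k in Counter(g).values())
        total += (size / n) * h
    return total


# Toy data (hypothetical): six documents, two clusters, two reference classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, labels), entropy(clusters, labels))
```

Under these definitions a better clustering has higher Purity and lower Entropy, so an improvement in both means the two values move in opposite directions.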
