Abstract

The TF-IDF (term frequency-inverse document frequency) algorithm extracts text features from word statistics. It treats two words as the same only when their surface forms (for example, their ASCII character strings) are identical across texts, and does not consider that a word may be represented by its synonyms. Treating words with the same or similar meanings as separate features loses part of the information when text features are extracted. Word representations therefore need to capture the similarity among words, and that similarity has to be derived from the meanings of words in the texts. To improve the accuracy of text feature extraction, this paper uses the word2vec model to train word vectors on the corpus and obtain their semantic features. After words with low TF-IDF values are excluded, a density clustering algorithm clusters the remaining words by word-vector similarity. As a result, similar words are grouped together and can represent one another. Experiments show that applying the TF-IDF algorithm again and constructing a VSM (vector space model) with these clusters as feature units effectively improves the accuracy of text feature extraction.
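
The pipeline described above can be illustrated with a minimal sketch. It is not the paper's implementation: the toy corpus, the hand-written 4-D vectors standing in for trained word2vec embeddings, the thresholds `min_score`, `eps`, and `min_pts`, the use of a word's maximum TF-IDF score as the filter criterion, and the small DBSCAN-style routine chosen as the density clustering algorithm are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """One {term: weight} dict per document; tf = count/len, idf = log(N/df)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: (c / len(d)) * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def density_cluster(words, vectors, eps, min_pts):
    """DBSCAN-style clustering on cosine distance; noise words become singletons."""
    def neighbors(w):
        return [x for x in words if 1 - cosine(vectors[w], vectors[x]) <= eps]
    labels, next_id = {}, 0
    for w in words:
        if w in labels:
            continue
        nbrs = neighbors(w)
        if len(nbrs) < min_pts:
            labels[w] = None              # provisional noise
            continue
        labels[w] = next_id
        queue = [x for x in nbrs if x != w]
        while queue:
            x = queue.pop()
            if labels.get(x) is None:     # unvisited, or previously marked noise
                unvisited = x not in labels
                labels[x] = next_id
                if unvisited:
                    x_nbrs = neighbors(x)
                    if len(x_nbrs) >= min_pts:
                        queue.extend(x_nbrs)   # x is a core point: keep expanding
        next_id += 1
    for w in words:                       # give each noise word its own cluster
        if labels[w] is None:
            labels[w] = next_id
            next_id += 1
    return labels

# Toy corpus (assumption): "car" and "automobile" are synonyms in different docs.
docs = [
    ["car", "engine", "fast", "the"],
    ["automobile", "engine", "repair", "the"],
    ["apple", "sweet", "fruit", "the"],
]
# Hypothetical stand-ins for word2vec vectors trained on the corpus.
vectors = {
    "car": (1.0, 0.1, 0.0, 0.0), "automobile": (0.98, 0.12, 0.0, 0.0),
    "engine": (0.5, 0.8, 0.0, 0.0), "fast": (0.0, 1.0, 0.0, 0.0),
    "repair": (0.0, 0.5, 0.5, 0.0), "apple": (0.0, 0.0, 1.0, 0.0),
    "sweet": (0.0, 0.0, 0.0, 1.0), "fruit": (0.0, 0.0, 0.6, 0.6),
}

# Step 1: first TF-IDF pass, then drop words whose best score is low.
scores = tf_idf(docs)
min_score = 0.05                          # assumed threshold
vocab = list(dict.fromkeys(t for d in docs for t in d))
kept = [w for w in vocab if max(s.get(w, 0.0) for s in scores) > min_score]

# Step 2: cluster the remaining words by word-vector similarity.
labels = density_cluster(kept, vectors, eps=0.05, min_pts=2)

# Step 3: second TF-IDF pass with cluster ids as the VSM feature units.
cluster_docs = [[labels[t] for t in d if t in labels] for d in docs]
features = tf_idf(cluster_docs)
```

In this toy run, "the" occurs in every document, so its IDF is zero and it is filtered out, while "car" and "automobile" fall into one cluster. Documents 0 and 1 then share a nonzero feature that plain word-level TF-IDF would have split into two unrelated dimensions. In practice the vectors would come from a trained word2vec model and the clustering from a library implementation.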
