Abstract

Automatic text classification is a research focus and core technology in information retrieval and natural language processing. Unlike traditional text classification methods (SVM, Bayesian, KNN), the class-center vector method is an important text classification method with the advantages of low computational cost and high efficiency. However, the traditional class-center vector method has the disadvantages that the class vector is large and sparse, and its classification accuracy is not high because it lacks semantic information. To overcome these problems, this paper proposes a novel class-center vector model for text classification using dependencies and a semantic dictionary. We use the WordNet English semantic dictionary and the Tongyici Cilin Chinese semantic dictionary, respectively, to cluster the English or Chinese feature words in the class-center vector and to significantly reduce its dimension, thereby realizing a new class-center vector for text classification using dependencies and a semantic dictionary. Experiments show that, compared with traditional text classification algorithms, the improved class-center vector method has lower time complexity and higher accuracy on the 20 Newsgroups English corpus and the Fudan and Sogou Chinese corpora. This paper is an improved version of our NLPCC 2019 conference paper.
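To make the basic idea concrete, below is a minimal sketch of plain class-center vector classification: each class is represented by the average of its documents' TF-IDF vectors, and a new document is assigned to the class whose center it is most similar to. The helper names (build_class_centers, classify) and the use of scikit-learn's TfidfVectorizer are illustrative assumptions; the sketch does not include the paper's dependency-based weights or dictionary-based clustering.

# Minimal sketch of class-center vector classification (hypothetical helper names;
# plain TF-IDF vectors, not the paper's improved weights).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_class_centers(X, labels):
    """Average the document vectors of each class into one class-center vector."""
    classes = sorted(set(labels))
    centers = np.vstack([X[[i for i, y in enumerate(labels) if y == c]].mean(axis=0)
                         for c in classes])
    return classes, np.asarray(centers)

def classify(X_test, classes, centers):
    """Assign each test document to the class whose center is most similar."""
    sims = cosine_similarity(X_test, centers)   # one similarity score per class
    return [classes[i] for i in sims.argmax(axis=1)]

# Toy usage
train_docs = ["the team won the match", "stocks fell on the market",
              "the player scored a goal", "investors sold shares"]
train_labels = ["sport", "finance", "sport", "finance"]
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
classes, centers = build_class_centers(X_train, train_labels)
print(classify(vec.transform(["the striker scored twice"]), classes, centers))  # -> ['sport']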

Highlights

  • With the rapid development and increasing popularity of Internet technology, electronic text information is expanding rapidly

  • (1) Building on the traditional TFIDF algorithm, we introduce dependencies, synonyms from the semantic dictionary, and part-of-speech information to understand and optimize the text features, and put forward an improved weight calculation method based on TFIDF. (2) We use the category nodes located in layers 6-9 of WordNet and the category codes marked with "#" in the Tongyici Cilin Extended Version, respectively, to cluster the English or Chinese feature words in the class-center vector and to significantly reduce the dimension of the class-center vector, thereby realizing a new class-center vector (see the WordNet sketch after this list)

  • After classifying the text features in the corpus according to dependencies, this paper proposes a TFIDF weight calculation method based on dependencies and the synonyms in the semantic dictionary
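As a rough illustration of the WordNet side of point (2), the following sketch maps an English feature word onto a hypernym node whose depth falls in the 6-9 range, using NLTK's WordNet interface. Taking the first noun synset and its first hypernym path is an assumption made for brevity; the paper's actual clustering procedure (and its handling of Tongyici Cilin for Chinese) may differ.

# Sketch: map an English feature word to a WordNet hypernym node at depth 6-9,
# so that related feature words collapse onto a shared category node.
# (Assumption: first noun synset, first hypernym path; requires nltk.download('wordnet').)
from nltk.corpus import wordnet as wn

def category_node(word, min_depth=6, max_depth=9):
    """Return the name of a hypernym synset whose depth lies in [min_depth, max_depth]."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return word                              # leave unknown words unchanged
    path = synsets[0].hypernym_paths()[0]        # root -> ... -> synset
    for node in path:
        if min_depth <= node.min_depth() <= max_depth:
            return node.name()
    return synsets[0].name()

print(category_node("dog"), category_node("cat"))  # both tend to map to the same animal-level node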


Summary

INTRODUCTION

With the rapid development and increasing popularity of Internet technology, electronic text information is expanding rapidly. Building on the traditional TFIDF algorithm, we introduce dependencies, synonyms from the semantic dictionary, and part-of-speech information to understand and optimize the text features, and put forward an improved weight calculation method based on TFIDF. After classifying the text features in the corpus according to dependencies, this paper proposes a TFIDF weight calculation method based on dependencies and the synonyms in the semantic dictionary. According to the result of the dependency syntactic analysis performed by the Stanford Parser, we obtain the sentence component of the jth (1 ≤ j ≤ m) occurrence of the feature word ti in the text, classify that component into level ki,j according to TABLE 2, and assign it a weight wi,j, calculated as wi,j = 2 cos(ki,j π / λ). In the TFIDF part of the weight, s denotes the total number of words in the text in which the feature word ti is located, D denotes the total number of texts in the corpus, and pi denotes the number of texts containing the feature word ti
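The weight computation described above can be sketched as follows. This is not the paper's implementation: the mapping from dependency relations to levels stands in for TABLE 2, λ = 8 is an arbitrary placeholder, and the IDF variant with add-one smoothing is a common choice rather than the paper's exact formula. Dependency relations are assumed to have been extracted already (e.g., by the Stanford Parser).

# Sketch: dependency-weighted TF-IDF for one feature word, under stated assumptions.
# LEVEL_OF_DEPREL and LAMBDA are placeholders; the paper's TABLE 2 defines the real mapping.
import math

LEVEL_OF_DEPREL = {"nsubj": 1, "dobj": 1, "root": 1,   # core arguments  -> level 1 (assumed)
                   "amod": 2, "advmod": 2,             # modifiers       -> level 2 (assumed)
                   "det": 3, "case": 3}                # function-like   -> level 3 (assumed)
LAMBDA = 8                                             # placeholder scaling parameter

def dependency_weight(deprels):
    """Sum w_{i,j} = 2*cos(k_{i,j} * pi / LAMBDA) over the occurrences of one feature word."""
    return sum(2 * math.cos(LEVEL_OF_DEPREL.get(rel, 3) * math.pi / LAMBDA)
               for rel in deprels)

def weighted_tfidf(deprels, s, D, p_i):
    """Combine the summed dependency weight with an IDF factor
    (s: words in the text, D: texts in the corpus, p_i: texts containing the feature word)."""
    tf = dependency_weight(deprels) / s
    idf = math.log(D / (p_i + 1)) + 1
    return tf * idf

# A feature word occurring twice, once as a subject and once inside a modifier:
print(weighted_tfidf(["nsubj", "amod"], s=120, D=20000, p_i=350))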

TFIDF WEIGHT IMPROVEMENT BASED ON PART-OF-SPEECH
CLASS-CENTER VECTOR CLUSTERING APPROACH BASED ON A SEMANTIC DICTIONARY
A NEW VECTOR SIMILARITY METHOD FOR CLUSTERED CLASS-CENTER VECTORS
Findings
CONCLUSION