Abstract
Unknown words impede the performance of many natural language processing systems. In financial public opinion analysis, the wide variety of financial terminology, the suddenness of public opinion events, and the proliferation of internet vocabulary produce a steady stream of new, unknown words that traditional text classification methods are largely unable to handle. This paper proposes an approach built on a new directional substitution model for processing unknown words. Using the contextual semantic similarity captured by the word2vec algorithm, the model builds a synonym substitution list and replaces unknown words in the original texts with their synonyms. A TF-IDF (term frequency-inverse document frequency) weighted Naive Bayes classifier is then used to run text classification experiments on traditional datasets and on synthetic datasets that include unknown words. The experimental results show that the model classifies better than the traditional methods and can accurately identify the categories of financial public opinion texts containing unknown words, because it transforms otherwise meaningless unknown words into meaningful ones.
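The substitution step can be illustrated with a minimal sketch. It is not the paper's implementation: the toy vectors in `EMBEDDINGS` are hypothetical stand-ins for trained word2vec embeddings, the function names `build_substitution_list` and `substitute_unknown` are invented for illustration, and "directional" is assumed here to mean that replacement only goes from a word outside the classifier's vocabulary to a word inside it. The TF-IDF weighted Naive Bayes classifier is omitted.

```python
import math

# Hypothetical toy vectors standing in for trained word2vec embeddings.
EMBEDDINGS = {
    "stock": [0.9, 0.1, 0.0],
    "share": [0.85, 0.15, 0.05],
    "loan": [0.1, 0.9, 0.1],
    "credit": [0.15, 0.85, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_substitution_list(embeddings, known_vocab):
    """Map each embedded word outside known_vocab to its most similar known word.

    Assumption: substitution is one-way (unknown -> known), so words already
    in known_vocab never appear as keys.
    """
    candidates = [w for w in known_vocab if w in embeddings]
    table = {}
    for word, vec in embeddings.items():
        if word in known_vocab or not candidates:
            continue
        table[word] = max(candidates, key=lambda k: cosine(vec, embeddings[k]))
    return table


def substitute_unknown(tokens, known_vocab, table):
    """Replace tokens missing from known_vocab with their listed synonym.

    Tokens with no entry in the table are left unchanged.
    """
    return [t if t in known_vocab else table.get(t, t) for t in tokens]


if __name__ == "__main__":
    vocab = {"stock", "loan", "price", "fell"}
    table = build_substitution_list(EMBEDDINGS, vocab)
    print(substitute_unknown(["share", "price", "fell"], vocab, table))
```

In this sketch the rewritten token list, rather than the original text, would be passed to the classifier, so a word it has never seen contributes the evidence of its nearest known neighbor.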