Abstract

To address the lack of accurate and efficient off-topic detection algorithms in systems for assisted review of English compositions, this paper presents an off-topic detection algorithm that combines LDA and word2vec. The algorithm models the documents with LDA and trains word vectors on them with word2vec, then uses the semantic relationship between a document's topics and words to compute a probability-weighted sum over each topic and its feature words, and finally identifies off-topic compositions by applying a suitable threshold. The number of topics is varied, the F value is measured for each setting, and the best number of topics is selected. Experimental results show that the proposed method is more effective than the vector space model: it detects more off-topic compositions with higher accuracy, and its F value exceeds 88%. The method automates off-topic detection for compositions and can be applied effectively in English composition teaching.

Highlights

  • The most commonly used and classic text representation model is the vector space model, and the TF-IDF algorithm based on the vector space model is the most widely used method to calculate the text similarity

  • Here, $z_i$ denotes the topic variable corresponding to the $i$-th word; $\neg i$ means that the $i$-th word is excluded, so $z_{\neg i}$ denotes the topic distribution of all other words $z_k$ ($k \neq i$); $z^{(t)}_{k,\neg i}$ is the number of times the feature word $t$ is assigned to topic $k$; $z^{(k)}_{m,\neg i}$ is the number of feature words in document $m$ assigned to topic $k$

  • The compositions flagged as off-topic in the experiment are compared with those marked off-topic by human graders, and the algorithm is evaluated on accuracy rate, recall rate and F value to verify its effectiveness and practicality
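The evaluation described in the last highlight can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation script: it assumes that "accuracy rate" means precision over the detected set and that the F value is the balanced harmonic mean of precision and recall. The function name `evaluate` and the composition IDs are hypothetical.

```python
def evaluate(detected, manual):
    """Compare detected off-topic compositions with manually graded ones.

    Assumption: "accuracy rate" is treated as precision, and the F value
    is the balanced F-measure (harmonic mean of precision and recall).
    """
    detected, manual = set(detected), set(manual)
    true_positives = len(detected & manual)

    # Share of detected compositions that graders also marked off-topic.
    precision = true_positives / len(detected) if detected else 0.0
    # Share of manually marked off-topic compositions that were detected.
    recall = true_positives / len(manual) if manual else 0.0

    if precision + recall == 0:
        f_value = 0.0
    else:
        f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical composition IDs, for illustration only.
detected_ids = {"c01", "c04", "c07", "c09"}
manual_ids = {"c01", "c04", "c07", "c12", "c15"}

precision, recall, f_value = evaluate(detected_ids, manual_ids)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_value:.2f}")
```

Running the same function for each candidate number of topics, and keeping the setting with the highest F value, mirrors the way the abstract describes choosing the number of topics.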

Summary

Introduction

Composition is an important means of expressing emotion and conveying information, and the theme is the soul of a composition. The most commonly used and classic text representation model is the vector space model, and the TF-IDF algorithm based on the vector space model is the most widely used method for calculating text similarity. This method weights a word by its frequency in the document and its frequency across the document collection. The English words "like" and "love", for example, both express liking, but the vector space model treats them as two separate lexical items. To address this disadvantage, some researchers have proposed word-expansion methods, for example expanding words with dictionaries such as WordNet and HowNet. To overcome the deficiencies of the above methods, a new text similarity calculation method is proposed and applied to off-topic detection in English compositions. How effective is the off-topic detection method based on LDA and word2vec compared with the vector space model-based method?
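The limitation of the vector space model noted above can be illustrated with a small, self-contained Python sketch. It is not the paper's baseline implementation: the TF-IDF variant used here (relative term frequency times the logarithm of inverse document frequency), the helper names `tfidf_vectors` and `cosine`, and the three toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, for illustration only.
docs = [
    "i like music".split(),
    "i love music".split(),
    "the train was late".split(),
]
vecs = tfidf_vectors(docs)

# "like" and "love" occupy separate dimensions, so they add nothing to the
# dot product even though the two sentences mean almost the same thing.
sim_like_love = cosine(vecs[0], vecs[1])
sim_like_train = cosine(vecs[0], vecs[2])
print(sim_like_love, sim_like_train)
```

In this toy corpus the first two sentences overlap only through "i" and "music", so their similarity stays well below 1 although they are near paraphrases. This gap is what dictionary-based word expansion and, in this paper, word2vec vectors are meant to close.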

LDA model
Gibbs sampling
LDA modeling process
Topic correlation calculation based on LDA and word2vec
Word2vec
Calculation of subject correlation
Off-topic detection algorithm
Experimental results and comparative analysis
Conclusion
