To help non-native English speakers master English vocabulary quickly and improve their reading, writing, listening, speaking, and communication skills, this study designs, constructs, and refines an English vocabulary learning model that integrates Spiking Neural Network (SNN) and Convolutional Long Short-Term Memory (Conv LSTM) algorithms. Fusing the two exploits the strengths of the SNN in processing temporal information and of the Conv LSTM in modeling sequence data, yielding a fusion model that performs well in English vocabulary learning. Added information transfer and interaction modules optimize feature learning and temporal information processing, improving the model's vocabulary learning ability across different text contents. The training set is an open data set drawn from the WordNet and Oxford English Corpus corpora. The model is implemented as a computer program and can be applied in an English learning application, an online vocabulary learning platform, or language education software. The experiments use the open data set to generate test sets with text volumes ranging from 100 to 4000. The performance indicators of the proposed fusion model are compared with those of five traditional models, and the model is applied to the latest vocabulary exercises. From the learner's perspective, ten indicators are considered: model accuracy, loss, polysemy processing accuracy, training time, syntactic structure capturing accuracy, vocabulary coverage, F1-score, context understanding accuracy, word sense disambiguation accuracy, and word order relation processing accuracy. The experimental results reveal that the fusion model performs better than the traditional models across the different text sizes.
For text volumes of 100-400, the accuracy is 0.75-0.77, the loss is below 0.45, the F1-score is above 0.75, the training time is within 300 s, and the other performance indicators exceed 65%. For text volumes of 500-1000, the accuracy is 0.81-0.83, the loss is at most 0.40, the F1-score is at least 0.78, the training time is within 400 s, and the other performance indicators exceed 70%. For text volumes of 1500-3000, the accuracy is 0.82-0.84, the loss is below 0.28, the F1-score is at least 0.78, the training time is within 600 s, and the remaining performance indicators exceed 70%. The fusion model also adapts to various question types in practical application. As evaluated by professional teachers, its average scores on the choice, fill-in-the-blank, spelling, matching, exercises, and synonyms question types are 85.72, 89.45, 80.31, 92.15, 87.62, and 78.94, respectively, much higher than those of the traditional models. These results show that the fusion model's performance improves gradually as text volume increases, with higher accuracy and lower loss. In practical application, the proposed fusion model is effective on English learning tasks and offers greater benefits to people unfamiliar with English vocabulary structure, grammar, and question types. This study aims to provide efficient and accurate natural language processing tools that help non-native English speakers understand and apply the language more easily and improve their English vocabulary learning and comprehension.
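For readers unfamiliar with the reported indicators, accuracy and F1-score can be computed from predicted and reference labels as in the minimal sketch below. It assumes a binary labeling of each test item (1 = the target label, 0 = any other), which the study does not specify; the function names `accuracy` and `precision_recall_f1` are hypothetical and are not taken from the study's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label equals the reference label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class (assumed binary setup)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Illustrative labels only; these are not data from the study.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)
prec, rec, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, it stays low when either one is low, which is why it is usually reported alongside plain accuracy.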