In recent years, speech recognition based on deep learning models has made great progress, with recognition accuracy now exceeding 90%. Speech evaluation is an important application in foreign language learning, where billions of learners need effective pronunciation practice. However, because speech recognition and speech evaluation have different goals, a speech recognition model alone cannot be applied directly to pronunciation evaluation. This paper proposes a double-layer deep neural network (DDNN) model that comprises a speech-text alignment model and a speech recognition model. In the first layer, the speech alignment model, a new Viterbi-based method is proposed to find the best path for aligning speech with text. In the second layer, which performs speech evaluation and scoring, we are the first to use a CNN (convolutional neural network) and an RNN (recurrent neural network) in the encoder of the attention model. The CTC (connectionist temporal classification) model reaches an accuracy of 76.7%, and the attention model reaches 81.2%. The experimental results show that the speech-text alignment method is effective and that the attention-based model gives better evaluation results. The FRR (false rejection rate), FAR (false acceptance rate), and DER (diagnostic error rate) of the attention model are 4.5%, 5.1%, and 17.9%, respectively. In addition, the DDNN model evaluates each sentence within 1 second in the online experiment, so it can also be applied to online real-time pronunciation evaluation.
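The three reported metrics can be illustrated with a minimal sketch. The abstract does not define them, so the definitions below are an assumption, following the convention commonly used in mispronunciation detection: FRR is the share of correct pronunciations flagged as wrong, FAR is the share of mispronunciations accepted as correct, and DER is the share of detected mispronunciations whose diagnosis is wrong. The `Counts` class, its field names, and the toy counts are hypothetical and are not taken from the paper's experiments.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical phone-level tallies for one evaluation run."""
    true_accept: int    # correct pronunciation, judged correct
    false_reject: int   # correct pronunciation, judged mispronounced
    false_accept: int   # mispronunciation, judged correct
    correct_diag: int   # mispronunciation detected, diagnosed phone is right
    diag_error: int     # mispronunciation detected, diagnosed phone is wrong


def rate(numerator: int, denominator: int) -> float:
    """Safe ratio: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def frr(c: Counts) -> float:
    # Assumed definition: false rejections over all correct pronunciations.
    return rate(c.false_reject, c.true_accept + c.false_reject)


def far(c: Counts) -> float:
    # Assumed definition: false acceptances over all mispronunciations
    # (accepted ones plus detected ones, whatever the diagnosis).
    detected = c.correct_diag + c.diag_error
    return rate(c.false_accept, c.false_accept + detected)


def der(c: Counts) -> float:
    # Assumed definition: wrong diagnoses over all detected mispronunciations.
    return rate(c.diag_error, c.correct_diag + c.diag_error)


# Toy counts chosen for easy arithmetic only.
toy = Counts(true_accept=90, false_reject=10,
             false_accept=5, correct_diag=12, diag_error=3)
print(f"FRR={frr(toy):.3f} FAR={far(toy):.3f} DER={der(toy):.3f}")
```

Under these assumed definitions, FRR and FAR trade off against each other as the acceptance threshold moves, while DER depends only on the phones that were actually rejected, which is why the three figures are reported separately.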