Abstract

As speech synthesis technology matures, it is used more and more widely in daily life and brings increasing convenience, and the requirements placed on speech synthesis systems keep rising. Advanced technology is therefore used to improve and update the accent recognition system. This paper mainly introduces word stress annotation technology combined with neural network speech synthesis technology. In Chinese speech synthesis, prosodic structure prediction has a great influence on naturalness, so accurately predicting the prosodic structure has become an important problem to be solved, and that is the purpose of this paper. Experimental data show that the average sample error during network training is |e|/85 and that the minimum training error after 500 steps is 0.00013127, so the final sample average error is |e| = 85 ∗ 0.0013127 = 0.112 < 0.5. A deep neural network (DNN) is used to train different parameters to obtain conversion models, these conversion models are then combined, and the synthesized sound quality is improved as a result.

Highlights

  • The Internet is changing the teaching methods of teachers

  • There are still many applications under research. These applications are very attractive, which shows the great potential of neural network applications

  • Neural network technology is especially good at handling applications that have a large number of samples [7]; the problem of pronouncing English words is solved by using complete learning samples. The problem involves phonemes and phonetic symbols. The entire learning sample covers tens of thousands of English words


Summary

Classification and Basic Composition of Speech Recognition System

The window functions commonly used in speech signal processing are the rectangular window and the Hamming window, whose expressions are given in equation (6), where N is the frame length. Rectangular window: W(n) = 1 for 0 ≤ n ≤ N − 1, and W(n) = 0 otherwise. The function of the short-term average energy is to distinguish between voiced and unvoiced sounds in speech signals. The speech signal is divided into several frames and windowed, the energy of each frame (the amplitude of each frame) is calculated, and the calculated energy is called the short-term average energy. The short-term average energy E(i) of the speech signal in frame i can be obtained by one of three algorithms, in which N is the frame length and x_i(n) is the amplitude of the speech signal at the nth point. The error reaches the fault tolerance range.
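
The framing, windowing, and per-frame energy steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the Hamming coefficients (0.54 and 0.46) are the standard textbook definition, which this excerpt names but does not spell out, and the three energy variants (sum of squares, sum of absolute values, sum of log squares) are commonly cited choices assumed here because the excerpt does not list its three algorithms. The function names and the `hop` parameter are hypothetical.

```python
import math


def rectangular_window(N):
    """Rectangular window: W(n) = 1 for 0 <= n <= N - 1."""
    return [1.0] * N


def hamming_window(N):
    """Standard Hamming window (assumed form; not quoted from the paper)."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def split_frames(signal, N, hop):
    """Cut the signal into frames of length N, advancing `hop` samples each time.

    A trailing partial frame is dropped.
    """
    return [signal[s:s + N] for s in range(0, len(signal) - N + 1, hop)]


def short_term_energy(frame, window, method="square"):
    """Energy of one windowed frame.

    The three methods are assumed variants, not taken from the paper:
      "square": sum of x^2
      "abs":    sum of |x|
      "log":    sum of log(x^2), with a small offset to avoid log(0)
    """
    xs = [x * w for x, w in zip(frame, window)]
    if method == "square":
        return sum(x * x for x in xs)
    if method == "abs":
        return sum(abs(x) for x in xs)
    if method == "log":
        eps = 1e-12
        return sum(math.log(x * x + eps) for x in xs)
    raise ValueError(f"unknown method: {method}")


if __name__ == "__main__":
    N, hop = 8, 8
    # Synthetic signal: a low-amplitude stretch followed by a high-amplitude one.
    quiet = [0.01 * math.sin(0.9 * n) for n in range(16)]
    loud = [math.sin(0.3 * n) for n in range(16)]
    window = hamming_window(N)
    for i, frame in enumerate(split_frames(quiet + loud, N, hop)):
        print(i, short_term_energy(frame, window))
```

In this sketch the high-amplitude frames yield a much larger energy than the low-amplitude ones, which is the property the text relies on to separate voiced from unvoiced segments. Any decision threshold would have to be chosen from data.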

