Abstract

Training procedures for deep neural networks remain an active research area, with ongoing work to improve either their efficiency or their time performance. One of the less-studied components is the objective function, which becomes important when lower error rates are needed in automatic speech recognition. This paper presents two new variations of the frame-level cost function for training a deep neural network, with the goal of obtaining lower word error rates in speech recognition, applied to a case study in Spanish. The first proposed function is a fusion of the boosted cross-entropy and the so-called cross-entropy/log-posterior-ratio. The main idea is to jointly emphasize the prediction of difficult or crucial frames through a boosting factor and, at the same time, enlarge the distance between the target senone and its closest competitor. The second proposal is a fusion of the non-uniform mapped cross-entropy and the cross-entropy/log-posterior-ratio. This function uses the mapped function to emphasize frames whose assignment to a specific senone is ambiguous, and the log-posterior-ratio to separate the target senone from the most competitive tied triphone state. The proposed approaches are compared against the frame-level cost functions discussed in the state of the art. The comparison uses a personalized mid-vocabulary, speaker-independent voice corpus. This dataset is employed for the recognition of digit strings and personal name lists in Spanish from the north-central part of Mexico on a connected-words phone dialing task. Relative word error rate improvements of 15.14% and 12.30% are obtained with the two proposed approaches, respectively, over the plain, well-established cross-entropy loss function.
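To make the idea behind the first proposal concrete, the following is a minimal sketch of a per-frame loss that combines a boosting factor with a log-posterior-ratio term. The abstract does not give the formulas, so everything here is an assumption made for illustration: the boosting factor is taken as `1 - p_target`, the ratio term as `log p_target - log p_competitor`, and the weight `lam` and the function names are hypothetical. It should not be read as the paper's exact formulation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's senone scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def fused_frame_loss(logits, target, lam=0.1):
    """Illustrative fusion of a boosted cross-entropy and a log-posterior-ratio.

    Assumed form (not taken from the paper):
        loss = -(1 - p_t) * log p_t  -  lam * (log p_t - log p_c)
    where p_t is the posterior of the target senone and p_c is the posterior
    of its closest competitor (the highest-scoring non-target senone).
    Requires at least two senones.
    """
    logp = log_softmax(logits)
    logp_t = logp[target]
    # Closest competitor: the best log-posterior among non-target senones.
    logp_c = max(lp for j, lp in enumerate(logp) if j != target)
    boost = 1.0 - math.exp(logp_t)   # large when the frame is hard
    lpr = logp_t - logp_c            # margin between target and competitor
    return -boost * logp_t - lam * lpr


def mean_loss(batch_logits, targets, lam=0.1):
    """Average the per-frame loss over a batch of frames."""
    losses = [fused_frame_loss(l, t, lam) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)
```

Under these assumptions, a frame whose target senone already has a high posterior contributes almost nothing through the boosted term, while the ratio term keeps rewarding a larger margin over the closest competitor. Setting `lam=0` recovers the assumed boosted cross-entropy alone.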
