Abstract
Most mainstream Automatic Speech Recognition (ASR) systems based on Mel-frequency cepstral coefficients (MFCCs) treat all feature frames as equally important. Acoustic landmark theory disagrees with this idea. It exploits the quantal non-linear articulatory-acoustic relationships from human speech perception experiments and provides a theoretical basis for extracting acoustic features in the vicinity of landmarks, the regions where the spectrum of the speech signal changes abruptly. In this work, we conducted experiments on the TIMIT corpus with both GMM-based and DNN-based ASR systems and found that frames containing landmarks are more informative than other frames during recognition. We show that altering the emphasis on landmark and non-landmark frames, by re-weighting or removing the corresponding frame acoustic likelihoods, changes the phone error rate (PER) of the ASR system in a way dramatically different from making similar changes to random frames. Furthermore, using landmarks as a heuristic, one of our hybrid DNN frame-dropping strategies incurred a PER increment of only 0.44% while scoring less than half of the frames (41.2%). This hybrid strategy outperforms the other, non-heuristic methods and demonstrates the potential of landmarks for reducing computation in ASR.
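To make the idea of re-weighting or removing frame acoustic likelihoods concrete, the following is a minimal sketch and not the implementation used in the experiments. It assumes per-frame acoustic log-likelihoods stored as a list of lists (one row per frame, one value per state) and a known list of landmark frame indices. The function names, the `radius` parameter, and the choice to model a dropped frame as a weight of zero are all hypothetical and introduced only for illustration.

```python
import random


def make_landmark_mask(num_frames, landmark_indices, radius=1):
    """Mark frames within `radius` frames of any landmark (hypothetical helper)."""
    mask = [False] * num_frames
    for idx in landmark_indices:
        lo = max(0, idx - radius)
        hi = min(num_frames, idx + radius + 1)
        for t in range(lo, hi):
            mask[t] = True
    return mask


def reweight(log_likes, mask, marked_weight=1.0, other_weight=0.0):
    """Scale each frame's log-likelihoods by a weight chosen from the mask.

    Assumption: a weight of 0.0 stands in for dropping the frame, so that
    frame contributes nothing to any path score in the decoder.
    """
    return [
        [(marked_weight if marked else other_weight) * value for value in frame]
        for frame, marked in zip(log_likes, mask)
    ]


def random_mask_like(mask, seed=0):
    """Baseline: mark the same number of frames, chosen uniformly at random."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(mask)), sum(mask)))
    return [t in chosen for t in range(len(mask))]


# Toy input: 6 frames, 2 states, one landmark at frame 2.
log_likes = [[-1.0, -2.0] for _ in range(6)]
landmark_mask = make_landmark_mask(len(log_likes), [2], radius=1)

# Keep landmark frames, drop the rest; then do the same for random frames.
landmark_scored = reweight(log_likes, landmark_mask)
random_scored = reweight(log_likes, random_mask_like(landmark_mask))
```

Decoding with `landmark_scored` and with `random_scored` and comparing the resulting PER mirrors the landmark-versus-random comparison described above. Setting `marked_weight` above 1.0 instead emphasizes the marked frames rather than dropping the others.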