Abstract
Non-audible murmur (NAM) is an unvoiced speech signal that can be received through body tissue by special acoustic sensors (i.e., NAM microphones) attached behind the talker's ear. The authors have previously reported experimental results for NAM recognition using a stethoscopic and a silicon NAM microphone. Using a small amount of training data from a single speaker and adaptation approaches, a word accuracy of 93.9% was achieved on a 20k-vocabulary Japanese dictation task. In this paper, NAM speech is analyzed further using distance measures between hidden Markov models (HMMs). It is shown that, owing to the reduced spectral space of NAM speech, the HMM distances are also reduced compared with those of normal speech. For Japanese vowels and fricatives, the distance measures in NAM speech follow the same relative inter-phoneme relationships as in normal speech, without significant differences. Significant differences are found, however, for Japanese plosives: in NAM speech, the distances between voiced/unvoiced consonant pairs with the same place of articulation decrease drastically. As a result, the inter-phoneme relationships change significantly compared with normal speech, causing a substantial decrease in recognition accuracy. A speaker-dependent phoneme recognition experiment was conducted, yielding 81.5% phoneme correctness for NAM and showing a relationship between the HMM distance measures and phoneme accuracy. In a NAM microphone, body transmission and the loss of lip radiation act as a low-pass filter, so the higher-frequency components of the NAM signal are attenuated. Because of this spectral reduction, NAM's unvoiced nature, and the type of articulation, NAM sounds become more similar to one another, causing more confusions than in normal speech. Many of these sounds, however, are visually distinct on the face, mouth, and lips, and integrating visual information improves their discrimination and, consequently, recognition accuracy. In this article, visual information extracted from the talker's facial movements is fused with NAM speech. The experimental results show an average relative improvement of 10.5% when NAM speech is fused with facial information, compared with using NAM speech alone.
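To illustrate the kind of inter-model distance the abstract refers to, the sketch below shows a Monte Carlo, symmetrized log-likelihood divergence between two toy Gaussian HMMs, in the spirit of standard HMM distance measures (e.g., Juang-Rabiner). This is not the authors' procedure or their phoneme models; the diagonal-covariance emissions, the class and function names (DiagGaussianHMM, hmm_distance), and all parameter values are illustrative assumptions only.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)


class DiagGaussianHMM:
    """Toy HMM with diagonal-covariance Gaussian emissions (illustrative only)."""

    def __init__(self, startprob, transmat, means, variances):
        self.startprob = np.asarray(startprob, float)   # (S,)
        self.transmat = np.asarray(transmat, float)     # (S, S)
        self.means = np.asarray(means, float)           # (S, D)
        self.variances = np.asarray(variances, float)   # (S, D)

    def sample(self, length):
        """Draw one observation sequence of `length` frames."""
        n_states, dim = self.means.shape
        obs = np.empty((length, dim))
        state = rng.choice(n_states, p=self.startprob)
        for t in range(length):
            obs[t] = rng.normal(self.means[state], np.sqrt(self.variances[state]))
            state = rng.choice(n_states, p=self.transmat[state])
        return obs

    def _log_emission(self, obs):
        # (T, S) matrix of per-frame, per-state Gaussian log-densities.
        diff = obs[:, None, :] - self.means[None, :, :]
        return -0.5 * np.sum(np.log(2.0 * np.pi * self.variances)
                             + diff ** 2 / self.variances, axis=-1)

    def loglikelihood(self, obs):
        """log P(obs | model) via the forward algorithm in the log domain."""
        log_b = self._log_emission(obs)
        log_a = np.log(self.transmat)
        log_alpha = np.log(self.startprob) + log_b[0]
        for t in range(1, len(obs)):
            log_alpha = log_b[t] + logsumexp(log_alpha[:, None] + log_a, axis=0)
        return logsumexp(log_alpha)


def hmm_distance(model_1, model_2, length=2000):
    """Symmetrized per-frame Monte Carlo divergence between two HMMs."""
    obs_1, obs_2 = model_1.sample(length), model_2.sample(length)
    d12 = (model_1.loglikelihood(obs_1) - model_2.loglikelihood(obs_1)) / length
    d21 = (model_2.loglikelihood(obs_2) - model_1.loglikelihood(obs_2)) / length
    return 0.5 * (d12 + d21)


if __name__ == "__main__":
    # Two hypothetical "phoneme" model pairs; the second pair has emission means
    # pulled closer together, mimicking the compressed spectral space of NAM.
    pi = np.array([0.5, 0.5])
    A = np.array([[0.9, 0.1], [0.1, 0.9]])
    var = np.ones((2, 3))
    normal_a = DiagGaussianHMM(pi, A, [[0, 0, 0], [3, 3, 3]], var)
    normal_b = DiagGaussianHMM(pi, A, [[2, 2, 2], [5, 5, 5]], var)
    nam_a = DiagGaussianHMM(pi, A, [[0, 0, 0], [1.5, 1.5, 1.5]], var)
    nam_b = DiagGaussianHMM(pi, A, [[0.7, 0.7, 0.7], [2.2, 2.2, 2.2]], var)
    print("normal-speech-like pair:", hmm_distance(normal_a, normal_b))
    print("NAM-like pair:          ", hmm_distance(nam_a, nam_b))
```

Under these assumed parameters, the NAM-like pair yields a smaller divergence than the normal-speech-like pair, illustrating how a compressed spectral space reduces inter-model distances and thereby increases confusability.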