Abstract

In this paper, we tackle the problem of speech enhancement from two fronts: speech modeling and multisensory input. We present a new speech model based on statistics of magnitude-normalized complex spectra of speech signals. By performing magnitude normalization, we are able to get rid of huge intra- and inter-speaker variation in speech energy and to build a better speech model with a smaller number of Gaussian components. To deal with real-world problems with multiple noise sources, we propose to use multiple heterogeneous sensors, and in particular, we have developed microphone headsets that combine a conventional air microphone and a bone sensor. The bone sensor makes direct contact with the speaker’s temple (area behind the ear), and captures the vibrations of the bones and skin during the process of vocalization. The signals captured by the bone microphone, though distorted, contain useful audio information, especially in the low frequency range, and more importantly, they are very robust to external noise sources (stationary or not). By fusing the bone channel signals with the air microphone signals, much improved speech signals have been obtained.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.