Abstract
Computational auditory scene analysis is increasingly presented in the literature as a set of auditory-inspired techniques for estimating “Ideal Binary Masks” (IBMs), i.e., time-frequency segregations of the attended source and the acoustic background based on a local signal-to-noise ratio objective (Wang and Brown, 2006). This talk argues that although IBMs may be a useful stand-in when evaluating signal-processing systems, they can provide a misleading perspective when considering models of auditory cognition. First, there is no evidence that human cognition computes or requires an explicit binary mask representation (ideal or otherwise). Second, evaluating an IBM requires artificially mixed acoustic scenes, since only these provide access to the ground-truth mask; systems that work well on artificial mixtures may fail to generalize to real data. The danger of predicting real performance from results obtained on artificial mixtures is apparent in an analysis of systems submitted to the recent CHiME distant-microphone speech recognition challenges, which evaluate on both types of data (http://spandh.dcs.shef.ac.uk/chime). It is argued that, rather than presuming specific internal representations, auditory scene analysis systems are best evaluated by direct comparison of human and machine percepts, e.g., in the case of a speech recognition task, by comparing human and machine transcriptions at the phonetic level.
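For concreteness, the following is a minimal sketch of the local-SNR definition of the IBM that the abstract refers to, assuming Python with NumPy/SciPy; the function name ideal_binary_mask, the 512-sample STFT window, and the 0 dB threshold are illustrative assumptions, not details from the talk. It also makes the abstract's second point explicit: the mask can only be computed when the target and noise are available separately, i.e., for an artificial mixture.

```python
import numpy as np
from scipy.signal import stft

def ideal_binary_mask(target, noise, fs=16000, snr_threshold_db=0.0):
    """Ideal binary mask under the local-SNR criterion (Wang and Brown, 2006):
    keep a time-frequency unit (1) when the local SNR of target over noise
    exceeds the threshold, discard it (0) otherwise. Computing the mask
    requires the target and noise separately, hence an artificial mix."""
    _, _, T = stft(target, fs=fs, nperseg=512)  # target spectrogram
    _, _, N = stft(noise, fs=fs, nperseg=512)   # noise spectrogram
    eps = 1e-12  # guard against division by / log of zero
    local_snr_db = 10.0 * np.log10((np.abs(T) ** 2 + eps) / (np.abs(N) ** 2 + eps))
    return (local_snr_db > snr_threshold_db).astype(np.float32)

# Toy usage: a 440 Hz tone as "target" mixed against white noise.
fs = 16000
t = np.arange(fs) / fs
mask = ideal_binary_mask(np.sin(2 * np.pi * 440 * t),
                         0.1 * np.random.randn(fs), fs=fs)
print(mask.shape, mask.mean())  # fraction of units assigned to the target
```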
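As one concrete reading of the proposed evaluation, the sketch below scores a machine phone transcription against a human one by normalized edit distance. The helper phone_error_rate and the ARPAbet-style example sequences are hypothetical illustrations, not part of the abstract, which does not prescribe a particular metric.

```python
def phone_error_rate(reference, hypothesis):
    """Levenshtein (edit) distance between two phone sequences, normalized
    by reference length: one way to compare human and machine
    transcriptions at the phonetic level."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # cost of deleting all reference phones up to i
    for j in range(n + 1):
        d[0][j] = j  # cost of inserting all hypothesis phones up to j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n] / max(m, 1)

# Example: human vs. machine phone transcriptions of the same utterance.
human   = ["sh", "iy", "hh", "ae", "d"]
machine = ["sh", "iy", "hh", "eh", "d"]
print(phone_error_rate(human, machine))  # 0.2 (one substitution in five)
```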