This study presents a framework for a robot audition system, including sound source localization (SSL) and sound source separation (SSS), that can robustly recognize simultaneous speech in a real environment. Because SSL estimates not only the locations of speakers but also their number, a robust SSL framework is essential for simultaneous speech recognition. Improving SSS performance is equally crucial because the robot has to recognize each individual speech source. For simultaneous speech recognition, current robot audition systems mainly require noise robustness, high resolution, and real-time implementation. Multiple signal classification (MUSIC) based on standard eigenvalue decomposition (SEVD) and geometric-constrained high-order decorrelation-based source separation (GHDSS) are microphone-array processing techniques used for SSL and SSS, respectively. To enhance SSL robustness against noise while detecting simultaneous speech, we improved SEVD-MUSIC by incorporating generalized eigenvalue decomposition (GEVD). However, GEVD-based MUSIC (GEVD-MUSIC) and GHDSS have two main issues: (1) the resolution of the pre-measured transfer functions (TFs) determines the resolution of SSL and SSS, and (2) their computational cost is too high for real-time processing. For the first issue, we propose a TF-interpolation method that integrates time-domain and frequency-domain interpolation. The interpolation achieves super-resolution robot audition, that is, a resolution higher than that of the pre-measured TFs. For the second issue, we propose two methods for SSL: MUSIC based on generalized singular value decomposition (GSVD-MUSIC) and hierarchical SSL (H-SSL). GSVD-MUSIC drastically reduces the computational cost while maintaining noise robustness for localization.
In addition, H-SSL reduces the computational cost by introducing a hierarchical search algorithm instead of a greedy search for localization. These techniques are integrated into a robot audition system using a robot-embedded microphone array. Preliminary experiments for each technique showed the following: (1) the proposed interpolation achieved approximately 1-degree resolution in both SSL and SSS, although the TFs were measured only at 30-degree intervals; (2) GSVD-MUSIC required 46.4% and 40.6% of the computational cost of SEVD-MUSIC and GEVD-MUSIC, respectively; and (3) H-SSL reduced the computational cost of localizing a single speaker by 71.7%. Finally, the robot audition system, including super-resolution SSL and SSS, was applied to robustly recognize four simultaneous speech sources in a real environment. The proposed system showed considerable improvements of up to 7% in the average word correct rate during simultaneous speech recognition, especially when the TFs were at intervals of more than 30 degrees.
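As a rough illustration of why a hierarchical search can lower localization cost, the sketch below compares a full scan of a 1-degree azimuth grid with a generic coarse-to-fine search. It is not the authors' H-SSL: `toy_spectrum`, the source angle of 47 degrees, and the step sizes (30, 5, 1) are all hypothetical choices, and the toy function merely stands in for a MUSIC spatial spectrum with a single smooth peak. The evaluation counts apply to this toy setup only and are unrelated to the 71.7% figure reported above.

```python
import math


def toy_spectrum(azimuth_deg, source_deg=47.0):
    """Hypothetical stand-in for a spatial spectrum: one smooth peak at source_deg."""
    diff = math.radians(azimuth_deg - source_deg)
    return 1.0 / (1.01 - math.cos(diff))


def exhaustive_search(spectrum, step=1):
    """Evaluate every direction on a fine grid. Returns (best_azimuth, evaluations)."""
    candidates = range(0, 360, step)
    best = max(candidates, key=spectrum)
    return best, len(candidates)


def hierarchical_search(spectrum, steps=(30, 5, 1)):
    """Coarse-to-fine search. Returns (best_azimuth, evaluations).

    The full circle is scanned at the coarsest step; each finer step then
    searches only within one coarser step on either side of the current best.
    """
    candidates = list(range(0, 360, steps[0]))
    best = max(candidates, key=spectrum)
    evaluations = len(candidates)
    for coarse, fine in zip(steps, steps[1:]):
        candidates = [(best + offset) % 360
                      for offset in range(-coarse, coarse + 1, fine)]
        best = max(candidates, key=spectrum)
        evaluations += len(candidates)
    return best, evaluations


print(exhaustive_search(toy_spectrum))    # (47, 360)
print(hierarchical_search(toy_spectrum))  # (47, 36)
```

Both searches find the same peak here, but the coarse-to-fine version evaluates the spectrum far fewer times. That shortcut relies on the coarse grid landing near the true peak, which holds for this smooth single-peak toy function and would need care with several simultaneous sources.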