Abstract

This study presents a smart cognitive sensor, “iRecorder,” that can automatically locate speakers among attendees in a boardroom using ubiquitous arrays of audiovisual sensors. The proposed system consists of two major components: sound localization and mouth tracking. For acoustic processing, this work proposes ridge phase-smoothing direction-of-arrival (DOA) estimation, which refines the distorted phase of a signal and robustly determines acoustic directions. For visual detection, this study develops novel Multiregional Histograms of Oriented Gradients (MHOGs) to model a talking mouth. Unlike conventional HOGs, the proposed feature is not limited to fixed-size windows or blocks; instead, it is computed over facial regions. Finally, the system uses a majority-voting fusion mechanism that integrates cues from both audio and visual sensors to identify the actual speaker. Experimental results for DOA estimation showed that directional errors were reduced by 6.6 degrees on average. For detection of talking faces, the accuracy reached 85.19 percent. The fusion test results further supported the effectiveness of the system. These findings show that the proposed system outperforms comparable approaches and establish its feasibility.
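The abstract does not detail the ridge phase-smoothing procedure or the MHOG descriptor, so the sketch below illustrates only the general pipeline under stated assumptions: a standard phase-based DOA estimator (GCC-PHAT, used here purely as a stand-in for the proposed method) for a two-microphone pair, followed by a simple majority vote of the kind the abstract describes for audiovisual fusion. All function names and parameters (gcc_phat_doa, fuse_majority, bin_width, and the microphone geometry) are hypothetical.

```python
import numpy as np

def gcc_phat_doa(sig_l, sig_r, fs, mic_dist, c=343.0):
    """Estimate a source direction (degrees from broadside) for a
    two-microphone pair via GCC-PHAT time-delay estimation.

    Generic stand-in for the paper's ridge phase-smoothing DOA method,
    whose details are not given in the abstract.
    """
    n = sig_l.size + sig_r.size
    sl = np.fft.rfft(sig_l, n=n)
    sr = np.fft.rfft(sig_r, n=n)
    cross = sl * np.conj(sr)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * mic_dist / c)       # physically plausible delay range
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs
    # Far-field model: tau = mic_dist * sin(theta) / c
    sin_theta = np.clip(c * tau / mic_dist, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))

def fuse_majority(acoustic_angles, visual_angles, bin_width=10.0):
    """Majority vote over quantized direction estimates from both
    modalities, mirroring the fusion strategy named in the abstract."""
    votes = np.concatenate([np.asarray(acoustic_angles),
                            np.asarray(visual_angles)])
    bins = np.round(votes / bin_width) * bin_width
    values, counts = np.unique(bins, return_counts=True)
    return values[np.argmax(counts)]

if __name__ == "__main__":
    fs, d = 16000, 0.1                        # 10 cm microphone spacing
    src = np.random.randn(fs // 10)           # 100 ms of white noise
    left, right = src, np.roll(src, 3)        # right mic lags by 3 samples
    doa = gcc_phat_doa(left, right, fs, d)    # roughly 40 degrees
    print(fuse_majority([doa, doa + 2.0], [38.0]))
```

Quantizing the votes into coarse angular bins keeps the fusion tolerant of small disagreements between the acoustic and visual estimates; the bin width here is an illustrative choice, not a value from the paper.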
