Abstract

A real-time prototype smart video-conferencing system combining auditory input from a microphone array with video input is designed and implemented. The audio subsystem uses accurate algorithms to localize sound sources in typical noisy office environments and performs dereverberation and beamforming to enhance signal quality. The estimated source coordinates are used to control the pan, tilt, and zoom of a video camera. The video-processing subsystem segments people in the image using an acquired background model, labels and tracks them, and obtains a close-up view of the talker's head. Heuristic algorithms for intelligent zoom and tracking of people in video-conferencing scenarios are incorporated into the system. The current implementation runs on a commercial off-the-shelf personal computer equipped with a data-acquisition board and requires no expensive hardware. The speed, precision, and robustness of the system are superior to those of existing systems. Demonstrations of the system in different scenarios will be presented, along with quantitative measures of its performance. The system is designed as a general-purpose front end to systems for speech and person recognition, and to newer human-computer interfaces. [Work supported in part by DARPA.]
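The abstract does not specify the localization algorithm used; a common approach for microphone-array source localization of the kind described is time-delay-of-arrival (TDOA) estimation between microphone pairs, often via the generalized cross-correlation with phase transform (GCC-PHAT), which is robust to reverberation. The sketch below is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time delay (seconds) of `sig` relative to `ref`
    using generalized cross-correlation with PHAT weighting."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    # Re-center so index 0 of the search window corresponds to -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Synthetic check: delay white noise by 5 samples and recover the lag.
fs = 16000
rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
y = np.concatenate((np.zeros(5), x[:-5]))   # y lags x by 5 samples
tau = gcc_phat(y, x, fs)                    # ~ 5 / 16000 s
```

With delays estimated for several microphone pairs of known geometry, the source position can be triangulated and converted to the pan/tilt commands the abstract describes.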
