Abstract

This work addresses the problem of detecting the speaker in audio-visual sequences by evaluating the synchrony between the audio and video signals. Prior to classification, an information-theoretic framework is applied to extract optimized audio features using video information. The classification step is then defined through a hypothesis testing framework so as to obtain confidence levels associated with the classifier outputs. Such an approach allows evaluation of the efficiency of the whole classification process and, in particular, of the advantage of performing the feature extraction or not. As a result, it is shown that introducing a feature extraction step prior to classification increases the ability of the classifier to produce good relative instance scores.

1 Introduction

This work addresses the problem of detecting the current speaker among two candidates in an audio-video sequence, using a single camera and microphone. To this end, the detection process has to consider both the audio and video cues, as well as their interrelationship, to come up with a decision. In particular, previous works in the domain have shown that evaluating the synchrony between the two modalities, interpreted as the degree of mutual information between the signals, allows recovering the common source of the two signals, that is, the speaker [1], [2].

Other works, such as [3] and [4], have pointed out that fusing the information contained in each modality at the feature level can greatly help the classification task: the richer and the more representative the features, the more efficient the classifier. Using an information-theoretic framework based on [3] and [4], audio features specific to speech are extracted using the information content of both the audio and video signals as a preliminary step for the classification. Such an approach and its advantages have already been described in detail in [5].
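The synchrony criterion above rests on estimating the mutual information between an audio feature sequence and a video feature sequence. As a minimal sketch of that idea (not the authors' actual estimator, whose details are in [1]-[5]), the following hypothetical function estimates mutual information in bits from a joint histogram of two 1-D feature streams; the bin count and signals are illustrative assumptions:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based estimate of I(X;Y) in bits between two 1-D feature sequences.

    x, y: equal-length arrays of per-frame features (e.g. audio energy and
    mouth-region motion); bins: histogram resolution (illustrative choice).
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint probability p(x, y)
    px = pxy.sum(axis=1, keepdims=True)    # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)    # marginal p(y)
    nz = pxy > 0                           # avoid log(0) on empty cells
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```

Under this sketch, the candidate whose video features yield the highest mutual information with the audio features would be selected as the speaker; synchronized streams score well above independent ones.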
This feature extraction step is followed by a classification step, where a label "speaker" or "non-speaker" is assigned to pairs of audio and video features. The definition of this classification step constitutes the contribution of this work.

As stated previously, the classifier decision should rely on an evaluation of the synchrony between pairs of audio and video features. In [4], the authors formulate the
