Abstract
• Acoustic signal: Segmentation into speech and speech pauses
  • The ESMERALDA speech recognizer is used to detect voice activity more robustly than an approach that is solely based on signal energy.
• Visual signal: Segmentation into motion peaks
  • A peak ranges between two local minima in the amount of changed pixels in the visual signal.
  • The amount of changed pixels is calculated by summing up a motion history image at each time step.
• Temporal association: Overlapping speech and visual segments are associated to one acoustic package.
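The visual segmentation and the temporal association described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `segment_peaks` and `associate`, the `AcousticPackage` container, the treatment of the first and last sample as segment boundaries, and the choice to build one package per speech segment are all assumptions made for the example. Segments are given as `(start, end)` pairs in a shared time unit (here, frame indices).

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in a shared time unit


@dataclass
class AcousticPackage:
    """Hypothetical container: one speech segment plus overlapping motion peaks."""
    speech: Segment
    motion_peaks: List[Segment] = field(default_factory=list)


def segment_peaks(motion_amount: Sequence[float]) -> List[Segment]:
    """Split a motion-amount signal into peaks bounded by local minima.

    Assumption: the first and last samples also act as boundaries.
    """
    n = len(motion_amount)
    if n < 2:
        return []
    # Interior local minima: lower than the left neighbour and
    # not higher than the right neighbour.
    minima = [0]
    for i in range(1, n - 1):
        if motion_amount[i - 1] > motion_amount[i] <= motion_amount[i + 1]:
            minima.append(i)
    minima.append(n - 1)
    # Each pair of consecutive minima delimits one motion peak.
    return list(zip(minima[:-1], minima[1:]))


def overlaps(a: Segment, b: Segment) -> bool:
    """True if the two intervals share more than a single boundary point."""
    return a[0] < b[1] and b[0] < a[1]


def associate(speech_segments: Sequence[Segment],
              motion_peaks: Sequence[Segment]) -> List[AcousticPackage]:
    """Group each speech segment with the motion peaks that overlap it.

    Assumption: speech segments without any overlapping peak form no package.
    """
    packages = []
    for speech in speech_segments:
        peaks = [p for p in motion_peaks if overlaps(speech, p)]
        if peaks:
            packages.append(AcousticPackage(speech, peaks))
    return packages


# Toy motion-amount signal with a single interior minimum at index 4.
signal = [0, 2, 5, 3, 1, 4, 6, 2, 0]
peaks = segment_peaks(signal)
# Speech segments in the same time unit; the second overlaps no peak.
packages = associate([(1, 3), (9, 10)], peaks)
```

In this toy run the signal splits into two peaks, and only the first speech segment overlaps one of them, so a single package results.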
Highlights
Acoustic packaging makes use of the synchrony between the visual and audio modalities in order to detect temporal structure in actions that are demonstrated to children and robots [1].
The ESMERALDA speech recognizer is used to detect voice activity more robustly than an approach that is solely based on signal energy.
The amount of changed pixels is calculated by summing up a motion history image at each time step.
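To make the last point concrete, the following sketch computes a motion-amount value per frame by updating a motion history image and summing it. It uses the common textbook formulation of a motion history image (changed pixels are set to a fixed duration, all others decay by one per step); the poster does not state which variant, threshold, or duration the authors used, so `update_mhi`, `threshold`, and `duration` are illustrative assumptions. Frames are plain nested lists of grayscale values to keep the example dependency-free.

```python
from typing import List

Image = List[List[int]]  # grayscale frame or motion history image


def update_mhi(mhi: Image, prev_frame: Image, frame: Image,
               threshold: int = 20, duration: int = 3) -> Image:
    """Return the next motion history image (assumed textbook variant).

    Pixels whose intensity changed by more than `threshold` are set to
    `duration`; all other pixels decay by one, but never below zero.
    """
    return [
        [duration if abs(c - p) > threshold else max(m - 1, 0)
         for m, p, c in zip(mhi_row, prev_row, row)]
        for mhi_row, prev_row, row in zip(mhi, prev_frame, frame)
    ]


def motion_amount(mhi: Image) -> int:
    """Sum the motion history image to get one motion value per time step."""
    return sum(sum(row) for row in mhi)


# Three tiny 2x2 frames: one pixel changes, then nothing moves.
frames = [
    [[0, 0], [0, 0]],
    [[0, 50], [0, 0]],
    [[0, 50], [0, 0]],
]

mhi = [[0, 0], [0, 0]]
amounts = []
for prev_frame, frame in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev_frame, frame)
    amounts.append(motion_amount(mhi))
```

Because recent motion decays gradually instead of vanishing, the summed signal is smoother than a raw frame difference, which makes its local minima easier to use as segment boundaries.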
Summary
Lars Schillingmann, Petra Wagner, Christian Munier, Britta Wrede, Katharina Rohlfing.
Figure: A test subject showing an infant how to stack cups.