End-to-end child-adult speech diarization in naturalistic conditions of preschool classrooms

Prasanna V Kothalkar, John H Hansen, Jay Buzhardt, Dwight Irvin

doi:10.1121/10.0018568

Abstract

Speech and language development are early indicators of overall analytical and learning ability in preschool children. Early childhood researchers are interested in analyzing naturalistic rather than controlled lab recordings to assess both the quality and quantity of communication interactions between children and adults/teachers. Unfortunately, present-day speech technologies are not capable of addressing the wide range of dynamic scenarios found in early childhood classroom settings. Due to the diversity of acoustic events/conditions in daylong audio streams, automated speaker diarization technology is limited and must be advanced to address this challenging domain for audio segmentation and metadata extraction. We investigate a deep learning-based diarization solution for segmenting classroom interactions of 3- to 5-year-old children engaging with teachers. Here, the focus is on speaker-label diarization, which classifies speech segments as belonging to either adults or children, partitioned across multiple classrooms. Our proposed ECAPA-TDNN model achieves a best F1-score of 65.5% on data from two classrooms, based on open dev and test sets for each classroom. In addition, F1-scores for individual speaker labels provide a breakdown of performance across naturalistic child classroom engagement. The study demonstrates the prospects of addressing educational assessment needs through communication audio stream analysis, while maintaining both the security and privacy of all children and adults.

Full Text