Effective word count estimation for long duration daily naturalistic audio recordings

Ali Ziaei, Abhijeet Sangwan, John H.L. Hansen

doi:10.1016/j.specom.2016.07.007

Abstract

The ability to count words in extended audio sequences allows researchers to explore characteristics of speakers (i.e., leading, following, task responsibility, personal engagement), as well as the dynamics of two-way or multi-subject conversation scenarios. As such, counting the number of words spoken by a person offers a rich information source for several applications such as health monitoring (e.g., autism, Parkinson’s, Alzheimer’s), second language learning, or language development studies. However, developing robust word count systems that achieve high performance at low computational cost is very challenging due to the uncertain and dynamic conditions encountered in audio recordings. In this study, we address the problem for large-scale naturalistic audio recordings based on a 100-day audio collection entitled Prof-Life-Log. This corpus contains continuously recorded audio from one person using a mobile LENA audio recording device (LENA, 2015). The device captures audio for an entire workday, which can last up to 16 hours. Our proposed word count framework consists of five main components: (i) Speech Activity Detection (SAD) to remove non-speech parts of the signal, (ii) Speech Enhancement to suppress the effects of background noise, (iii) Primary vs. Secondary Speaker Detection to remove secondary speaker segments, (iv) Syllable Rate Estimation to estimate the syllable rate for the primary speaker, and (v) Linear Minimum Mean Square Error Estimation (LMMSE) to find the linear mapping between syllable rate and word rate in spontaneous speech. Despite its simplicity, the framework proves very effective in real scenarios, with good performance on various datasets. As an indication of performance, the error of the framework for an entire 16-hour day of audio can be as low as 1% in terms of cumulative Word Count Error.
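As an illustration of component (v), the following is a minimal Python sketch of a scalar linear MMSE mapping from syllable counts to word counts, together with one plausible reading of a cumulative word count error. The function names, the per-segment training pairs, and the exact error definition (absolute difference of summed counts, normalized by the true total) are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import fmean


def fit_lmmse(syllables, words):
    """Fit w_hat = a * s + b, the scalar linear MMSE estimator.

    Uses sample moments of paired training counts (hypothetical data):
    a = cov(s, w) / var(s), b = mean(w) - a * mean(s).
    """
    mu_s, mu_w = fmean(syllables), fmean(words)
    var_s = fmean([(s - mu_s) ** 2 for s in syllables])
    cov_sw = fmean([(s - mu_s) * (w - mu_w) for s, w in zip(syllables, words)])
    a = cov_sw / var_s
    b = mu_w - a * mu_s
    return a, b


def cumulative_word_count_error(estimated, reference):
    """Assumed definition: |sum(estimated) - sum(reference)| / sum(reference)."""
    total_ref = sum(reference)
    return abs(sum(estimated) - total_ref) / total_ref


if __name__ == "__main__":
    # Hypothetical per-segment syllable and word counts for training.
    train_syllables = [10, 20, 30, 40]
    train_words = [9, 16, 23, 30]
    a, b = fit_lmmse(train_syllables, train_words)

    # Apply the learned mapping to new (hypothetical) segments.
    test_syllables = [15, 25, 50]
    estimates = [a * s + b for s in test_syllables]
    print(a, b, estimates)
```

Because errors are summed before normalizing in this assumed definition, per-segment over- and under-estimates can cancel, which is one way a day-long cumulative error could be small even when individual segments are less accurate.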

Full Text