Abstract

Estimating speaker height can assist in voice forensic analysis and provide additional side information to benefit automatic speaker identification or acoustic model selection for automatic speech recognition. In this study, two approaches are presented: a statistical approach to height estimation that incorporates acoustic models within a non-uniform height bin width Gaussian mixture model structure, and a formant analysis approach that employs linear regression on selected phones. The accuracy and trade-offs of these systems are explored by examining the consistency of the results and the location and causes of error, as well as a fused combination of the two systems, using data from the TIMIT corpus. Open set testing is also presented using the Multi-session Audio Research Project corpus and publicly available YouTube audio to examine the effect of channel mismatch between training and testing data and to provide a realistic open-domain testing scenario. The proposed algorithms achieve performance highly competitive with previously published literature. Although differences in data partitioning between the literature and this study may prevent performance comparisons in absolute terms, the mean average error of 4.89 cm for males and 4.55 cm for females obtained by the proposed algorithm on TIMIT utterances containing the selected phones suggests a considerable decrease in estimation error compared to past efforts.
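As a rough illustration of the bin-based statistical idea described in the abstract, the sketch below fits one Gaussian mixture model per height bin and returns the centre of the best-scoring bin as the height estimate. The bin boundaries, the feature type (MFCC-style frame vectors), and the mixture settings are illustrative assumptions only; they are not the configuration, bin widths, or back-end used in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative, non-uniform height bins in cm (hypothetical boundaries,
# not the bin edges used in the paper).
BINS = [(150, 165), (165, 172), (172, 178), (178, 192)]

def train_bin_models(features_by_bin, n_components=8):
    """Fit one GMM per height bin on pooled acoustic feature vectors
    (e.g. MFCC frames) from speakers whose height falls in that bin."""
    models = []
    for feats in features_by_bin:          # list of (n_frames, n_dims) arrays
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', max_iter=200)
        gmm.fit(feats)
        models.append(gmm)
    return models

def estimate_height(models, test_feats):
    """Score the test utterance against every bin model and return the
    centre of the best-scoring bin as the height estimate (cm)."""
    scores = [gmm.score(test_feats) for gmm in models]   # mean log-likelihood
    best = int(np.argmax(scores))
    lo, hi = BINS[best]
    return 0.5 * (lo + hi)
```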

Highlights

  • Recent years have witnessed an increased interest in the use of voice-based biometrics as complementary traits for person identification, access authorization, forensics, and surveillance

  • A typical speaker identification system can identify an individual from a specific set of speakers (Reynolds, 1995), declare that the individual does not belong to the group of target speakers and is out of set (Angkititrakul and Hansen, 2007), or verify whether the speaker is the claimed individual or an impostor (Kinnunen and Li, 2010; Hasan et al., 2013)

  • Mean average error (MAE) is the measure chosen to reflect the performance of the modified formant/line spectral frequency track regression (MFLTR) method because it has been used in previous studies of height estimation (Ganchev et al., 2010a; Mporas and Ganchev, 2009; Arsikere et al., 2012, 2013; Williams and Hansen, 2013); a brief illustration follows this list
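For reference, a minimal sketch of how MAE can be computed, taking it as the average absolute deviation between estimated and true heights in centimetres (an assumption consistent with the values reported in the abstract):

```python
def mean_average_error_cm(predicted, actual):
    """Average absolute difference between predicted and true heights (cm)."""
    assert len(predicted) == len(actual) and len(actual) > 0
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# e.g. mean_average_error_cm([178.0, 165.2], [175.0, 168.0]) -> 2.9
```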



Introduction

Recent years have witnessed an increased interest in the use of voice-based biometrics as complementary traits for person identification, access authorization, forensics, and surveillance. The acoustic theory of speech production suggests that formant center frequencies are inversely proportional to vocal tract length (VTL) (Lee and Rose, 1996). In this sense, the strong correlation of VTL with height found in Fitch and Giedd (1999) could be expected to transfer to formant frequencies. Higher formants tend to be steadier and to better reflect the actual VTL, a property successfully exploited in parametric VTL normalization techniques for automatic speech recognition (Eide and Gish, 1996).
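The inverse relationship mentioned above follows from the classic uniform-tube (quarter-wavelength) approximation of the vocal tract. The sketch below is that textbook model only, not the estimation method proposed in the paper:

```python
SPEED_OF_SOUND_CM_S = 35_000.0   # ~350 m/s in warm, moist air

def vtl_from_formant(formant_hz, n):
    """Estimate vocal tract length (cm) from the n-th formant frequency
    using the uniform-tube (quarter-wavelength) approximation:
        F_n ~= (2n - 1) * c / (4 * L)   =>   L ~= (2n - 1) * c / (4 * F_n)
    Higher formants (larger n) tend to give steadier length estimates,
    which is one reason they are favoured in VTL-related normalization."""
    return (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * formant_hz)

# e.g. a third formant near 2500 Hz implies a vocal tract of roughly
# vtl_from_formant(2500.0, 3)  ->  17.5 cm, typical of an adult male
```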

