Comparative Evaluation of Diverse Features in Fluency Evaluation of Spontaneous Speech

Huaijin Deng, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki

doi:10.1587/transinf.2022edp7047

Abstract

Many previous studies have addressed fluency evaluation of spontaneous speech. However, most of them focus on lexical cues, and little attention has been paid to how diverse acoustic features and deep end-to-end models contribute to performance. In this paper, we use a multi-layer neural network to investigate not only lexical features extracted from transcriptions but also utterance-level acoustic features extracted from audio data. We also conduct experiments to investigate the performance of end-to-end approaches that take mel-spectrograms as input. We evaluate these methods on two binary classification tasks: fluent speech detection and disfluent speech detection. The evaluation data consists of speech samples of around 10 seconds each, annotated with one of three classes: “fluent,” “neutral,” and “disfluent.” The two tasks correspond to the two-way splits of these three classes: fluent speech detection is the binary classification of fluent vs. neutral and disfluent, while disfluent speech detection is the binary classification of fluent and neutral vs. disfluent. We then conduct a comparative evaluation of the multi-layer neural network with diverse features and of the end-to-end models. For fluent speech detection, when utterance-level disfluency-based, prosodic, and acoustic features are compared with the multi-layer neural network, using only the disfluency-based and prosodic features performs better. More specifically, performance improves substantially when all of the acoustic features are removed from the full feature set, while it degrades substantially when the filler-related features are removed. Overall, however, the end-to-end Transformer+VGGNet model with mel-spectrogram input achieves the best results. For disfluent speech detection, the multi-layer neural network using disfluency-based, prosodic, and acoustic features without fillers achieves the best results. The end-to-end Transformer+VGGNet architecture also obtains high scores, but it is exceeded by the best multi-layer neural network results with a significant difference. Thus, unlike in fluent speech detection, disfluency-based and prosodic features other than fillers are still necessary in disfluent speech detection.
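The two-way splits of the three annotation classes can be illustrated with a minimal Python sketch. The function name `to_binary_tasks` and the encoding of the target class as 1 are assumptions made for illustration only; the abstract does not specify how the labels are encoded.

```python
from typing import Tuple

# The three annotation classes named in the abstract.
LABELS = ("fluent", "neutral", "disfluent")


def to_binary_tasks(label: str) -> Tuple[int, int]:
    """Map a three-class label to targets for the two binary tasks.

    Returns (fluent_target, disfluent_target), where 1 marks the class
    being detected. The 0/1 encoding is an assumption of this sketch.
    """
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    # Fluent speech detection: fluent vs. neutral and disfluent.
    fluent_target = 1 if label == "fluent" else 0
    # Disfluent speech detection: fluent and neutral vs. disfluent.
    disfluent_target = 1 if label == "disfluent" else 0
    return fluent_target, disfluent_target


annotations = ["fluent", "neutral", "disfluent", "neutral"]
targets = [to_binary_tasks(a) for a in annotations]
print(targets)  # [(1, 0), (0, 0), (0, 1), (0, 0)]
```

Note that “neutral” falls on the negative side in both tasks, so the two detectors are not complements of each other.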

Journal: IEICE Transactions on Information and Systems
Publication Date: Jan 1, 2023
License type: free

Similar Papers

Comparison of Static and Time-Sequential Features in Automatic Fluency Detection of Spontaneous Speech
Huaijin Deng ... Hiromitsu Nishizaki
18 Nov 2021

Speaker Verification Using Acoustic and Prosodic Features
Utpal Bhattacharjee ... Kshirod Sarmah
Advanced Computing: An International Journal | VOL. 4
31 Jan 2013

Deep Learning for Asphyxiated Infant Cry Classification Based on Acoustic Features and Weighted Prosodic Features
Chunyan Ji ... Xueli Xiao
01 Jul 2019

Improvement of speaker recognition by combining residual and prosodic features with acoustic features
Shi-Han Chen ... Hsiao-Chuan Wang
17 May 2004
