Fluency evaluation of spontaneous speech has been studied extensively. However, most previous studies focus on lexical cues, and little attention has been paid to how diverse acoustic features and deep end-to-end models contribute to performance. In this paper, we describe a multi-layer neural network that uses not only lexical features extracted from transcriptions but also utterance-level acoustic features extracted from audio data. We also conduct experiments to investigate the performance of end-to-end approaches that take mel-spectrograms as input. We frame speech fluency evaluation as two binary classification tasks: fluent speech detection and disfluent speech detection. The evaluation data consist of speech samples of around 10 seconds each, annotated with one of three classes: “fluent,” “neutral,” and “disfluent.” Based on the two-way splits of these three classes, fluent speech detection is defined as the binary classification of fluent vs. neutral and disfluent, while disfluent speech detection is defined as the binary classification of fluent and neutral vs. disfluent. We then conduct experiments to compare the multi-layer neural network using diverse features against the end-to-end models. For fluent speech detection, when utterance-level disfluency-based, prosodic, and acoustic features are compared using the multi-layer neural network, the disfluency-based and prosodic features alone perform better. More specifically, performance improves substantially when all acoustic features are removed from the full feature set, while it degrades substantially when the filler-related features are removed. Overall, however, the end-to-end Transformer+VGGNet model with mel-spectrogram input achieves the best results.
For disfluent speech detection, the multi-layer neural network using disfluency-based, prosodic, and acoustic features without fillers achieves the best results. The end-to-end Transformer+VGGNet architecture also obtains high scores, but it is outperformed by the best multi-layer neural network results, and the difference is significant. Thus, unlike in fluent speech detection, disfluency-based and prosodic features other than fillers are still necessary in disfluent speech detection.
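
The two binary tasks defined above can be sketched as a simple relabeling of the three-class annotations. This is only an illustration of the task definition, not the authors' code: the function name `to_binary_targets` and the dictionary keys are hypothetical, and the label strings are assumed to match the class names given in the abstract.

```python
from typing import Dict, List

# The three annotation classes named in the abstract.
CLASSES = ("fluent", "neutral", "disfluent")


def to_binary_targets(labels: List[str]) -> Dict[str, List[int]]:
    """Derive both binary target vectors from three-class labels.

    Hypothetical helper for illustration only.
    fluent_detection:    1 = fluent,    0 = neutral or disfluent
    disfluent_detection: 1 = disfluent, 0 = fluent or neutral
    """
    for label in labels:
        if label not in CLASSES:
            raise ValueError(f"unknown fluency label: {label!r}")
    return {
        "fluent_detection": [int(label == "fluent") for label in labels],
        "disfluent_detection": [int(label == "disfluent") for label in labels],
    }


if __name__ == "__main__":
    targets = to_binary_targets(["fluent", "neutral", "disfluent"])
    print(targets["fluent_detection"])
    print(targets["disfluent_detection"])
```

Under this relabeling, a "neutral" sample is a negative example in both tasks, which is why the two detectors can favor different feature sets.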