Speech Corpus Research Articles

Developing an automatic speaker verification (ASV) system for children is extremely challenging because children’s speech corpora are scarce. The challenges are further exacerbated in the case of short utterances. Voice-based biometric systems require an adequate amount of speech data for enrollment and verification; otherwise, performance degrades considerably. In this paper, we focus on data paucity and on preserving higher-frequency content in order to enhance the performance of a short-utterance-based children’s speaker verification system. To deal with data scarcity, we use several out-of-domain data augmentation techniques. Since the out-of-domain data comes from adult speakers, whose speech is acoustically very different from children’s speech, we apply prosody modification, formant modification, and voice conversion to render it acoustically similar to children’s speech prior to augmentation. This not only increases the amount of training data but also helps capture the missing target attributes. The proposed data augmentation yields a relative improvement of 33.6% in equal error rate (EER) over the baseline system trained solely on the child dataset. To preserve higher-frequency content, we concatenate the classical Mel-frequency cepstral coefficient (MFCC) features with either linear-frequency cepstral coefficient (LFCC) or inverse-Mel-frequency cepstral coefficient (IMFCC) features. The Mel filter bank resolves higher-frequency components poorly, whereas linear and inverse-Mel filter banks resolve them better. Moreover, MFCC and IMFCC features exhibit low canonical correlation. Consequently, frame-level concatenation of MFCC features with LFCC or IMFCC features gives better resolution of both lower- and higher-frequency components, and the EER drops considerably when either LFCC or IMFCC features are concatenated with MFCC features. For the full test set, concatenating IMFCC features with MFCC features gives a relative EER reduction of 10.56% with respect to MFCC features alone. This novel approach of data augmentation followed by frame-level feature concatenation achieves an overall reduction of 40.6% in EER.
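The frame-level concatenation and the reported percentages can be illustrated with a minimal Python sketch. The feature values are toy numbers and the function names (`concat_frames`, `relative_reduction`, `compose_reductions`) are hypothetical, not taken from the paper. The abstract does not say how the 33.6% and 10.56% figures combine; the sketch assumes the two relative reductions apply one after the other, and under that assumption the result is consistent with the reported 40.6%.

```python
def concat_frames(mfcc, other):
    """Join two per-frame feature sequences frame by frame.

    Each input is a list of frames; each frame is a list of coefficients.
    Both streams must be computed over the same frames.
    """
    if len(mfcc) != len(other):
        raise ValueError("both streams must have the same number of frames")
    return [list(a) + list(b) for a, b in zip(mfcc, other)]


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent with respect to a baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


def compose_reductions(*percent_reductions):
    """Overall relative reduction when reductions are applied in sequence.

    Assumption (not stated in the abstract): each reduction is measured
    against the system produced by the previous step.
    """
    remaining = 1.0
    for r in percent_reductions:
        remaining *= 1.0 - r / 100.0
    return 100.0 * (1.0 - remaining)


# Toy data: 2 frames, 2 MFCC and 2 IMFCC coefficients per frame.
mfcc = [[1.0, 2.0], [3.0, 4.0]]
imfcc = [[0.1, 0.2], [0.3, 0.4]]
combined = concat_frames(mfcc, imfcc)  # each frame now has 4 coefficients

# Augmentation (33.6%) followed by feature concatenation (10.56%).
overall = compose_reductions(33.6, 10.56)
print(combined)
print(round(overall, 1))
```

In a real system the two streams would come from the same framing (window length and hop), so that frame i of one stream lines up with frame i of the other.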

Objective: Speech recognition technology is a mature technical approach that is widely used in many fields. In the study of depression recognition, speech signals are commonly used because they are convenient and easy to acquire. Although speech recognition is popular in depression recognition research, it has been little studied for somatisation disorder recognition. The reason is the lack of a publicly accessible database of relevant speech and of benchmark studies. To this end, we introduce our somatisation disorder speech database and give benchmark results. Methods: By collecting speech samples from somatisation disorder patients in cooperation with the Shenzhen University General Hospital, we built our somatisation disorder speech database, the Shenzhen Somatisation Speech Corpus (SSSC). We also propose a benchmark for SSSC using classic acoustic features and a machine learning model. Results: To obtain a more rigorous benchmark, we compared and analysed the performance of different acoustic features, i.e., the full ComParE feature set, or only Mel-frequency cepstral coefficients (MFCCs), fundamental frequency (F0), and the frequency and bandwidth of the formants (F1–F3). The best benchmark result was an unweighted average recall of 76.0%, achieved by a support vector machine with formants F1–F3. Conclusion: The SSSC may bridge a research gap in somatisation disorder by providing researchers with a publicly accessible speech database. In addition, the benchmark results indicate the scientific validity and feasibility of computer audition for speech recognition in somatisation disorders.
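The benchmark metric, unweighted average recall (UAR), is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The following Python sketch uses made-up labels, not data from the SSSC, and the function name is hypothetical.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Toy, imbalanced example: 2 patients and 8 controls (illustrative only).
y_true = ["patient"] * 2 + ["control"] * 8
y_pred = ["patient", "control"] + ["control"] * 8

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(uar, accuracy)
```

Here plain accuracy is 0.9 while UAR is 0.75, because only half of the minority class is recalled. That is why UAR is a common choice when classes are imbalanced.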

Articles published on Speech Corpus

Effective preservation of higher-frequency contents in the context of short utterance based children’s speaker verification system

Filled Pauses Produced by Autistic Adults Differ in Prosodic Realisation, but not Rate or Lexical Type

AVbook, a high-frame-rate corpus of narrative audiovisual speech for investigating multimodal speech perception.

Detecting somatisation disorder via speech: introducing the Shenzhen Somatisation Speech Corpus

A socio-pragmatic analysis of the Turkish discourse markers of ‘şey’, ‘yani’, and ‘işte’ based on educational level of speakers

SEHC: A Benchmark Setup to Identify Online Hate Speech in English

An approach to constructing prosodic grammar for Mandarin read speech.

Prosodic Studies on the Spoken Corpus of the Khalkha Mongolian Language: Age and Gender Effects on F0 and Speech Rate

Distributional and Acoustic Characteristics of Filler Particles in German with Consideration of Forensic-Phonetic Aspects

Grammar-Supervised End-to-End Speech Recognition with Part-of-Speech Tagging and Dependency Parsing

Analysis of the training and test data distribution for audio series classification

Automatic evaluation of spontaneous oral cancer speech using ratings from naive listeners

Particle stacking in Singlish – New data from the National Speech Corpus

A Helium Speech Unscrambling Algorithm Based on Deep Learning

The voice as a material clue: a new forensic Algerian Corpus.

Factors predicting human performance in error annotation for non-native speech corpus

The emergence of word classes in the early ontogenesis of the Bulgarian language. A pilot corpus study

Extended high frequencies for fricative classification in conversational speech

Neural networks’ posterior probability as measure of effects of alcohol on speech

Spectral degradations in the TIMIT, QuickSIN, NU-6, and other popular bandlimited speech materials
