Many speech representations, including conventional acoustic features such as mel-frequency cepstral coefficients and mel-spectrograms, as well as pre-trained contextualized acoustic representations such as wav2vec 2.0, have been used as input to deep neural networks or fine-tuned with a connectionist temporal classification objective for Mandarin tone classification. In this study, the authors propose a transformer-based tone classification architecture, TNet-Full, which uses estimated fundamental frequency (F0) values together with time-aligned syllable and word boundary information. The key components of the framework are a contour encoder, a rhythm encoder, and an interaction encoder that establishes cross-attention between contours and rhythms. By using contextual tonal contours as a reference, and rhythmic information derived from duration variation to strengthen the contour representations for tone recognition, TNet-Full achieves absolute performance improvements of 24.4% on read speech (from 71.4% to 95.8%) and 6.3% on conversational speech (from 52.1% to 58.4%) over a simple baseline transformer, TNet-base, corresponding to relative improvements of 34.2% and 12.1%, respectively. Consistent with how humans perceive tones, contour abstractions of tones can be derived only from F0 sequences, and tone recognition improves when the temporal organization of syllables is stable and predictable rather than fluctuating, as it is in conversational speech.
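To make the described architecture concrete, below is a minimal PyTorch sketch of the contour/rhythm cross-attention idea: two transformer encoders over F0-derived and duration-derived streams, fused by cross-attention in an interaction module. All names (TNetSketch, f0_proj, dur_proj), layer sizes, input feature layouts, and the five-way tone inventory (four lexical tones plus the neutral tone) are illustrative assumptions, not the authors' implementation; positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn


class TNetSketch(nn.Module):
    """Hypothetical sketch of a contour/rhythm cross-attention tone classifier.

    Not the authors' TNet-Full: module choices and dimensions are assumptions.
    """

    def __init__(self, d_model: int = 128, n_heads: int = 4, n_tones: int = 5):
        super().__init__()
        # Project raw per-frame features into the model dimension.
        self.f0_proj = nn.Linear(1, d_model)   # scalar F0 value per frame (assumed layout)
        self.dur_proj = nn.Linear(2, d_model)  # e.g. syllable/word duration cues (assumed layout)

        layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.contour_enc = nn.TransformerEncoder(layer(), num_layers=2)  # encodes F0 contours
        self.rhythm_enc = nn.TransformerEncoder(layer(), num_layers=2)   # encodes duration/rhythm

        # Interaction step: contour representations attend to rhythm representations.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_tones)

    def forward(self, f0: torch.Tensor, dur: torch.Tensor) -> torch.Tensor:
        c = self.contour_enc(self.f0_proj(f0))   # (B, T, d): contextual contour encoding
        r = self.rhythm_enc(self.dur_proj(dur))  # (B, T, d): rhythm encoding
        fused, _ = self.cross_attn(query=c, key=r, value=r)
        # Pool over frames to get one tone prediction per syllable window.
        return self.classifier(fused.mean(dim=1))


model = TNetSketch()
f0 = torch.randn(8, 50, 1)   # 8 syllable windows, 50 F0 frames each
dur = torch.randn(8, 50, 2)  # aligned duration cues (hypothetical layout)
logits = model(f0, dur)      # -> shape (8, 5)
```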