Abstract

Text-to-sing models generate singing from text input (i.e., a music score with lyrics), from which only syllable-level speech-music alignment can be acquired. To improve text-to-sing models, more fine-grained phoneme-level information is needed. We therefore investigate the acoustic measurements of segments and their temporal relationship with music beats from a linguistics perspective. Two research questions are addressed: (1) Do beats align with syllable onsets or nuclear vowel onsets? (2) Do different types of consonants show different speech-beat alignment patterns? Unaccompanied singing by professional singers in two rhythmically dissimilar languages, English (15 songs) and Mandarin (25 songs), was analysed. The data were segmented manually into phonemes by a trained annotator; a music scholar independently labelled the beats. Preliminary results suggest that Mandarin songs strongly favour vowels as anchors for beats (66.7%), whereas only 52.9% of beats fall on vowels in English. In both languages, beats show a strong preference for the end of consonants and the beginning of vowels. Phoneme type also plays a significant role in the speech-beat alignment distribution. Future modelling of speech-beat alignment in singing, and comparison with speech rhythm data, will also contribute to theories of linguistic rhythm.
