American English Speakers Research Articles

It has been shown that, in language comprehension, listeners model certain attributes of their interlocutor (e.g., dialectic background, age, gender) and interpret speech against that model; for example, they understand cross-dialectally ambiguous words such as flat and gas for their American English (AE) meanings more often when listening to an AE interlocutor than a British English (BE) interlocutor. This study further investigated whether listeners construct concurrent interlocutor models when communicating with interleaved interlocutors of different dialectic backgrounds, and, if they do, how they choose between concurrent models to interpret words. In two experiments, participants heard a word (e.g., flat) spoken by a BE or AE interlocutor and provided a word associate (indicating which meaning of the word was accessed). When different interlocutors were encountered in separate blocks, participants accessed more AE meanings when listening to an AE rather than a BE interlocutor, and the accent effect was not larger for words pronounced more differently in BE and AE (e.g., fall sounds more distinctly British vs. American than flat does). These results suggest that participants constructed an interlocutor model (e.g., of a BE or an AE speaker) and used it (instead of accent details in a word) to guide word meaning access. When interlocutors were interleaved in the same block, we observed a comparable accent effect, which increased as a function of between-accent differences in pronunciation. These results suggest that participants constructed concurrent interlocutor models and used accent details in a word to select the appropriate interlocutor model. We also observed that the accent effect was comparable for two interleaved interlocutors of the same gender (e.g., a female BE interlocutor and a female AE interlocutor) and for two interleaved interlocutors of different genders (e.g., a female BE interlocutor and a male AE interlocutor). These results suggest that participants did not use gender-related voice details for model selection when accent details were sufficient for interlocutor model selection.


In recent decades, computational approaches to sociophonetic vowel analysis have been steadily increasing, and sociolinguists now frequently use semi-automated systems for phonetic alignment and vowel formant extraction, including FAVE (Forced Alignment and Vowel Extraction; Rosenfelder et al., 2011; Evanini et al., Proceedings of Interspeech, 2009), Penn Aligner (Yuan and Liberman, J. Acoust. Soc. America, 2008, 123, 3878), and DARLA (Dartmouth Linguistic Automation; Reddy and Stanford, DARLA Dartmouth Linguistic Automation: Online Tools for Linguistic Research, 2015a). Yet these systems still have a major bottleneck: manual transcription. For most modern sociolinguistic vowel alignment and formant extraction, researchers must first create manual transcriptions. This human step is painstaking, time-consuming, and resource intensive. If this manual step could be replaced with completely automated methods, sociolinguists could potentially tap into vast datasets that have previously been unexplored, including legacy recordings that are underutilized due to lack of transcriptions. Moreover, if sociolinguists could quickly and accurately extract phonetic information from the millions of hours of new audio content posted on the Internet every day, a virtual ocean of speech from newly created podcasts, videos, live-streams, and other audio content could inform research. How close are the current technological tools to achieving such groundbreaking changes for sociolinguistics? Prior work (Reddy et al., Proceedings of the North American Association for Computational Linguistics 2015 Conference, 2015b, 71–75) showed that an HMM-based automatic speech recognition (ASR) system, trained with CMU Sphinx (Lamere et al., 2003), was accurate enough for DARLA to uncover evidence of the US Southern Vowel Shift without any human transcription. Even so, because that ASR system relied on a small training set, it produced numerous transcription errors.
In the six years since that study, numerous end-to-end ASR algorithms have shown considerable improvement in transcription quality. One example of such a system is the RNN/CTC-based DeepSpeech from Mozilla (Hannun et al., 2014). (RNN stands for recurrent neural network, the learning mechanism for DeepSpeech; CTC stands for connectionist temporal classification, the mechanism to merge phones into words.) The present paper combines DeepSpeech with DARLA to push the technological envelope and determine how well contemporary ASR systems can perform in completely automated vowel analyses with sociolinguistic goals. Specifically, we used these techniques on audio recordings from 352 North American English speakers in the International Dialects of English Archive (IDEA), extracting 88,500 tokens of vowels in stressed position from spontaneous, free speech passages. With this large dataset we conducted acoustic sociophonetic analyses of the Southern Vowel Shift and the Northern Cities Chain Shift in the North American IDEA speakers. We compared the results using three different sources of transcriptions: 1) IDEA’s manual transcriptions as the baseline “ground truth”, 2) the ASR built on CMU Sphinx used by Reddy et al. (Proceedings of the North American Association for Computational Linguistics 2015 Conference, 2015b, 71–75), and 3) the latest publicly available Mozilla DeepSpeech system. We input these three different transcriptions to DARLA, which automatically aligned and extracted the vowel formants from the 352 IDEA speakers. Our quantitative results indicate that newer ASR systems like DeepSpeech show considerable promise for sociolinguistic applications like DARLA. We found that DeepSpeech’s automated transcriptions had a significantly lower character error rate than those from the prior Sphinx system (a reduction from 46% to 35%).
When we performed the sociolinguistic analysis of the extracted vowel formants from DARLA, we found that the automated transcriptions from DeepSpeech matched the results from the ground truth for the Southern Vowel Shift (SVS): five vowels showed a shift in both transcriptions, and two vowels did not show a shift in either transcription. The Northern Cities Shift (NCS) was more difficult to detect, but ground truth and DeepSpeech matched for four vowels: one of the vowels showed a clear shift, and three showed no shift in either transcription. Our study therefore shows how technology has made progress toward greater automation in vowel sociophonetics, while also showing what remains to be done. Our statistical modeling provides a quantified view of both the abilities and the limitations of a completely “hands-free” analysis of vowel shifts in a large dataset. Naturally, when comparing a completely automated system against a semi-automated system involving human manual work, there will always be a tradeoff between accuracy on the one hand and speed and replicability on the other [Kendall and Joseph, Towards best practices in sociophonetics (with Marianna DiPaolo), 2014]. The amount of “noise” that can be tolerated for a given study will depend on the particular research goals and researchers’ preferences. Nonetheless, our study shows that, for certain large-scale applications and research goals, a completely automated approach using publicly available ASR can produce meaningful sociolinguistic results across large datasets, and these results can be generated quickly, efficiently, and with full replicability.
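
The abstract above compares transcription sources by character error rate but does not define the metric. A minimal sketch follows, assuming the common definition: the Levenshtein (edit) distance between the manual reference and the ASR output, divided by the reference length. The function names `edit_distance` and `character_error_rate` are illustrative and are not taken from the paper, DARLA, or DeepSpeech, and the paper's own scoring procedure may differ (for example in text normalization).

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn ref into hyp."""
    # prev[j] holds the distance between the ref prefix processed so far
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            substitution_cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,                      # delete r from ref
                curr[j - 1] + 1,                  # insert h into ref
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one substituted character in a four-character word.
    print(character_error_rate("flat", "flap"))  # 0.25
```

Under this definition a rate of 35% would mean roughly 35 character edits per 100 reference characters. Because insertions are counted, the value can exceed 100% when the ASR output is much longer than the reference.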


Articles published on American English Speakers

Response Cries or Response Statements? A Cross-Linguistic Analysis of Interjectional Expressions in Japanese and English

Interlocutor modelling in comprehending speech from interleaved interlocutors of different dialectic backgrounds.

L2 comprehension of filled pauses and fillers in unscripted speech

Locational pointing in Murrinhpatha, Gija, and English conversations

Effects of task complexity on L2 suggestions

CROSS-CULTURAL STUDY OF REFUSAL STRATEGIES OF AMERICAN AND ARMENIAN ENGLISH SPEAKERS

"I don't Think These Devices are Very Culturally Sensitive."-Impact of Automated Speech Recognition Errors on African Americans.

Female university teachers’ realizations of the speech act of refusal

Social acquisition context matters: Increased neural responses for native but not nonnative taboo words

Lexical preference in second dialect acquisition in a second language

The Caregiver Contribution to Heart Failure Self-care Instrument: Further Psychometric Testing in a European Sample.

Spatiotemporal coordination in word-medial stop-lateral and s-stop clusters of American English.

Acoustic characteristics of Korean-English bilingual speakers’ /l/ and the relationship to their foreign accent ratings.

The tongue dorsum activity in children with cleft palate and typically developing children

Judgments of self-identified gay and heterosexual male speakers of American English: Which vowels do listeners rely on to form their sexual orientation judgments?

Automated vowel space measurement of young children

Comparative Study of Voice Onset Time in English Word-Initial Stop Consonants Produced by Uzbek and American Speakers of English

Semantic prosody of extended lexical units: A case study

Advances in Completely Automated Vowel Analysis for Sociophonetics: Using End-to-End Speech Recognition Systems With DARLA.

PRAGMATIC COMPETENCE OF THE YEMENI EFL LEARNERS
