Lexical variation in English language podcasts, editorial media, and social media

Jussi Karlgren

doi:10.3384/nejlt.2000-1533.2022.3566

Abstract

The study presented in this paper demonstrates how transcribed podcast material differs in lexical content from other collections of English language data: editorial text, social media (both long form and microblogs), dialogue from movie scripts, and transcribed phone conversations. Most of the recorded differences are as might be expected, reflecting known or assumed differences between spoken and written language, between dialogue and soliloquy, and between scripted formal and unscripted informal language use. Most notably, podcast material, compared to the hitherto typical training sets from editorial media, is characterised by use of the present tense and by a much higher incidence of pronouns, interjections, and negations. These characteristics are, unsurprisingly, largely shared with social media texts. Where podcast material differs from social media material is in its attitudinal content, with many more amplifiers and much less negative attitude than in blog texts. This variation, besides being of philological interest, has ramifications for computational work. Information access for material which is not primarily topical should be designed to be sensitive to the variation that defines the data set itself and discriminates items within it. In general, the choice of training set for a language model is a non-trivial parameter that is likely to show effects, both expected and unexpected, when the model is applied to data from other sources. The characteristics and provenance of the data used to train a model should be listed on the label as a minimal form of downstream consumer protection.
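The kind of comparison the abstract describes, the incidence of word classes such as pronouns, negations, and amplifiers across collections, can be illustrated with a minimal Python sketch. It is not the method used in the paper: the word lists in `LEXICON`, the tokenizer, and the function names are assumptions made for illustration, and the sample sentences are invented rather than drawn from any of the collections studied.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only;
# these are NOT the lexical resources used in the study.
LEXICON = {
    "pronoun": {"i", "you", "we", "he", "she", "it", "they", "me", "us", "them"},
    "negation": {"no", "not", "never", "nothing", "don't", "can't", "won't"},
    "amplifier": {"very", "really", "so", "totally", "absolutely"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Internal ASCII apostrophes are kept so that "don't" stays one token.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def rates_per_thousand(text):
    """Return occurrences per 1,000 tokens for each category in LEXICON."""
    tokens = tokenize(text)
    counts = Counter()
    for tok in tokens:
        for category, words in LEXICON.items():
            if tok in words:
                counts[category] += 1
    n = len(tokens) or 1  # avoid division by zero on empty input
    return {category: 1000.0 * counts[category] / n for category in LEXICON}


if __name__ == "__main__":
    # Invented toy samples standing in for two collections.
    samples = {
        "spoken-style": "I really don't know, you never told me it was so late.",
        "editorial-style": "The committee published its annual report on Tuesday.",
    }
    for name, text in samples.items():
        print(name, rates_per_thousand(text))
```

Normalising by token count makes collections of different sizes comparable. A real comparison would need much larger samples, curated lexical resources, and a measure of whether the observed differences are statistically meaningful.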

Journal: Northern European Journal of Language Technology	Publication Date: Aug 11, 2022
Citations: 1	License type: CC BY 4.0
