Abstract

In this paper, we describe a workflow for the data-driven acquisition and semantic scaling of a lexicon that covers lexical items from the lower end of the German language register—terms typically considered as rough, vulgar or obscene. Since the fine semantic representation of grades of obscenity can only inadequately be captured at the categorical level (e.g., obscene vs. non-obscene, or rough vs. vulgar), our main contribution lies in applying best-worst scaling, a rating methodology that has already been shown to be useful for emotional language, to capture the relative strength of obscenity of lexical items. We describe the empirical foundations for bootstrapping such a low-end lexicon for German by starting from manually supplied lexicographic categorizations of a small seed set of rough and vulgar lexical items and automatically enlarging this set by means of distributional semantics. We then determine the degrees of obscenity for the full set of all acquired lexical items by letting crowdworkers comparatively assess their pejorative grade using best-worst scaling. This semi-automatically enriched lexicon already comprises 3,300 lexical items and incorporates 33,000 vulgarity ratings. Using it as a seed lexicon for fully automatic lexical acquisition, we were able to raise its coverage up to slightly more than 11,000 entries.
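As a rough illustration of how best-worst scaling judgments can be turned into graded scores, the sketch below uses the common counting procedure: an item's score is the share of its appearances in which it was picked as "best" minus the share in which it was picked as "worst". Here "best" stands for "most obscene" and "worst" for "least obscene". The function name `bws_scores`, the tuple size of four, and the placeholder items `w1` to `w5` are assumptions made for this example; the abstract does not state which scoring variant or tuple size the authors used.

```python
from collections import Counter


def bws_scores(annotations):
    """Turn best-worst judgments into scores in [-1, 1] by simple counting.

    `annotations` is an iterable of (items, best, worst) triples, where
    `items` is the tuple shown to an annotator and `best` / `worst` are
    the items judged most / least obscene (hypothetical data layout).
    """
    best = Counter()
    worst = Counter()
    seen = Counter()
    for items, most, least in annotations:
        if most not in items or least not in items:
            raise ValueError("best/worst choice must come from the shown tuple")
        seen.update(items)  # every shown item counts as one appearance
        best[most] += 1
        worst[least] += 1
    # Score = (times chosen best - times chosen worst) / times shown
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Toy judgments over neutral placeholder items (not taken from the lexicon).
annotations = [
    (("w1", "w2", "w3", "w4"), "w1", "w4"),
    (("w1", "w2", "w3", "w5"), "w1", "w5"),
    (("w2", "w3", "w4", "w5"), "w2", "w4"),
]
scores = bws_scores(annotations)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting by score gives a relative ordering of the items rather than a category label, which is the property the abstract appeals to when it contrasts scaling with categorical tagging.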

Highlights

  • With the rapid diffusion of social media in our daily lives, we currently experience a fundamental change of social communication habits.

  • Since a broad-coverage lexicon of obscene German is missing, we decided on a weakly supervised approach to lexicon acquisition based on bootstrapping. It consists of the following steps (the overall workflow is fundamentally inspired by the work of Wiegand et al (2018a), yet complements it with a hitherto unexplored methodology to scale the degree of obscenity of lexical items based on best-worst scaling): 1. Language Resources: Select a seed lexicon which contains a collection of lexical items already tagged as rough and vulgar

  • We are concerned with the lexical segment at the lower stylistic end of each natural language, often referred to as rough, vulgar and obscene.


Summary

Introduction

With the rapid diffusion of social media in our daily lives, we currently experience (and many of us foster) a fundamental change of social communication habits. The chance for malicious interactions is promoted by the sheer mass of players in these networks and by easy ways of hiding real individual identities via nicknames or technically slightly more advanced means of camouflage, such as fake Web identities, including non-benevolent software agents and chatbots (McIntire et al, 2010). These promiscuous communication groups face a high risk of anti-social behavior by aggressive, ruthless or entirely hostile actors (Dadvar et al, 2014; Wester et al, 2016; Li et al, 2017b; Talukder and Carbunar, 2018). The standard way to deal with this challenge is to define category systems (binary ones, such as obscene vs non-obscene, or staged ones, as illustrated by pejorative vs rough vs vulgar) and to let people decide on the assignment of lexical items to these discrete categories. Once such categorical features are available, these lexical resources can be exploited for analytic purposes. We present VULGER, a lexicon of VULgar GERman, totalling slightly more than 11,000 entries.

Related Work
Lexicon Acquisition Method
Machine Learning
Building the Seed Lexicon
Enriching the Seed Lexicon
Regression Models
Applying Regression Models to Enhance the Lexicon
Conclusion
