Abstract

Languages across the world have words, phrases, and behaviors (the taboos) that are avoided in public communication because they are considered obscene or disturbing to the social, religious, and ethical values of society. However, people deliberately use these linguistic taboos and other language constructs to make hurtful, derogatory, and obscene comments. It is nearly impossible to construct a universal set of offensive or taboo terms because offensiveness is determined entirely by factors such as socio-physical setting, speaker-listener relationship, and word choice. In this article, we present a detailed corpus-based study of offensive language in Nepali. We identify and describe more than 18 categories of linguistic offense, including politics, religion, race, and sex. We discuss 12 common types of euphemism, such as synonym, metaphor, and circumlocution. In addition, we introduce a manually constructed dataset of more than 1,000 offensive and taboo terms popular among contemporary speakers. We describe the first experiments that provide baseline results in detecting offensive language in Nepali. This in-depth study of offensive language, together with the accompanying resource, will provide a foundation for several downstream tasks, such as offensive language detection and language learning.
