Vagueness: Natural Language and Semantics

Carlos Castillo

doi:10.1017/cbo9781316476840.004

Abstract

During the 2015 Nepal earthquake, a 26-year-old Indian lawyer and activist posted the following on Twitter:

Media must report about d alleged 20k RSS chaps off 2 #Nepal.here's a pic coz d 1 @ShainaNC shared isn't true.. ;)

Meaning: media must report about allegations that twenty thousand volunteers from India's Rashtriya Swayamsevak Sangh (RSS) had joined the relief efforts in Nepal, as falsely claimed on Twitter by Shaina NC (a member of the Bharatiya Janata Party, a political group close to the RSS).

This message mixes shortened words (“d” for “the,” “2” for “to,” “coz” for “because,” “pic” for “picture”), ambiguous abbreviations (“RSS,” which can stand for several things), British slang (“chaps”), platform-specific codes (such as the hashtag #Nepal and the user mention @ShainaNC), punctuation/capitalization issues (no space between “#Nepal.” and “here's,” use of two dots instead of an ellipsis), and sarcasm expressed through a “wink” emoticon (“;)”). In general, understanding a message in social media requires contextual information to compensate for fragmented, ambiguous – in other words, vague – text that is open to more than one interpretation.

This chapter is about Natural Language Processing (NLP), which encompasses computational methods created for dealing with human language. NLP methods incorporating statistical machine learning elements were developed in the 1980s and 1990s using mostly professionally written texts, such as newspaper articles. From the late 1990s through the 2000s, these methods were extended to deal with Web content and then, in the late 2000s and early 2010s, with social media messages and short text messages sent from mobile phones (SMS). Many modern NLP methods are based on machine learning.

The next section (§3.1) describes the text of social media messages. Then, we outline basic NLP methods such as tokenization, stemming, part-of-speech tagging, and dependency parsing (§3.2), as well as sentiment analysis/opinion mining (§3.3). Next, we describe how to locate references to entities such as people and organizations (§3.4), and, particularly, places (§3.5). Finally, we refer to methods for extracting structured data from unstructured text (§3.6), and for adding semantics to messages (§3.7).

Social Media Is Conversational

In general, on the Internet “we find language that is fragmentary, laden with typographical errors, often bereft of punctuation, and sometimes downright incoherent” (Baron, 2003).
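
To make the example tweet concrete, the following is a minimal Python sketch, not the chapter's own method, of how a social-media-aware tokenizer might separate the hashtag, the user mention, and the emoticon from ordinary words, and then expand the shortened words listed above. The regular expression, the function names (tokenize, normalize), and the SHORT_FORMS lookup table are all illustrative assumptions; only the tweet and the four expansions come from the text.

```python
import re

# The tweet quoted in the abstract, verbatim.
TWEET = ("Media must report about d alleged 20k RSS chaps off 2 #Nepal.here's "
         "a pic coz d 1 @ShainaNC shared isn't true.. ;)")

# Expansions mentioned in the text; the lookup table itself is an assumption.
SHORT_FORMS = {"d": "the", "2": "to", "coz": "because", "pic": "picture"}

# Alternatives are tried in order, so platform-specific codes and emoticons
# are recognized before generic words and punctuation.
TOKEN_RE = re.compile(r"""
    (?P<mention>@\w+)            # user mention, e.g. @ShainaNC
  | (?P<hashtag>\#\w+)           # hashtag, e.g. #Nepal
  | (?P<emoticon>[;:]-?[)(DP])   # a few common emoticons, e.g. ;)
  | (?P<word>\w+(?:'\w+)?)       # words, numbers, simple contractions
  | (?P<punct>[^\w\s])           # any other single non-space character
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (kind, token) pairs in the order they appear."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def normalize(tokens):
    """Expand known short forms; every non-word token is left unchanged."""
    return [
        (kind, SHORT_FORMS.get(tok.lower(), tok) if kind == "word" else tok)
        for kind, tok in tokens
    ]


if __name__ == "__main__":
    for kind, tok in normalize(tokenize(TWEET)):
        print(f"{kind:9s} {tok}")
```

Note that the lookup is context-free: it would also turn a genuinely numeric “2” into “to,” and it does nothing for the ambiguous “RSS.” This is the sense in which such messages need contextual information, and not only surface rules, to be interpreted.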
