Embedding Framework for Identifying Ambiguous Words in Code-Mixed Social Media Text

Shashi Shekhar, M.M Sufyan Beg, Dilip Kumar Sharma

doi:10.1109/ic3i46837.2019.9055679

Abstract

Nowadays, text on social media often contains code-switched and code-mixed content. People widely use such content to express their opinions on any topic in the languages they know. Here, code-mixed text is analyzed to find words that can be used in both Hindi and English but with different meanings depending on context. This leads to a word sense ambiguity problem, since one word can have a different meaning when it is used in the context of other words in a sentence. Romanized Hindi and English both exhibit word sense ambiguity, and resolving this ambiguity with machine learning models is a current research issue. Here, character embedding features are used to represent each word written in code-mixed content. The proposed method identifies context words by classifying the intent behind the use of an ambiguous word in a code-mixed sentence. A well-known hierarchical LSTM model is used for context-based, sub-word-level ambiguity detection to identify the language of the word. Using character-based embeddings to process ambiguous words for language identification in code-mixed text is a novel approach and shows promising results.
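
As a rough illustration of the character-level word representation mentioned in the abstract, the sketch below maps each token of a code-mixed sentence to a fixed-length sequence of character indices, which is the usual input to a character embedding layer. It is not the authors' implementation: the example sentence, the function names `build_char_vocab` and `encode_word`, the reserved `PAD`/`UNK` indices, and the maximum length of 8 are all assumptions made for this example.

```python
# Minimal sketch (assumed names and settings, not the paper's code):
# turn code-mixed tokens into padded character-index sequences.

PAD, UNK = 0, 1  # reserved indices for padding and unseen characters (assumption)


def build_char_vocab(tokens):
    """Assign an integer index (starting at 2) to every character seen in tokens."""
    chars = sorted({ch for tok in tokens for ch in tok.lower()})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode_word(word, vocab, max_len=8):
    """Map a word to exactly max_len character indices, truncating or padding."""
    ids = [vocab.get(ch, UNK) for ch in word.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


# Hypothetical Romanized Hindi sentence; "main" and "to" are spelled like
# English words but are used here as Hindi words, so they are ambiguous.
sentence = "main to school ja raha hoon".split()
vocab = build_char_vocab(sentence)
encoded = [encode_word(w, vocab) for w in sentence]

for word, ids in zip(sentence, encoded):
    print(word, ids)
```

In a full model, index sequences like these would typically be passed through an embedding layer and a character-level LSTM to produce one vector per word, with a second, sentence-level LSTM supplying the context needed to label ambiguous words; that arrangement is one common reading of "hierarchical LSTM" and is stated here as an assumption.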

Full Text