Abstract

Language Identification (LI) is a crucial part of many text-processing pipelines, as most techniques presume that the language of the input text is known. Document-level Language Identification is regarded as an almost solved problem in some application areas, but language detectors fail on social media text because of code-switching, word borrowing across languages, and phonetic typing, all of which imply that LI in code-mixed text must be carried out at the word level. Hence, this work focuses on identifying languages at the word level in multilingual environments such as social media. One of the major concerns in these environments is phonetic typing, which can be taken into account by incorporating graphemic features into the model. Character n-grams take every combination of co-occurring characters into account, resulting in a large model size, whereas graphemic features consider only those combinations of characters that have some underlying linguistic significance. For example, the graphemes ‘kh’ and ‘gh’ occur far more often in languages like Hindi and Urdu than in English. In the dataset of Sarma et al. (Word level language identification in assamese-bengali-hindi-english code-mixed social media text, pp. 261–266), we observe that a larger share of graphemes (53.46%) are exclusive to a particular language than of bigrams (21.38%) or trigrams (39.43%). This work presents a detailed analysis and comparison, on several metrics, of character n-gram and grapheme-based features: we take various popular methods that originally used only character n-gram features and rerun them with grapheme-based features in their place. Through this set of experiments and our analysis, we show the usefulness of graphemes for word-level LI.

Keywords: Language identification; Code-mixed text; Character n-gram; Grapheme; Phonetic typing; Natural language processing; Linguistics
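The contrast between character n-grams and graphemes, and the notion of a feature being exclusive to one language, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' method: the grapheme inventory `GRAPHEMES`, the greedy longest-match segmentation, and the helper names `char_ngrams`, `graphemes`, and `exclusive_share` are all hypothetical, and the paper's actual grapheme definition and datasets are not reproduced here.

```python
from collections import defaultdict

# Hypothetical grapheme inventory (an assumption for illustration only;
# the paper's actual inventory is not given in the abstract).
GRAPHEMES = ["kh", "gh", "ch", "sh", "th", "ph", "bh", "dh", "aa", "ee", "oo"]


def char_ngrams(word, n):
    """Return every contiguous character n-gram of the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def graphemes(word, inventory=GRAPHEMES):
    """Segment a word by greedy longest match against the inventory.

    Characters that do not start a listed grapheme are emitted as
    single-character units. This segmentation rule is an assumption.
    """
    known = set(inventory)
    max_len = max(len(g) for g in inventory)
    units = []
    i = 0
    while i < len(word):
        for size in range(max_len, 1, -1):
            chunk = word[i:i + size]
            if chunk in known:
                units.append(chunk)
                i += size
                break
        else:
            # No multi-character grapheme starts here.
            units.append(word[i])
            i += 1
    return units


def exclusive_share(words_by_lang, extract):
    """Fraction of distinct features that occur in exactly one language."""
    langs_per_feature = defaultdict(set)
    for lang, words in words_by_lang.items():
        for word in words:
            for feature in extract(word):
                langs_per_feature[feature].add(lang)
    if not langs_per_feature:
        return 0.0
    exclusive = sum(1 for langs in langs_per_feature.values() if len(langs) == 1)
    return exclusive / len(langs_per_feature)
```

Given a word list per language, calling `exclusive_share(words_by_lang, graphemes)` and `exclusive_share(words_by_lang, lambda w: char_ngrams(w, 2))` gives the kind of exclusivity comparison the abstract reports, although the resulting values depend entirely on the word lists and inventory supplied.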

