Abstract

Mixed-script identification is an obstacle for automated natural language processing systems. Mixing cursive scripts of different languages is challenging because NLP methods such as POS tagging and word sense disambiguation perform poorly on noisy text. This study tackles mixed-script identification for a code-mixed dataset consisting of Roman Urdu, Hindi, Saraiki, Bengali, and English. The language identification model is trained using word vectorization and RNN variants. Through experimental investigation, architectures based on Long Short-Term Memory (LSTM), Bidirectional LSTM, Gated Recurrent Unit (GRU), and Bidirectional Gated Recurrent Unit (Bi-GRU) are optimized for the task. The highest accuracy, 90.17, was achieved by Bi-GRU using learned word class features together with GloVe embeddings. The study also addresses issues that arise in multilingual environments, such as Roman words merged with English characters, generative spellings, and phonetic typing.
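The data flow the abstract describes (word vectors fed through a bidirectional GRU, then a per-word language decision) can be sketched as follows. This is a minimal, untrained illustration and not the authors' implementation: word-level labelling is assumed, the embeddings are random stand-ins for GloVe vectors, biases are omitted, and the layer sizes, the function name `tag_sentence`, and the sample sentence are all hypothetical. Because the weights are random, the predicted labels carry no meaning; only the shapes and the forward/backward combination are of interest.

```python
import math
import random

# Label set taken from the abstract; everything else below is an assumption.
LABELS = ["Roman Urdu", "Hindi", "Saraiki", "Bengali", "English"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


class GRUCell:
    """Single GRU cell without biases (omitted to keep the sketch short)."""

    def __init__(self, rng, input_size, hidden_size):
        self.hidden_size = hidden_size
        # One input matrix W and one recurrent matrix U per gate:
        # z = update gate, r = reset gate, n = candidate state.
        self.W = {g: rand_matrix(rng, hidden_size, input_size) for g in "zrn"}
        self.U = {g: rand_matrix(rng, hidden_size, hidden_size) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], reset_h))]
        # Interpolate between the previous state and the candidate state.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def run(cell, inputs):
    """Run the cell over a sequence and return the hidden state at each step."""
    h = [0.0] * cell.hidden_size
    states = []
    for x in inputs:
        h = cell.step(x, h)
        states.append(h)
    return states


def tag_sentence(words, embed_dim=8, hidden=6, seed=0):
    rng = random.Random(seed)
    # Random stand-in embeddings; a real system would look up pretrained vectors.
    vocab = {w: [rng.uniform(-1, 1) for _ in range(embed_dim)] for w in dict.fromkeys(words)}
    xs = [vocab[w] for w in words]
    fwd = GRUCell(rng, embed_dim, hidden)
    bwd = GRUCell(rng, embed_dim, hidden)
    out = rand_matrix(rng, len(LABELS), 2 * hidden)
    f_states = run(fwd, xs)
    # The backward pass reads the sentence right to left; re-reverse to align with words.
    b_states = run(bwd, xs[::-1])[::-1]
    result = []
    for w, f, b in zip(words, f_states, b_states):
        probs = softmax(matvec(out, f + b))  # concatenate both directions
        result.append((w, LABELS[probs.index(max(probs))], probs))
    return result


# Hypothetical code-mixed sentence, used only to exercise the forward pass.
for word, label, _ in tag_sentence("mujhe yeh movie bohat pasand hai".split()):
    print(word, "->", label)
```

Concatenating the forward and backward states gives each word a representation informed by both its left and right context, which is the property that distinguishes the bidirectional variants from plain LSTM and GRU.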

Highlights

  • Code-mixing is defined as “the embedding of linguistic components such as phrases, words, and lexemes from one language into an expression from another language.” It refers to the use of linguistic units (words, phrases, clauses) from different languages at the sentence level

  • Because fused lects are completely grammaticalized, they allow less variation in semantics and pragmatics than a mixed language. The grammar of the fused lect determines which source-language parts may be included in the fusion

  • All the datasets are divided into a training set and a testing set for training the Long Short-Term Memory (LSTM) network

Summary

Introduction

Code-mixing is defined as “the embedding of linguistic components such as phrases, words, and lexemes from one language into an expression from another language.” It refers to the use of linguistic units (words, phrases, clauses) from different languages at the sentence level. Rather than switching codes at semantically or sociolinguistically significant points, this kind of code-mixing carries no particular value in the immediate context. Because fused lects are completely grammaticalized, they allow less variation in semantics and pragmatics than a mixed language. The grammar of the fused lect determines which source-language parts may be included in the fusion. Code-mixing is commonly observed in informal settings such as social media. Social media users often utilize mixed scripts of Roman text
