Abstract
Code-mixing is the phenomenon of using more than one language in a sentence. In multilingual communities, it is a very frequently observed pattern of communication on social media platforms. The flexibility to use multiple languages in one text message might help to communicate efficiently with the target audience. However, noisy user-generated code-mixed text makes processing and understanding natural language considerably more challenging. Machine translation from a monolingual source to a target language is a well-studied research problem. Here, we demonstrate that widely popular and sophisticated translation systems such as Google Translate fail at times to translate code-mixed text effectively. To address this challenge, we present PHINC, a parallel corpus of 13,738 code-mixed Hindi-English sentences and their corresponding human translations in English. In addition, we propose a translation pipeline built on top of Google Translate. The evaluation of the proposed pipeline on PHINC demonstrates an increase in the performance of the underlying system. With minimal effort, we can extend the dataset and the proposed approach to other code-mixing language pairs.
Highlights
Code-mixing is the phenomenon of switching between two or more languages by the speaker in a single sentence of a text or speech
Proposed Pipeline + Google Translate (PPGT): In addition to Bing Translate (BT) and Google Translate (GT), we propose a simple pipeline that uses the translation capabilities of existing machine translation systems
We present a parallel corpus for the English-Hindi code-mixed machine translation task
Summary
Code-mixing is the phenomenon of switching between two or more languages by the speaker in a single sentence of a text or speech. Vyas et al. (2014) proposed various experiments to identify POS tags in 1,062 code-mixed Hindi-English Facebook posts. They collected data from the public Facebook pages of three popular celebrities: Mr Amitabh Bachchan (well-known actor), Mr Shahrukh Khan (well-known actor), and Mr Narendra Modi (current Indian Prime Minister). The proposed evaluation benchmark has six NLP tasks, i.e., language identification, POS tagging, named entity recognition, sentiment analysis, question answering, and natural language inference. These tasks have been part of recent shared tasks co-located with various NLP conferences or of the latest research works. Dhar et al. (2018) propose a machine translation augmentation pipeline to use on top of standard machine translation systems. They create a parallel corpus of 6,096 English-Hindi code-mixed sentences and their corresponding translations in English. We discuss various limitations of the corpus and the research opportunities.
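The idea of a pipeline layered on top of an existing machine translation system can be illustrated with a minimal sketch. The text above does not specify the pipeline's steps, so everything here is an assumption made for illustration: the toy lexicon `ROMAN_TO_DEVANAGARI`, the helper names `tag_tokens`, `normalize`, and `translate_code_mixed`, and the choice of transliterating romanized Hindi tokens before translation are all hypothetical. The backend translator is passed in as a plain callable, so no real translation API is invoked.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lexicon mapping romanized Hindi tokens to Devanagari.
# A real system would rely on a trained language identifier and transliterator.
ROMAN_TO_DEVANAGARI: Dict[str, str] = {
    "mujhe": "मुझे",
    "yeh": "यह",
    "bahut": "बहुत",
    "pasand": "पसंद",
    "hai": "है",
}


def tag_tokens(sentence: str) -> List[Tuple[str, str]]:
    """Label each whitespace token as 'hi' (found in the toy lexicon) or 'other'."""
    return [
        (tok, "hi" if tok.lower() in ROMAN_TO_DEVANAGARI else "other")
        for tok in sentence.split()
    ]


def normalize(sentence: str) -> str:
    """Rewrite romanized Hindi tokens in Devanagari; leave other tokens unchanged."""
    return " ".join(
        ROMAN_TO_DEVANAGARI[tok.lower()] if lang == "hi" else tok
        for tok, lang in tag_tokens(sentence)
    )


def translate_code_mixed(sentence: str, backend: Callable[[str], str]) -> str:
    """Preprocess a code-mixed sentence, then delegate to an existing MT backend."""
    return backend(normalize(sentence))


if __name__ == "__main__":
    # Stand-in backend (assumption): a real one would call an MT service.
    def echo_backend(text: str) -> str:
        return text

    print(translate_code_mixed("mujhe yeh movie bahut pasand hai", echo_backend))
```

Keeping the backend as a parameter means the same preprocessing can sit in front of any translation system, which is one way to read the "on top of" design described above.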