Abstract

Code-mixing is the phenomenon of using more than one language in a sentence. In multilingual communities, it is a very frequently observed pattern of communication on social media platforms. The flexibility to use multiple languages in one text message can help a writer communicate efficiently with the target audience. However, noisy user-generated code-mixed text makes processing and understanding natural language considerably harder. Machine translation from a monolingual source to a target language is a well-studied research problem. Here, we demonstrate that widely popular and sophisticated translation systems such as Google Translate fail at times to translate code-mixed text effectively. To address this challenge, we present a parallel corpus of 13,738 code-mixed Hindi-English sentences and their corresponding human translations in English. In addition, we propose a translation pipeline built on top of Google Translate. The evaluation of the proposed pipeline on PHINC demonstrates an increase in the performance of the underlying system. With minimal effort, the dataset and the proposed approach can be extended to other code-mixing language pairs.

Highlights

  • Code-mixing is the phenomenon of switching between two or more languages by the speaker in a single sentence of a text or speech

  • Proposed Pipeline + Google Translate (PPGT): In addition to Bing Translate (BT) and Google Translate (GT), we propose a simple pipeline that uses the translation capabilities of existing machine translation systems

  • We present a parallel corpus for the English-Hindi code-mixed machine translation task


Summary

Introduction

Code-mixing is the phenomenon of switching between two or more languages by the speaker in a single sentence of a text or speech. Vyas et al. (2014) proposed various experiments to identify the POS tags of 1,062 code-mixed Hindi-English Facebook posts. They collected data from the public Facebook pages of three popular celebrities: Mr Amitabh Bachchan (well-known actor), Mr Shahrukh Khan (well-known actor), and Mr Narendra Modi (current Indian Prime Minister). The proposed evaluation benchmark has six NLP tasks, i.e., language identification, POS tagging, named entity recognition, sentiment analysis, question answering, and natural language inference. These tasks have been part of recent shared tasks co-located with various NLP conferences or of the latest research works. Dhar et al. (2018) propose a machine translation augmentation pipeline to use on top of standard machine translation systems. They create a parallel corpus of 6,096 English-Hindi code-mixed sentences and their corresponding translations in English. We discuss various limitations of the corpus and the research opportunities.
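To make the definition of code-mixing concrete, the following is a minimal Python sketch of token-level language tagging, the first of the tasks listed above, applied to a romanized Hindi-English sentence. The word lists, the function names `tag_tokens` and `is_code_mixed`, and the example sentences are all hypothetical and chosen only for illustration; they are not taken from the corpus or from the authors' pipeline, and a practical system would rely on a trained language identifier rather than fixed lexicons.

```python
# Toy lexicons: hypothetical, for illustration only (not from the corpus).
HINDI_ROMANIZED = {"mujhe", "yeh", "bahut", "pasand", "hai", "nahi", "kya"}
ENGLISH = {"i", "this", "movie", "is", "very", "good", "like", "the"}


def tag_tokens(sentence):
    """Label each whitespace-separated token as 'hi', 'en', or 'other'."""
    tags = []
    for token in sentence.lower().split():
        word = token.strip(".,!?")  # drop simple surrounding punctuation
        if word in HINDI_ROMANIZED:
            tags.append((word, "hi"))
        elif word in ENGLISH:
            tags.append((word, "en"))
        else:
            tags.append((word, "other"))
    return tags


def is_code_mixed(sentence):
    """A sentence counts as code-mixed here if it has both 'hi' and 'en' tokens."""
    langs = {lang for _, lang in tag_tokens(sentence)}
    return {"hi", "en"} <= langs


if __name__ == "__main__":
    for text in ("mujhe yeh movie bahut pasand hai", "this movie is very good"):
        print(text, "->", is_code_mixed(text))
```

Under these toy lexicons, the first sentence mixes romanized Hindi with the English word "movie" and is flagged as code-mixed, while the second is purely English and is not.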

Code-Mixing and Challenges in Machine Translation
Dataset
Annotation
Exploratory Analysis
Degree of Code-mixing
Message Length
Frequent words
Evaluation of Machine Translation Systems
Limitations and Opportunities
Conclusion and Future Work
